Tag Archives: artificial intelligence

How Cynamics built a high-scale, near-real-time, streaming AI inference system using AWS

Post Syndicated from Aviv Yehezkel original https://aws.amazon.com/blogs/big-data/how-cynamics-built-a-high-scale-near-real-time-streaming-ai-inference-system-using-aws/

This post is co-authored by Dr. Yehezkel Aviv, Co-Founder and CTO of Cynamics and Sapir Kraus, Head of Engineering at Cynamics.

Cynamics provides a new paradigm of cybersecurity — predicting attacks long before they hit by collecting small network samples (less than 1%), inferring from them how the full network (100%) behaves, and predicting threats using unique AI breakthroughs. The sample approach allows Cynamics to be generic, agnostic, and work for any client’s network architecture, no matter how messy the mix between legacy, private, and public clouds. Furthermore, the solution is scalable and provides full cover to the client’s network, no matter how large it is in volume and size. Moreover, because any network gateway (physical or virtual, legacy or cloud) supports one of the standard sampling protocols and APIs, Cynamics doesn’t require any installation of appliances nor agents, as well as no network changes and modifications, and the onboarding usually takes less than an hour.

In the crowded cybersecurity market, Cynamics is the first-ever solution based on small network samples, which has been considered a hard and unsolved challenge in academia (our academic paper “Network anomaly detection using transfer learning based on auto-encoders loss normalization” was recently presented in ACM CCS AISec 2021) and industry to this day.

The problem Cynamics faced

Early in the process, with the growth of our customer base, we were required to seamlessly support the increased scale and network throughput by our unique AI algorithms. We faced a few different challenges:

  • How can we perform near-real-time analysis on our streaming clients’ incoming data into our AI inference system to predict threats and attacks?
  • How can we seamlessly auto scale our solution to be cost-efficient with no impact on the platform ingestion rate?
  • Because many of our customers are from the public sector, how can we do this while supporting both AWS commercial and government environments (GovCloud)?

This post shows how we used AWS managed services and in particular Amazon Kinesis Data Streams and Amazon EMR to build a near-real-time streaming AI inference system serving hundreds of production customers in both AWS commercial and government environments, while seamlessly auto scaling.

Overview of solution

The following diagram illustrates our solution architecture:

To provide a cost-efficient, highly available solution that scales easily with user growth, while having no impact on near-real-time performance, we turned to Amazon EMR.

We currently process over 50 million records per day, which translates to just over 5 billion flows, and keeps growing on a daily basis. Using Amazon EMR along with Kinesis Data Streams provided the scalability we needed to achieve inference times of just a few seconds.

Although this technology was new to us, we minimized our learning curve by turning to the available guides from AWS for best practices on scale, partitioning, and resource management.

Workflow

Our workflow contains the following steps:

  1. Flow samples are sent by the client’s network devices directly to the Cynamics cloud. A network flow (or connection) is a set of packets with the same five-tuple ID: source-IP-address, destination-IP-address, source-port, destination-port, and protocol.
  2. The samples are analyzed by Network Load Balancers, which forward them into an auto scaling group of stateless flow transformers running on Graviton-powered Amazon Elastic Compute Cloud (Amazon EC2) instances. With Graviton-based processors in the flow transformers, we reduced our operational costs by over 30%.
  3. The flows are transformed to the Cynamics data format and enriched with additional information from Cynamics’ databases and in-house sources such as IP resolutions, intelligence, and reputation.

The following figures show the network scale for a single flow transformer machine over a week. The first figure illustrates incoming network packets for a single flow transformer machine.

The following shows outcoming network packets for a single flow transformer machine.

The following shows incoming network bytes for a single flow transformer machine.

The following shows outcoming network bytes for a single flow transformer machine.

  1. The flows are sent using Kinesis Data Streams to the real-time analysis engine.
  2. The Amazon EMR-based real-time engine consumes records in a few seconds batches using Yarn/Spark. The sampling rate of each client is dynamically tuned according to its throughput to ensure a fixed incoming data rate for all clients. We achieved this using Amazon EMR Managed Scaling with a custom policy (available with Amazon EMR versions 5.30.1 and later), which allows us to scale EMR nodes in or out based on Amazon CloudWatch metrics, with two different rules for scale-out and scale-in. The metric we created is based on the Amazon EMR running time, because our real-time AI threat detection runs on a sliding window interval of a few seconds.
    1. The scale-out policy tracks the average running time over a period of 10 minutes, and scales the EMR nodes if it’s longer than 95% of the required interval. This allows us to prevent processing delays.
    2. Similarly, the scale-in policy uses the same metric but measures the average over a 30-minute period, and scales the cluster accordingly. This enables us to optimize cluster costs and reduce the number of EMR nodes in off-hours.
  3. To optimize and seamlessly scale our AI inference calls, these were made available through an ALB and another auto scaling group of servers (AI model-service).
  4. We use Amazon DynamoDB as a fast and highly available states table.

The following figure shows the number of records processed by the Kinesis data stream over a single day.

The following shows the Kinesis data streams records rate per minute.

AI predictions and threat detections are sent to continued processing and alerting, and are saved in Amazon DocumentDB (with MongoDB compatibility).

Conclusion

With the approach described in this post, Cynamics has been providing threat prediction based on near-real-time analysis of its unique AI algorithms for a constantly growing customer base in a seamless and automatically scalable way. Since first implementing the solution, we’ve managed to easily and linearly scale our architecture, and were able to further optimize our costs by transitioning to Graviton-based processors in the flow transformers, which reduced over 30% of our flow transformers costs.

We’re considering the following next steps:

  • An automatic machine learning lifecycle using an Amazon SageMaker Studio pipeline, which includes the following steps:
  • Additional cost reduction by moving the EMR instances to be Graviton-based as well, which should yield an additional 20% reduction.

About the Authors

Dr. Yehezkel Aviv is the co-founder and CTO of Cynamics, leading the company innovation and technology. Aviv holds a PhD in Computer Science from the Technion, specializing in cybersecurity, AI, and ML.

Sapir Kraus is Head of Engineering at Cynamics, where his core focus is managing the software development lifecycle. His responsibilities also include software architecture and providing technical guidance to team members. Outside of work, he enjoys roasting coffee and barbecuing.

Omer Haim is a Startup Solutions Architect at Amazon Web Services. He helps startups with their cloud journey, and is passionate about containers and ML. In his spare time, Omer likes to travel, and occasionally game with his son.

Linking AI education to meaningful projects

Post Syndicated from Sue Sentance original https://www.raspberrypi.org/blog/ai-education-meaningful-projects-tara-chklovski/

Our seminars in this series on AI and data science education, co-hosted with The Alan Turing Institute, have been covering a range of different topics and perspectives. This month was no exception. We were delighted to be able to host Tara Chklovski, CEO of Technovation, whose presentation was called ‘Teaching youth to use AI to tackle the Sustainable Development Goals’.

Tara Chklovski.
Tara Chklovski

The Technovation Challenge

Tara started Technovation, formerly called Iridescent, in 2007 with a family science programme in one school in Los Angeles. The nonprofit has grown hugely, and Technovation now runs computing education activities across the world. We heard from Tara that over 350,000 girls from more than 100 countries take part in their programmes, and that the nonprofit focuses particularly on empowering girls to become tech entrepreneurs. The girls, with support from industry volunteers, parents, and the Technovation curriculum, work in teams to solve real-world problems through an annual event called the Technovation Challenge. Working at scale with young people has given the Technovation team the opportunity to investigate the impact of their programmes as well as more generally learn what works in computing education. 

Tara Chklovski describes the Technovation Challenge in an online seminar.
Click to enlarge

Tara’s talk was extremely engaging (you’ll find the recording below), with videos of young people who had participated in recent years. Technovation works with volunteers and organisations to reach young people in communities where opportunities may be lacking, focussing on low- and middle-income countries. Tara spoke about the 900 million teenage girls in the world, a  substantial number of whom live in countries where there is considerable inequality. 

To illustrate the impact of the programme, Tara gave a number of examples of projects that students had developed, including:

  • An air quality sensor linked to messaging about climate change
  • A support circle for girls living in domestic violence situation
  • A project helping mothers communicate with their daughters
  • Support for water collection in Kenya

Early on, the Technovation Challenge had involved the creation of mobile apps, but in recent years, the projects have focused on using AI technologies to solve problems. An key message that Tara wanted to get across was that the focus on real-world problems and teamwork was as important, if not more, than the technical skills the young people were developing.

Technovation has designed an online curriculum to support teams, who may have no prior computing experience, to learn how to design an AI project. Students work through units on topics such as data analysis and building datasets. As well as the technical activities, young people also work through activities on problem-solving approaches, design, and system thinking to help them tackle a real-world problem that is relevant to them. The curriculum supports teams to identify problems in their community and find a path to prototype and share an invention to tackle that problem.

Tara Chklovski describes the Technovation Challenge in an online seminar.
Click to enlarge

While working through the curriculum, teams develop AI models to address the problem that they have chosen. They then submit them to a global competition for beginners, juniors, and seniors. Many of the girls enjoy the Technovation Challenge so much that they come back year on year to further develop their team skills. 

AI Families: Children and parents using AI to solve problems

Technovation runs another programme, AI Families, that focuses on families working together to learn AI concepts and skills and use them to develop projects together. Families worked together with the help of educators to identify meaningful problems in their communities, and developed AI prototypes to address them.

A list of lessons in the AI Families programme from Technovation.

There were 20,000 participants from under-resourced communities in 17 countries through 2018 and 2019. 70% of them were women (mothers and grandmothers) who wanted their children to participate; in this way the programme encouraged parents to be role models for their daughters, as well as enabling families to understand that AI is a tool that could be used to think about what problems in their community can be solved with the help of AI skills and principles. Tara was keen to emphasise that, given the importance of AI in the world, the more people know about it, the more impact they can make on their local communities.

Tara shared links to the curriculum to demonstrate what families in this programme would learn week by week. The AI modules use tools such as Machine Learning for Kids.

The results of the AI Families project as investigated over 2018 and 2019 are reported in this paper.  The findings of the programme included:

  • Learning needs to focus on more than just content; interviews showed that the learners needed to see the application to real-world applications
  • Engaging parents and other family members can support retention and a sense of community, and support a culture of lifelong learning
  • It takes around 3 to 5 years to iteratively develop fun, engaging, effective curriculum, training, and scalable programme delivery methods. This level of patience and commitment is needed from all community and industry partners and funders.

The research describes how the programme worked pre-pandemic. Tara highlighted that although the pandemic has prevented so much face-to-face team work, it has allowed some young people to access education online that they would not have otherwise had access to.

Many perspectives on AI education

Our goal is to listen to a variety of perspectives through this seminar series, and I felt that Tara really offered something fresh and engaging to our seminar audience, many of them (many of you!) regular attendees who we’ve got to know since we’ve been running the seminars. The seminar combined real-life stories with videos, as well as links to the curriculum used by Technovation to support learners of AI. The ‘question and answer’ session after the seminar focused on ways in which people could engage with the programme. On Twitter, one of the seminar participants declared this seminar “my favourite thus far in the series”.  It was indeed very inspirational.

As we near the end of this series, we can start to reflect on what we’ve been learning from all the various speakers, and I intend to do this more formally in a month or two as we prepare Volume 3 of our seminar proceedings. While Tara’s emphasis is on motivating children to want to learn the latest technologies because they can see what they can achieve with them, some of our other speakers have considered the actual concepts we should be teaching, whether we have to change our approach to teaching computer science if we include AI, and how we should engage young learners in the ethics of AI.

Join us for our next seminar

I’m really looking forward to our final seminar in the series, with Stefania Druga, on Tuesday 1 March at 17:00–18:30 GMT. Stefania, PhD candidate at the University of Washington Information School, will also focus on families. In her talk ‘Democratising AI education with and for families’, she will consider the ways that children engage with smart, AI-enabled devices that they are becoming part of their everyday lives. It’s a perfect way to finish this series, and we hope you’ll join us.

Thanks to our seminars series, we are developing a list of AI education resources that seminar speakers and attendees share with us, plus the free resources we are developing at the Foundation. Please do take a look.

You can find all blog posts relating to our previous seminars on this page.

The post Linking AI education to meaningful projects appeared first on Raspberry Pi.

New for Amazon CodeGuru Reviewer – Detector Library and Security Detectors for Log-Injection Flaws

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-for-amazon-codeguru-reviewer-detector-library-and-security-detectors-for-log-injection-flaws/

Amazon CodeGuru Reviewer is a developer tool that detects security vulnerabilities in your code and provides intelligent recommendations to improve code quality. For example, CodeGuru Reviewer introduced Security Detectors for Java and Python code to identify security risks from the top ten Open Web Application Security Project (OWASP) categories and follow security best practices for AWS APIs and common crypto libraries. At re:Invent, CodeGuru Reviewer introduced a secrets detector to identify hardcoded secrets and suggest remediation steps to secure your secrets with AWS Secrets Manager. These capabilities help you find and remediate security issues before you deploy.

Today, I am happy to share two new features of CodeGuru Reviewer:

  • A new Detector Library describes in detail the detectors that CodeGuru Reviewer uses when looking for possible defects and includes code samples for both Java and Python.
  • New security detectors have been introduced for detecting log-injection flaws in Java and Python code, similar to what happened with the recent Apache Log4j vulnerability we described in this blog post.

Let’s see these new features in more detail.

Using the Detector Library
To help you understand more clearly which detectors CodeGuru Reviewer uses to review your code, we are now sharing a Detector Library where you can find detailed information and code samples.

These detectors help you build secure and efficient applications on AWS. In the Detector Library, you can find detailed information about CodeGuru Reviewer’s security and code quality detectors, including descriptions, their severity and potential impact on your application, and additional information that helps you mitigate risks.

Note that each detector looks for a wide range of code defects. We include one noncompliant and compliant code example for each detector. However, CodeGuru uses machine learning and automated reasoning to identify possible issues. For this reason, each detector can find a range of defects in addition to the explicit code example shown on the detector’s description page.

Let’s have a look at a few detectors. One detector is looking for insecure cross-origin resource sharing (CORS) policies that are too permissive and may lead to loading content from untrusted or malicious sources.

Detector Library screenshot.

Another detector checks for improper input validation that can enable attacks and lead to unwanted behavior.

Detector Library screenshot.

Specific detectors help you use the AWS SDK for Java and the AWS SDK for Python (Boto3) in your applications. For example, there are detectors that can detect hardcoded credentials, such as passwords and access keys, or inefficient polling of AWS resources.

New Detectors for Log-Injection Flaws
Following the recent Apache Log4j vulnerability, we introduced in CodeGuru Reviewer new detectors that check if you’re logging anything that is not sanitized and possibly executable. These detectors cover the issue described in CWE-117: Improper Output Neutralization for Logs.

These detectors work with Java and Python code and, for Java, are not limited to the Log4j library. They don’t work by looking at the version of the libraries you use, but check what you are actually logging. In this way, they can protect you if similar bugs happen in the future.

Detector Library screenshot.

Following these detectors, user-provided inputs must be sanitized before they are logged. This avoids having an attacker be able to use this input to break the integrity of your logs, forge log entries, or bypass log monitors.

Availability and Pricing
These new features are available today in all AWS Regions where Amazon CodeGuru is offered. For more information, see the AWS Regional Services List.

The Detector Library is free to browse as part of the documentation. For the new detectors looking for log-injection flaws, standard pricing applies. See the CodeGuru pricing page for more information.

Start using Amazon CodeGuru Reviewer today to improve the security of your code.

Danilo

Optimize AI/ML workloads for sustainability: Part 1, identify business goals, validate ML use, and process data

Post Syndicated from Benoit de Chateauvieux original https://aws.amazon.com/blogs/architecture/optimize-ai-ml-workloads-for-sustainability-part-1-identify-business-goals-validate-ml-use-and-process-data/

Training artificial intelligence (AI) services and machine learning (ML) workloads uses a lot of energy—and they are becoming bigger and more complex. As an example, the Carbontracker: Tracking and Predicting the Carbon Footprint of Training Deep Learning Models study estimates that a single training session for a language model like GPT-3 can have a carbon footprint similar to traveling 703,808 kilometers by car.

Although ML uses a lot of energy, it is also one of the best tools we have to fight the effects of climate change. For example, we’ve used ML to help deliver food and pharmaceuticals safely and with much less waste, reduce the cost and risk involved in maintaining wind farms, restore at-risk ecosystems, and predict and understand extreme weather.

In this series of three blog posts, we’ll provide guidance from the Sustainability Pillar of the AWS Well-Architected Framework to reduce the carbon footprint of your AI/ML workloads.

This first post follows the first three phases provided in the Well-Architected machine learning lifecycle (Figure 1):

  • Business goal identification
  • ML problem framing
  • Data processing (data collection, data preprocessing, feature engineering)

You’ll learn best practices for each phase to help you review and refine your workloads to maximize utilization and minimize waste and the total resources deployed and powered to support your workload.

ML lifecycle

Figure 1. ML lifecycle

Business goal identification

Define the overall environmental impact or benefit

Measure your workload’s impact and its contribution to the overall sustainability goals of the organization. Questions you should ask:

  • How does this workload support our overall sustainability mission?
  • How much data will we have to store and process? What is the impact of training the model? How often will we have to re-train?
  • What are the impacts resulting from customer use of this workload?
  • What will be the productive output compared with this total impact?

Asking these questions will help you establish specific sustainability objectives and success criteria to measure against in the future.

ML problem framing

Identify if ML is the right solution

Always ask if AI/ML is right for your workload. There is no need to use computationally intensive AI when a simpler, more sustainable approach might succeed just as well.

For example, using ML to route Internet of Things (IoT) messages may be unwarranted; you can express the logic with a Rules Engine.

Consider AI services and pre-trained models 

Once you decide if AI/ML is the right tool, consider whether the workload needs to be developed as a custom model.

Many workloads can use the managed AWS AI services shown in Figure 2. Using these services means that you won’t need the associated resources to collect/store/process data and to prepare/train/tune/deploy an ML model.

Managed AWS AI services

Figure 2. Managed AWS AI services

If adopting a fully managed AI service is not appropriate, evaluate if you can use pre-existing datasets, algorithms, or models. AWS Marketplace offers over 1,400 ML-related assets that customers can subscribe to. You can also fine-tune an existing model starting from a pre-trained model, like those available on Hugging Face. Using pre-trained models from third parties can reduce the resources you need for data preparation and model training.

Select sustainable Regions

Select an AWS Region with sustainable energy sources. When regulations and legal aspects allow, choose Regions near Amazon renewable energy projects and Regions where the grid has low published carbon intensity to host your data and workloads.

Data processing (data collection, data preprocessing, feature engineering)

Avoid datasets and processing duplication

Evaluate if you can avoid data processing by using existing publicly available datasets like AWS Data Exchange and Open Data on AWS (which includes the Amazon Sustainability Data Initiative). They offer weather and climate datasets, satellite imagery, air quality or energy data, among others. When you use these curated datasets, it avoids duplicating the compute and storage resources needed to download the data from the providers, store it in the cloud, organize, and clean it.

For internal data, you can also reduce duplication and rerun of feature engineering code across teams and projects by using a feature storage, such as Amazon SageMaker Feature Store.

Once your data is ready for training, use pipe input mode to stream it from Amazon Simple Storage Service (Amazon S3) instead of copying it to Amazon Elastic Block Store (Amazon EBS). This way, you can reduce the size of your EBS volumes.

Minimize idle resources with serverless data pipelines

Adopt a serverless architecture for your data pipeline so it only provisions resources when work needs to be done. For example, when you use AWS Glue and AWS Step Functions for data ingestion and preprocessing, you are not maintaining compute infrastructure 24/7. As shown in Figure 3, Step Functions can orchestrate AWS Glue jobs to create event-based serverless ETL/ELT pipelines.

Orchestrating data preparation with AWS Glue and Step Functions

Figure 3. Orchestrating data preparation with AWS Glue and Step Functions

Implement data lifecycle policies aligned with your sustainability goals

Classify data to understand its significance to your workload and your business outcomes. Use this information to determine when you can move data to more energy-efficient storage or safely delete it.

Manage the lifecycle of all your data and automatically enforce deletion timelines to minimize the total storage requirements of your workload using Amazon S3 Lifecycle policies. The Amazon S3 Intelligent-Tiering storage class will automatically move your data to the most sustainable access tier when access patterns change.

Define data retention periods that support your sustainability goals while meeting your business requirements, not exceeding them.

Adopt sustainable storage options

Use the appropriate storage tier to reduce the carbon impact of your workload. On Amazon S3, for example, you can use energy-efficient, archival-class storage for infrequently accessed data, as shown in Figure 4. And if you can easily recreate an infrequently accessed dataset, use the Amazon S3 One Zone-IA class to reduce by 3x or more its carbon footprint.

Data access patterns for Amazon S3

Figure 4. Data access patterns for Amazon S3

Don’t over-provision block storage for notebooks and use object storage services like Amazon S3 for common datasets.

Tip: You can check the free disk space on your SageMaker Notebooks using !df -h.

Select efficient file formats and compression algorithms 

Use efficient file formats such as Parquet or ORC to train your models. Compared to CSV, they can help you reduce your storage by up to 87%.

Migrating to a more efficient compression algorithm can also greatly contribute to your storage reduction efforts. For example, Zstandard produces 10–15% smaller files than Gzip at the same compression speed. Some SageMaker built-in algorithms accept x-recordio-protobuf input, which can be streamed directly from Amazon S3 instead of being copied to a notebook instance.

Minimize data movement across networks

Compress your data before moving it over the network.

Minimize data movement across networks when selecting a Region; store your data close to your producers and train your models close to your data.

Measure results and improve

To monitor and quantify improvements, track the following metrics:

  • Total size of your S3 buckets and storage class distribution, using Amazon S3 Storage Lens
  • DiskUtilization metric of your SageMaker processing jobs
  • StorageBytes metric of your SageMaker Studio shared storage volume

Conclusion

In this blog post, we discussed the importance of defining the overall environmental impact or benefit of your ML workload and why managed AI services or pre-trained ML models are sustainable alternatives to custom models. You also learned best practices to reduce the carbon footprint of your ML workload in the data processing phase.

In the next post, we will continue our sustainability journey through the ML lifecycle and discuss the best practices you can follow in the model development phase.

Want to learn more? Check out the Sustainability Pillar of the AWS Well-Architected Framework, the Architecting for sustainability session at re:Invent 2021, and other blog posts on architecting for sustainability.

Looking for more architecture content? AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!

Being Naughty to See Who Was Nice: Machine Learning Attacks on Santa’s List

Post Syndicated from Erick Galinkin original https://blog.rapid7.com/2022/01/14/being-naughty-to-see-who-was-nice-machine-learning-attacks-on-santas-list/

Being Naughty to See Who Was Nice: Machine Learning Attacks on Santa’s List

Editor’s note: We had planned to publish our Hacky Holidays blog series throughout December 2021 – but then Log4Shell happened, and we dropped everything to focus on this major vulnerability that impacted the entire cybersecurity community worldwide. Now that it’s 2022, we’re feeling in need of some holiday cheer, and we hope you’re still in the spirit of the season, too. Throughout January, we’ll be publishing Hacky Holidays content (with a few tweaks, of course) to give the new year a festive start. So, grab an eggnog latte, line up the carols on Spotify, and let’s pick up where we left off.

Santa’s task of making the nice and naughty list has gotten a lot harder over time. According to estimates, there are around 2.2 billion children in the world. That’s a lot of children to make a list of, much less check it twice! So like many organizations with big data problems, Santa has turned to machine learning to help him solve the issue and built a classifier using historical naughty and nice lists. This makes it easy to let the algorithm decide whether they’ll be getting the gifts they’ve asked for or a lump of coal.

Being Naughty to See Who Was Nice: Machine Learning Attacks on Santa’s List

Santa’s lists have long been a jealously guarded secret. After all, being on the naughty list can turn one into a social pariah. Thus, Santa has very carefully protected his training data — it’s locked up tight. Santa has, however, made his model’s API available to anyone who wants it. That way, a parent can check whether their child is on the nice or naughty list.

Santa, being a just and equitable person, has already asked his data elves to tackle issues of algorithmic bias. Unfortunately, these data elves have overlooked some issues in machine learning security. Specifically, the issues of membership inference and model inversion.

Membership inference attacks

Membership inference is a class of machine learning attacks that allows a naughty attacker to query a model and ask, in effect, “Was this example in your training data?” Using the techniques of Salem et al. or a tool like PrivacyRaven, an attacker can train a model that figures out whether or not a model has seen an example before.

Being Naughty to See Who Was Nice: Machine Learning Attacks on Santa’s List

From a technical perspective, we know that there is some amount of memorization in models, and so when they make their predictions, they are more likely to be confident on items that they have seen before — in some ways, “memorizing” examples that have already been seen. We can then create a dataset for our “shadow” model — a model that approximates Santa’s nice/naughty system, trained on data that we’ve collected and labeled ourselves.

We can then take the training data and label the outputs of this model with a “True” value — it was in the training dataset. Then, we can run some additional data through the model for inference and collect the outputs and label it with a “False” value — it was not in the training dataset. It doesn’t matter if these in-training and out-of-training data points are nice or naughty — just that we know if they were in the “shadow” training dataset or not. Using this “shadow” dataset, we train a simple model to answer the yes or no question: “Was this in the training data?” Then, we can turn our naughty algorithm against Santa’s model — “Dear Santa, was this in your training dataset?” This lets us take real inputs to Santa’s model and find out if the model was trained on that data — effectively letting us de-anonymize the historical nice and naughty lists!

Model inversion

Now being able to take some inputs and de-anonymize them is fun, but what if we could get the model to just tell us all its secrets? That’s where model inversion comes in! Fredrikson et al. proposed model inversion in 2015 and really opened up the realm of possibilities for extracting data from models. Model inversion seeks to take a model and, as the name implies, turn the output we can see into the training inputs. Today, extracting data from models has been done at scale by the likes of Carlini et al., who have managed to extract data from large language models like GPT-2.

Being Naughty to See Who Was Nice: Machine Learning Attacks on Santa’s List

In model inversion, we aim to extract memorized training data from the model. This is easier with generative models than with classifiers, but a classifier can be used as part of a larger model called a Generative Adversarial Network (GAN). We then sample the generator, requesting text or images from the model. Then, we use the membership inference attack mentioned above to identify outputs that are more likely to belong to the training set. We can iterate this process over and over to generate progressively more training set-like outputs. In time, this will provide us with memorized training data.

Note that model inversion is a much heavier lift than membership inference and can’t be done against all models all the time — but for models like Santa’s, where the training data is so sensitive, it’s worth considering how much we might expose! To date, model inversion has only been conducted in lab settings on models for text generation and image classification, so whether or not it could work on a binary classifier like Santa’s list remains an open question.

Mitigating model mayhem

Now, if you’re on the other side of this equation and want to help Santa secure his models, there are a few things we can do. First and foremost, we want to log, log, log! In order to carry out the attacks, the model — or a very good approximation — needs to be available to the attacker. If you see a suspicious number of queries, you can filter IP addresses or rate limit. Additionally, limiting the return values to merely “naughty” or “nice” instead of returning the probabilities can make both attacks more difficult.

For extremely sensitive applications, the use of differential privacy or optimizing with DPSGD can also make it much more difficult for attackers to carry out their attacks, but be aware that these techniques come with some accuracy loss. As a result, you may end up with some nice children on the naughty list and a naughty hacker on your nice list.

Santa making his list into a model will save him a whole lot of time, but if he’s not careful about how the model can be queried, it could also lead to some less-than-jolly times for his data. Membership inference and model inversion are two types of privacy-related attacks that models like this may be susceptible to. As a best practice, Santa should:

  • Log information about queries like:
    • IP address
    • Input value
    • Output value
    • Time
  • Consider differentially private model training
  • Limit API access
  • Limit the information returned from the model to label-only

NEVER MISS A BLOG

Get the latest stories, expertise, and news about security today.

More Hacky Holidays blogs

How can AI-based analysis help educators support students?

Post Syndicated from Henna Gorsia original https://www.raspberrypi.org/blog/ai-sytems-in-education-learner-support-research-seminar/

We are hosting a series of free research seminars about how to teach artificial intelligence (AI) and data science to young people, in partnership with The Alan Turing Institute.

In the fifth seminar of this series, we heard from Rose Luckin, Professor of Learner Centred Design at the University College London (UCL) Knowledge Lab. Rose is Founder of EDUCATE Ventures Research Ltd., a London consultancy service working with start-ups, researchers, and educators to develop evidence-based educational technology.

Rose Luckin.
Rose Luckin, UCL

Based on her experience at EDUCATE, Rose spoke about how AI-based analysis could help educators gain a deeper understanding of their students, and how educators could work with AI systems to provide better learning resources to their students. This provided us with a different angle to the first four seminars in our current series, where we’ve been thinking about how young people learn to understand AI systems.

Rose Luckin's definition of AI: technology capable of actions and behaviours "requiring intelligence when done by humans".
Rose’s definition of artificial intelligence for this presentation.

Education and AI systems

AI systems have the potential to impact education in a number of different ways, which Rose distilled into three areas: 

  1. Using AI in education to tackle some of the big educational challenges
  2. Educating teachers about AI so that they can use it safely and effectively 
  3. Changing education so that we focus on human intelligence and prepare people for an AI world

It is clear that the three areas are interconnected, meaning developments in one area will affect the others. Rose’s focus during the seminar was the second area: educating people about AI.

Rose Luckin's definition of the three intersections of education and artificial intelligence, see text in list above.

What can AI systems do in education? 

Through giving examples of existing AI-based systems used for education, Rose described what in particular it is about AI systems that can be useful in an education setting. The first point she raised was that AI systems can adapt based on learning from data. Her main example was the AI-based platform ENSKILLS, which detects the user’s level of competency with spoken English through the user’s interactions with a virtual character, and gradually adapts the character to the user’s level. Other examples of adaptive AI systems for education include Carnegie Learning and Century Intelligent Learning.

We know that AI systems can respond to different forms of data. Rose introduced the example of OyaLabs to demonstrate how AI systems can gather and process real-time sensory data. This is an app that parents can use in a young child’s room to monitor the child’s interactions with others. The app analyses the data it gathers and produces advice for parents on how they can support their child’s language development.

AI system creators can also combine adaptivity and real-time sensory data processing  in their systems. One example Rosa gave of this was SimSensei from the University of Southern California. This is a simulated coach, which a student can interact with and which gathers real-time data about how the student is speaking, including their tone, speed of speech, and facial expressions. The system adapts its coaching advice based on these interactions and on what it learns from interactions with other students.

Getting ready for AI systems in education

For the remainder of her presentation, Rose focused on the framework she is involved in developing, as part of the EDUCATE service, to support organisations to prepare for implementing AI systems, including educators within these organisations. The aim of this ETHICAI framework is to enable organisations and educators to understand:

  • What AI systems are capable of doing
  • The strengths and weaknesses of AI systems
  • How data is used by AI systems to learn
The EDUCATE consultancy service's seven-part AI readiness framework, see test below for list.

Rose described the seven steps of the framework as:

  1. Educate, enthuse, excite – about building an AI mindset within your community 
  2. Tailor and Hone – the particular challenges you want to focus on
  3. Identify – identify (wisely), collate and …
  4. Collect – new data relevant to your focus
  5. Apply – AI techniques to the relevant data you have brought together
  6. Learn – understand what the data is telling you about your focus and return to step 5 until you are AI ready
  7. Iterate

She then went on to demonstrate how the framework is applied using the example of online teaching. Online teaching has been a key part of education throughout the coronavirus pandemic; AI systems could be used to analyse datasets generated during online teaching sessions, in order to make decisions for and recommendations to educators.

The first step of the ETHICAI framework is educate, enthuse, excite. In Rose’s example, this step consisted of choosing online teaching as a scenario, because it is very pertinent to a teacher’s practice. The second step is to tailor and hone in on particular challenges that are to be the focus, capitalising on what AI systems can do. In Rose’s example, the challenge is assessing the quality of online lessons in a way that would be useful to educators. The third step of the framework is to identify what data is required to perform this quality assessment.

Examples of data to be fed into an AI system for education, see text.

The fourth step is the collection of new data relevant to the focus of the project. The aim is to gain an increased understanding of what happens in online learning across thousands of schools. Walking through the online learning example, Rose suggested we might be able to collect the following types of data:

  • Log data
  • Audio data
  • Performance data
  • Video data, which includes eye-movement data
  • Historical data from tests and interviews
  • Behavioural data from surveying teachers and parents about how they felt about online learning

It is important to consider the ethical implications of gathering all this data about students, something that was a recurrent theme in both Rose’s presentation and the Q&A at the end.

Step five of the ETHICAI framework focuses on applying AI techniques to the relevant data to combine and process it. The figure below shows that in preparation, the various data sets need to be collated, cleaned, organised, and transformed.

Presentation slide showing that data for an AI system needs to be collated, cleaned, organised, and transformed.

From the correctly prepared data, interaction profiles can be produced in order to put characteristics from different lessons into groups/profiles. Rose described how cluster analysis using a combination of both AI and human intelligence could be used to sort lessons into groups based on common features.

The sixth step in Rose’s example focused on what may be learned from analysing collected data linked to the particular challenge of online teaching and learning. Rose said that applying an AI system to students’ behavioural data could, for example, give indications about students’ focus and confidence, and make or recommend interventions to educators accordingly.

Presentation slide showing example graphs of results produced by an AI system in education.

Where might we take applications of AI systems in education in the future?

Rose described that AI systems can possess some types of intelligence humans have or can develop: interdisciplinary academic intelligence, meta-knowing intelligence, and potentially social intelligence. However, there are types such as meta-contextual intelligence and perceived self-efficacy that AI systems are not able to demonstrate in the way humans can.

The seven types of human intelligence as defined by Rose Luckin: interdisciplinary academic knowledge, meta-knowing intelligence, social intelligence, metacognitive intelligence, meta-subjective intelligence, meta-contextual knowledge, perceived self-efficacy.

The use of AI systems in education can cause ethical issues. As an example, Rose pointed out the use of virtual glasses to identify when students need help, even if they do not realise it themselves. A system like this could help educators with assessing who in their class needs more help, and could link this back to student performance. However, using such a system like this has obvious ethical implications, and some of these were the focus of the Q&A that followed Rose’s presentation.

It’s clear that, in the education domain as in all other domains, both positive and negative outcomes of integrating AI are possible. In a recent paper written by Wayne Holmes (also from the UCL Knowledge Lab) and co-authors, ‘Ethics of AI in Education: Towards a Community Wide Framework’ [1], the authors suggest that the interpretation of data, consent and privacy, data management, surveillance, and power relations are all ethical issues that should be taken into consideration. Finding consensus for a practical ethical framework or set of principles, with all stakeholders, at the very start of an AI-related project is the only way to ensure ethics are built into the project and the AI system itself from the ground up.

Two boys at laptops in a classroom.

Ethical issues of AI systems more broadly, and how to involve young people in discussions of AI ethics, were the focus of our seminar with Dr Mhairi Aitken back in September. You can revisit the seminar recording, presentation slides, and summary blog post.

I really enjoyed both the focus and content of Rose’s talk: educators understanding how AI systems may be applied to education in order to help them make more informed decisions about how to best support their students. This is an important factor to consider in the context of the bigger picture of what young people should be learning about AI. The work that Rose and her colleagues are doing also makes an important contribution to translating research into practical models that teachers can use.

Join our next free seminars

You may still have time to sign up for our Tuesday 11 January seminar, today at 17:00–18:30 GMT, where we will welcome Dave Touretzky and Fred Martin, founders of the influential AI4K12 framework, which identifies the five big ideas of AI and how they can be integrated into education.

Next month, on 1 February at 17:00–18:30 GMT, Tara Chklovski (CEO of Technovation) will give a presentation called Teaching youth to use AI to tackle the Sustainable Development Goals at our seminar series.

If you want to join any of our seminars, click the button below to sign up and we will send you information on how to join. We look forward to seeing you there!

You’ll always find our schedule of upcoming seminars on this page. For previous seminars, you can visit our past seminars and recordings page.

The post How can AI-based analysis help educators support students? appeared first on Raspberry Pi.

Snapshots from the history of AI, plus AI education resources

Post Syndicated from Janina Ander original https://www.raspberrypi.org/blog/machine-learning-education-snapshots-history-ai-hello-world-12/

In Hello World issue 12, our free magazine for computing educators, George Boukeas, DevOps Engineer for the Astro Pi Challenge here at the Foundation, introduces big moments in the history of artificial intelligence (AI) to share with your learners:

The story of artificial intelligence (AI) is a story about humans trying to understand what makes them human. Some of the episodes in this story are fascinating. These could help your learners catch a glimpse of what this field is about and, with luck, compel them to investigate further.                   

The imitation game

In 1950, Alan Turing published a philosophical essay titled Computing Machinery and Intelligence, which started with the words: “I propose to consider the question: Can machines think?” Yet Turing did not attempt to define what it means to think. Instead, he suggested a game as a proxy for answering the question: the imitation game. In modern terms, you can imagine a human interrogator chatting online with another human and a machine. If the interrogator does not successfully determine which of the other two is the human and which is the machine, then the question has been answered: this is a machine that can think.

A statue of Alan Turing on a park bench in Manchester.
The Alan Turing Memorial in Manchester

This imitation game is now a fiercely debated benchmark of artificial intelligence called the Turing test. Notice the shift in focus that Turing suggests: thinking is to be identified in terms of external behaviour, not in terms of any internal processes. Humans are still the yardstick for intelligence, but there is no requirement that a machine should think the way humans do, as long as it behaves in a way that suggests some sort of thinking to humans.

In his essay, Turing also discusses learning machines. Instead of building highly complex programs that would prescribe every aspect of a machine’s behaviour, we could build simpler programs that would prescribe mechanisms for learning, and then train the machine to learn the desired behaviour. Turing’s text provides an excellent metaphor that could be used in class to describe the essence of machine learning: “Instead of trying to produce a programme to simulate the adult mind, why not rather try to produce one which simulates the child’s? If this were then subjected to an appropriate course of education one would obtain the adult brain. We have thus divided our problem into two parts: the child-programme and the education process.”

A chess board with two pieces of each colour left.
Chess was among the games that early AI researchers like Alan Turing developed algorithms for.

It is remarkable how Turing even describes approaches that have since been evolved into established machine learning methods: evolution (genetic algorithms), punishments and rewards (reinforcement learning), randomness (Monte Carlo tree search). He even forecasts the main issue with some forms of machine learning: opacity. “An important feature of a learning machine is that its teacher will often be very largely ignorant of quite what is going on inside, although he may still be able to some extent to predict his pupil’s behaviour.”

The evolution of a definition

The term ‘artificial intelligence’ was coined in 1956, at an event called the Dartmouth workshop. It was a gathering of the field’s founders, researchers who would later have a huge impact, including John McCarthy, Claude Shannon, Marvin Minsky, Herbert Simon, Allen Newell, Arthur Samuel, Ray Solomonoff, and W.S. McCulloch.   

Go has vastly more possible moves than chess, and was thought to remain out of the reach of AI for longer than it did.

The simple and ambitious definition for artificial intelligence, included in the proposal for the workshop, is illuminating: ‘making a machine behave in ways that would be called intelligent if a human were so behaving’. These pioneers were making the assumption that ‘every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it’. This assumption turned out to be patently false and led to unrealistic expectations and forecasts. Fifty years later, McCarthy himself stated that ‘it was harder than we thought’.

Modern definitions of intelligence are of distinctly different flavour than the original one: ‘Intelligence is the quality that enables an entity to function appropriately and with foresight in its environment’ (Nilsson). Some even speak of rationality, rather than intelligence: ‘doing the right thing, given what it knows’ (Russell and Norvig).

A computer screen showing a complicated graph.
The amount of training data AI developers have access to has skyrocketed in the past decade.

Read the whole of this brief history of AI in Hello World #12

In the full article, which you can read in the free PDF copy of the issue, George looks at:

  • Early advances researchers made from the 1950s onwards while developing games algorithms, e.g. for chess.
  • The 1997 moment when Deep Blue, a purpose-built IBM computer, beating chess world champion Garry Kasparov using a search approach.
  • The 2011 moment when Watson, another IBM computer system, beating two human Jeopardy! champions using multiple techniques to answer questions posed in natural language.
  • The principles behind artificial neural networks, which have been around for decades and are now underlying many AI/machine learning breakthroughs because of the growth in computing power and availability of vast datasets for training.
  • The 2017 moment when AlphaGo, an artificial neural network–based computer program by Alphabet’s DeepMind, beating Ke Jie, the world’s top-ranked Go player at the time.
Stacks of server hardware behind metal fencing in a data centre.
Machine learning systems need vast amounts of training data, the collection and storage of which has only become technically possible in the last decade.

More on machine learning and AI education in Hello World #12

In your free PDF of Hello World issue 12, you’ll also find:

  • An interview with University of Cambridge statistician David Spiegelhalter, whose work shaped some of the foundations of AI, and who shares his thoughts on data science in schools and the limits of AI 
  • An introduction to Popbots, an innovative project by MIT to open AI to the youngest learners
  • An article by Ken Kahn, researcher in the Department of Education at the University of Oxford, on using the block-based Snap! language to introduce your learners to natural language processing
  • Unplugged and online machine learning activities for learners age 7 to 16 in the regular ‘Lesson plans’ section
  • And lots of other relevant articles

You can also read many of these articles online on the Hello World website.

Find more resources for AI and data science education

In Hello World issue 16, the focus is on all things data science and data literacy for your learners. As always, you can download a free copy of the issue. And on our Hello World podcast, we chat with practicing computing educators about how they bring AI, AI ethics, machine learning, and data science to the young people they teach.

If you want a practical introduction to the basics of machine learning and how to use it, take our free online course.

Drawing of a machine learning ars rover trying to decide whether it is seeing an alien or a rock.

There are still many open questions about what good AI and data science education looks like for young people. To learn more, you can watch our panel discussion about the topic, and join our monthly seminar series to hear insights from computing education researchers around the world.

We are also collating a growing list of educational resources about these topics based on our research seminars, seminar participants’ recommendations, and our own work. Find the resource list here.

The post Snapshots from the history of AI, plus AI education resources appeared first on Raspberry Pi.

New AWS Scholarship Program Helps Underrepresented and Underserved Students Prep for Careers in AI and ML

Post Syndicated from Antje Barth original https://aws.amazon.com/blogs/aws/new-aws-scholarship-program-helps-underrepresented-and-underserved-students-prep-for-careers-in-ai-and-ml/

As a woman working in information technology (IT) for many years, it has always been close to my heart to challenge long-standing gender stereotypes and inspire more young learners to consider a career in tech. With artificial intelligence (AI) and machine learning (ML) defining the future of technology, this future also depends on diverse representation.

The World Economic Forum estimates that technological advances and automation will create 97 million new technology jobs by 2025, including in the field of AI and ML. Yet, according to their research, women make up just 32% of AI jobs globally. The Pew Research Center found that Black and Hispanic workers in the U.S. comprise just 9% and 8% of workers in the science, technology, engineering, and mathematics (STEM) careers respectively.

At Amazon, we believe that technology should be built in a way that’s inclusive, diverse, and equitable. We already offer STEM programs to enable underrepresented groups to build or advance their technical skills and open up new career possibilities. Programs such as We Power Tech, Amazon Future Engineer, AWS Girls’ Tech Day, and AWS GetIT aim to build a future of tech that is inclusive, diverse, and accessible.

Today, I am excited to announce the launch of the AWS AI & ML Scholarship program in collaboration with Intel and Udacity, designed to prepare underrepresented and underserved students globally for careers in ML.

AWS AI & ML Scholarship Program Debuts with AWS DeepRacer Student
The AWS AI & ML Scholarship program is launching as part of the all-new AWS DeepRacer Student service and Student League. This is a new student division of the popular AWS DeepRacer program, a cloud-based 3D car racing simulator that provides a fun way to learn about ML and reinforcement learning (RL). Through DeepRacer Student, you have access to free online trainings to learn the ML and RL basics. You also have access to 10 hours of model training and 5 GB of storage per month to participate in the DeepRacer Student League, a global autonomous racing competition exclusively for AWS AI & ML students.

As part of DeepRacer Student, the AWS AI & ML Scholarship program in collaboration with Intel and Udacity is geared toward underserved and underrepresented high school and college students globally who are at least 16 years old. These are students who may have faced financial barriers growing up or are part of underrepresented groups, including women, people with disabilities, LGBTQ+ persons, as well as people of color. Students in these groups who take part in the DeepRacer Student League are eligible for a chance to win one or two of 2,500 annual scholarships from Udacity, an online learning platform focused on technology skills.

How to Qualify for the AWS AI & ML Scholarship
AWS DeepRacer Student League LogoTo be considered for a scholarship, you must successfully finish all AWS DeepRacer Student learning modules and achieve a score of at least 80% on all course quizzes, reach a certain lap time performance with your DeepRacer car in the Student League, and submit an essay.

Each year, 2,000 students will win a scholarship to the AI Programming with Python Udacity Nanodegree program. Udacity Nanodegrees are massive open online courses (MOOCs) designed to bridge the gap between learning and career goals. This aims to equip students with programming and ML fundamentals to solve real world problems with ML. The top 500 participants in this first Nanodegree will be eligible to join a second customized Nanodegree program curated specifically for AWS AI & ML Scholarship program recipients.

Coaching and Mentorship Opportunities for Scholarship Recipients
Scholarship recipients not only get access to the educational content, but also receive up to 85 hours of support from Udacity instructors to make sure that students successfully learn and progress through the Nanodegree. Instructors and students meet weekly in small groups, review the content for the week, answer questions, and work on a case study as a group exercise.

The top 500 participants who advance into the second Nanodegree will receive 12 months of mentorship from tenured technology leaders at Amazon and Intel to help prepare for a tech career.

All scholarship recipients will be given exclusive access to Ask-Me-Anythings (AMAs), fireside chats, and office hours with AI/ML professionals and diversity experts from Amazon, Intel, and AWS collaborators such as Girls in Tech. These will help familiarize students with different job functions in the AI/ML field.

Enroll via AWS DeepRacer Student
To enroll in the AWS AI & ML Scholarship program, first sign up at the AWS DeepRacer Student service with a valid email address. Note that this student player account is separate from an AWS account and doesn’t require you to provide any billing or credit card information.

AWS DeepRacer Student Sign Up

You will receive a verification code via email. Once you have verified your email address, log in to the DeepRacer Student service and complete the sign-up process.

AWS DeepRacer Student Complete Sign Up

The form will ask for some additional information, including your school, major, and planned year of graduation. You must also self-certify that you are a student enrolled in either high school, university, or community college.

AWS DeepRacer Student Complete Sign Up with Personal Information

Next, enroll in the AWS AI & ML Scholarship program by selecting the corresponding checkbox. The scholarship qualification will start on March 1, 2022.

AWS DeepRacer Student Opt In To Scholarship

Welcome to AWS DeepRacer Student! Start learning the fundamentals of ML through the provided online trainings. You can also participate in the pre-season DeepRacer Student League before the qualifying races start on March 1, 2022.

AWS DeepRacer Student Dashboard

Available Now
Start preparing for the AWS AI & ML Scholarship program by signing up for AWS DeepRacer Student, available globally today, and opt in to the scholarship program. Start learning with learning modules, train your first AWS DeepRacer model, and see how it performs in AWS DeepRacer Student League Pre-season happening now. Join the AWS Slack community to connect with experts and ask questions.

Sign up for AWS DeepRacer Student and the AWS AI & ML Scholarship program today.

Antje

Now in Preview – Amazon SageMaker Studio Lab, a Free Service to Learn and Experiment with ML

Post Syndicated from Antje Barth original https://aws.amazon.com/blogs/aws/now-in-preview-amazon-sagemaker-studio-lab-a-free-service-to-learn-and-experiment-with-ml/

Our mission at AWS is to make machine learning (ML) more accessible. Through many conversations over the past years, I learned about barriers that many ML beginners face. Existing ML environments are often too complex for beginners, or too limited to support modern ML experimentation. Beginners want to quickly start learning and not worry about spinning up infrastructure, configuring services, or implementing billing alarms to avoid going over budget. This emphasizes another barrier for many people: the need to provide billing and credit card information at sign-up.

What if you could have a predictable and controlled environment for hosting your Jupyter notebooks in which you can’t accidentally run up a big bill? One that doesn’t require billing and credit card information at all at sign-up?

Today, I am extremely happy to announce the public preview of Amazon SageMaker Studio Lab, a free service that enables anyone to learn and experiment with ML without needing an AWS account, credit card, or cloud configuration knowledge.

At AWS, we believe technology has the power to solve the world’s most pressing issues. And, we proudly support the new and innovative ways that our customers are using these technologies to deliver social impacts.

This is why I am also excited to announce the launch of the AWS Disaster Response Hackathon using Amazon SageMaker Studio Lab. The hackathon, starting today and running through February 7, 2022, is a great way to start learning ML while doing good in the world. I will share more details on how to get involved at the end of the post.

Getting Started with Amazon SageMaker Studio Lab
Studio Lab is based on open-source JupyterLab and gives you free access to AWS compute resources to quickly start learning and experimenting with ML. Studio Lab is also simple to set up. In fact, the only configuration you have to do is one click to choose whether you need a CPU or GPU instance for your project. Let me show you.

The first step is to request a free Studio Lab account here.

Amazon SageMaker Studio Lab

When your account request is approved, you will receive an email with a link to the Studio Lab account registration page. You can now create your account with your approved email address and set a password and your username. This account is separate from an AWS account and doesn’t require you to provide any billing information.

Amazon SageMaker Studio Lab - Create Account

Once you have created your account and verified your email address, you can sign in to Studio Lab. Now, you can select the compute type for your project. You can choose between 12 hours of CPU or 4 hours of GPU per user session, with an unlimited number of user sessions available to you. Furthermore, you get a minimum of 15 GB of persistent storage per project. When your session expires, Studio Lab will take a snapshot of your environment. This enables you to pick up right where you left off. Let’s select CPU for this demo, and choose Start runtime.

Amazon SageMaker Studio Lab - Select Compute

Once the instance is running, select Open project to go to your free Studio Lab environment and start building. No additional configuration is required.

Amazon SageMaker Studio Lab - Open Project

Amazon SageMaker Studio Lab Environment

Customize your environment
Studio Lab comes with a Python base image to get you started. The image only has a few libraries pre-installed to save the available space for the frameworks and libraries that you actually need.

Amazon SageMaker Studio Lab - Select Kernel

You can customize the Conda environment and install additional packages using the %conda install <package> or %pip install <package> command right from within your notebook. You can also create entirely new, custom Conda environments, or install open-source JupyterLab and Jupyter Server extensions. For detailed instructions, see the Studio Lab documentation.

GitHub integration
Studio Lab is tightly integrated with GitHub and offers full support for the Git command line. This lets you easily clone, copy, and save your projects. Moreover, you can add an Open in Studio Lab badge to the README.md file or notebooks in your public GitHub repo to share your work with others.

Open in Amazon SageMaker Studio Lab Badge

This will let everyone open and view the notebook in Studio Lab. If they have a Studio Lab account, then they can also run the notebook. Add the following markdown to the top of your README.md file or notebook to add the Open in Studio Lab badge:

[![Open In Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/org/repo/blob/master/path/to/notebook.ipynb)

Replace org, repo, path and the notebook filename with those for your repo. Then, when you click the Open in Studio Lab badge, it will preview the notebook in Studio Lab. If your repo is private within a GitHub account or organization and you would like other people to use it, then you must additionally install the Amazon SageMaker GitHub App at the GitHub account or organization level.

Amazon SageMaker Studio Lab Notebook Preview

If you have a Studio Lab account, you can click Copy to project and choose to either copy just the notebook or to clone the entire repo into your Studio Lab account. Moreover, Studio Lab can check if the repository contains a Conda environment file and build the custom Conda environment for you.

Learn the Fundamentals of ML
If you are new to ML, then Studio Lab provides access to free, educational content to get you started. Dive into Deep Learning (D2L) is a free interactive book that teaches the ideas, the math, and the code behind ML and DL. The AWS Machine Learning University (MLU) gives you access to the same ML courses used to train Amazon’s own developers on ML. Hugging Face is a large open source community and a hub for pre-trained deep learning (DL) models. This is mainly aimed at natural language processing. In just a few clicks, you can import the relevant notebooks from D2L, MLU, and Hugging Face into your Studio Lab environment.

Join the AWS Disaster Response Hackathon using Amazon SageMaker Studio Lab
The frequency and severity of natural disasters are increasing. This year alone, we have seen significant wildfires across the Western United States and in countries like Greece and Turkey; major floods across Europe; and Hurricane Ida’s impact to the coast of Louisiana. In response, governments, businesses, nonprofits, and international organizations are placing more emphasis on disaster preparedness and response than ever before.

AWS Disaster Response Hackathon

Through the AWS Disaster Response Hackathon offering a total of $54,000 USD in prices, we hope to stimulate ways of applying ML to solve pressing challenges in natural disaster preparedness and response.

Join the hackathon today, start building, and don’t forget to submit your project before February 7, 2022. This hackathon is also an attempt to set the Guinness World Record for the “largest machine learning competition.”

Join the Preview
You can request a free Amazon SageMaker Studio Lab account starting today. The number of new account registrations will be limited to ensure a high quality of experience for all customers. You can find sample notebooks in the Studio Lab GitHub repository. Give it a try and let us know your feedback.

Request a free Amazon SageMaker Studio Lab account.

Antje

New – Introducing SageMaker Training Compiler

Post Syndicated from Sean M. Tracey original https://aws.amazon.com/blogs/aws/new-introducing-sagemaker-training-compiler/

An image explaining the benefits of using Amazon SageMaker Training CompilerToday, we’re pleased to announce Amazon SageMaker Training Compiler, a new Amazon SageMaker capability that can accelerate the training of deep learning (DL) models by up to 50%.

As DL models grow in complexity, so too does the time it can take to optimize and train them. For example, it can take 25,000 GPU-hours to train popular natural language processing (NLP) model “RoBERTa“. Although there are techniques and optimizations that customers can apply to reduce the time it can take to train a model, these also take time to implement and require a rare skillset. This can impede innovation and progress in the wider adoption of artificial intelligence (AI).

How has this been done to date?
Typically, there are three ways to speed up training:

  1. Using more powerful, individual machines to process the calculations
  2. Distributing compute across a cluster of GPU instances to train the model in parallel
  3. Optimizing model code to run more efficiently on GPUs by utilizing less memory and compute.

In practice, optimizing machine learning (ML) code is difficult, time-consuming, and a rare skill set to acquire. Data scientists typically write their training code in a Python-based ML framework, such as TensorFlow or PyTorch, relying on ML frameworks to convert their Python code into mathematical functions that can run on GPUs, commonly known as kernels. However, this translation from the Python code of a user is often inefficient because ML frameworks use pre-built, generic GPU kernels, instead of creating kernels specific to the code and model of the user.

It can take even the most skilled GPU programmers months to create custom kernels for each new model and optimize them. We built SageMaker Training Compiler to solve this problem.

Today’s launch lets SageMaker Training Compiler automatically compile your Python training code and generate GPU kernels specifically for your model. Consequently, the training code will use less memory and compute, and therefore train faster. For example, when fine-tuning Hugging Face’s GPT-2 model, SageMaker Training Compiler reduced training time from nearly 3 hours to 90 minutes.

Automatically Optimizing Deep Learning Models
So, how have we achieved this acceleration? SageMaker Training Compiler accelerates training jobs by converting DL models from their high-level language representation to hardware-optimized instructions that train faster than jobs with off-the-shelf frameworks. Under the hood, SageMaker Training Compiler makes incremental optimizations beyond what the native PyTorch and TensorFlow frameworks offer to maximize compute and memory utilization on SageMaker GPU instances.

More specifically, SageMaker Training Compiler uses graph-level optimization (operator fusion, memory planning, and algebraic simplification), data flow-level optimizations (layout transformation, common sub-expression elimination), and back-end optimizations (memory latency hiding, loop oriented optimizations) to produce an optimized model that efficiently uses hardware resources. As a result, training is accelerated by up to 50%, and the returned model is the same as if SageMaker Training Compiler had not been used.

But how do you use SageMaker Training Compiler with your models? It can be as simple as adding two lines of code!

SageMaker Training Compiler Code Changes

The shortened training times mean that customers gain more time for innovating and deploying their newly-trained models at a reduced cost and a greater ability to experiment with larger models and more data.

Getting the most from SageMaker Training Compiler
Although many DL models can benefit from SageMaker Training Compiler, larger models with longer training will realize the greatest time and cost savings. For example, training time and costs fell by 30% on a long-running RoBERTa-base fine-tuning exercise.

Jorge Lopez Grisman, a Senior Data Scientist at Quantum Health – an organization on a mission to “make healthcare navigation smarter, simpler, and more cost-effective for everyone” – said:

“Iterating with NLP models can be a challenge because of their size: long training times bog down workflows and high costs can discourage our team from trying larger models that might offer better performance. Amazon SageMaker Training Compiler is exciting because it has the potential to alleviate these frictions. Achieving a speedup with SageMaker Training Compiler is a real win for our team that will make us more agile and innovative moving forward.”

Further Resources
To learn more about how Amazon SageMaker Training Compiler can benefit you, you can visit our page here. And to get started see our technical documentation here.

New – Create and Manage EMR Clusters and Spark Jobs with Amazon SageMaker Studio

Post Syndicated from Sean M. Tracey original https://aws.amazon.com/blogs/aws/new-create-and-manage-emr-clusters-and-spark-jobs-with-amazon-sagemaker-studio/

Today, we’re very excited to offer three new enhancements to our Amazon SageMaker Studio service.

As of now, users of SageMaker Studio can create, terminate, manage, discover, and connect to Amazon EMR clusters running within a single AWS account and in shared accounts across an organization—all directly from SageMaker Studio. Furthermore, SageMaker Studio Notebook users can able to utilize SparkUI to monitor and debug Spark jobs running on an Amazon EMR cluster—directly from the SageMaker Studio Notebooks!

The story so far…
Before today, SageMaker Studio users had some ability to find and connect with EMR clusters, provided that they were running in the same account as SageMaker Studio. While useful in many circumstances, if a cluster did not exist that would suit the requirements of the model or analysis being run, then data scientists would have to leave their development environment and manually configure a cluster that suited their needs. As well as being disruptive to workflow of data scientists, there are no guarantees that the data scientists would have either the permissions or depth of knowledge required to provision a cluster that would enable them to continue with their work. Additionally, being restricted to creating and managing clusters in a single account could become prohibitive in organizations working across many AWS accounts.

What’s new?
Data scientists can:

  • Discover, manage, create, terminate, and connect to Amazon EMR clusters from within SageMaker Studio
  • Utilize “templates” – a new way to configure and provision clusters for your workload needs with support from seasoned DevOps practitioners
  • Connect to, debug, and monitor Spark jobs running on an Amazon EMR cluster from within a SageMaker Studio Notebook

Creating, Connecting to, and Managing EMR Clusters

Connecting to an EMR Cluster from a SageMaker Studio Notebook

With the ability to connect to and manage EMR clusters from within SageMaker Studio, data scientists no longer have to leave their familiar environment to create, configure and provision the EMR clusters where they run their workloads.

Introducing Templates
A template is a collection of off-the-shelf cluster configurations optimized for numerous workloads. Templates can be created and managed by DevOps administrators and made available through the AWS Service Catalog to data scientists within SageMaker Studio. This lets them quickly spin up a cluster to meet their needs, all while safe in the knowledge that a trusted DevOps admin has correctly configured a cluster per the project’s requirements. Furthermore, this lets data scientists get on with the work they do best, and it gives DevOps administrators within these teams greater ability to manage the types of provisioned infrastructure.

Managing EMR Clusters from within SageMaker Studio Notebooks

Directly Connect to and monitor Spark Jobs
Finally, to make the job of data scientists even simpler, we’ve built the ability to connect to, debug, and monitor Spark jobs running on an Amazon EMR cluster from within a SageMaker Studio Notebook. Before now, to access the monitoring UI of a Spark Job, one needed to configure secure tunnels and web proxies to gain direct access to currently executing jobs, adding friction to the workflow of a data scientist trying to observe and debug their workloads. Now, with these new features, users will have one-click access directly from the interface that they already know. This enables them to build and put their workloads to work, rather than spending time on configuring infrastructure and workloads.

Connecting to a Spark Job from within a SageMaker Studio Notebook

These new features let data scientists can use a simple, consistent UI to provision and manage infrastructure as needed without ever having to leave SageMaker Studio or dive into the minutiae of the provisioning of such hardware – Moreover, they won’t have to spend time configuring proxies and SSH tunnels to debug and monitor ongoing Spark jobs.

Find out more
These features are generally available in all AWS Regions where SageMaker Studio is available, and there are no additional charges to use this capability. For complete information on pricing and regional availability, please refer to the SageMaker Studio pricing page .

To learn more, see our documentation.

Announcing Amazon SageMaker Ground Truth Plus

Post Syndicated from Sean M. Tracey original https://aws.amazon.com/blogs/aws/announcing-amazon-sagemaker-ground-truth-plus/

Today, we’re pleased to announce the latest service in the Amazon SageMaker suite that will make labeling datasets easier than ever before. Ground Truth Plus is a turn-key service that uses an expert workforce to deliver high-quality training datasets fast, and reduces costs by up to 40 percent.

The Challenges of Machine Learning Model Creation
One of the biggest challenges in building and training machine learning (ML) models is sourcing enough high-quality, labeled data at scale to feed into and train those models so that they can make an accurate prediction.

On the face of it, labeling data might seem like a fairly straightforward task…

  • Step 1: Get data
  • Step 2: Label it

…but this is far from the reality.

Even before you have labelers begin annotations, you need a custom labeling workflow and user interface specific to your project so that you get a high-quality dataset. This relies on a combination of robust tooling and skilled workers, and the effort spent can be significant.

Once the data labeling workflow and user interface has been constructed, a workforce to use those systems must be organized and trained – and this is all before a single point of data has been labeled!

Finally, once the labeling systems have been built, the workflows designed, and the workforce trained and deployed, the process of passing data through that system must be monitored and checked to ensure a consistent, high-quality output. After enough data has been passed through and labeled by the system, you have arrived at the point you’ve been trying to get to all along: you finally have enough data to train the ML model.

Each of these steps represents a significant investment in time, costs, and energy. You could be spending these resources building ML models instead of labeling and managing data, and using Ground Truth Plus can help free you up to do just that.

Introducing Amazon SageMaker Ground Truth Plus
Amazon SageMaker Ground Truth Plus enables you to easily create high-quality training datasets without having to build labeling applications and manage the labeling workforce on your own. Which means you don’t even need to have deep ML expertise or extensive knowledge of workflow design and quality management. You simply provide data along with labeling requirements and Ground Truth Plus sets up the data labeling workflows and manages them on your behalf in accordance with your requirements.

For example, if you need medical experts to label radiology images, you can specify that in the guidelines you provide to Ground Truth Plus. The service will then automatically select labelers trained in radiology to label your data, and from there an expert workforce that is trained on a variety of ML tasks will start labeling the data. Ground Truth Plus brings ML-powered automation to data labeling, which increases the quality of the output dataset and decreases the data labeling costs.

Amazon SageMaker Ground Truth Plus uses a multi-step labeling workflow including ML techniques for active learning, pre-labeling, and machine validation. This reduces the time required to label datasets for a variety of use cases including computer vision and natural language processing. Finally, Ground Truth Plus provides transparency into data labeling operations and quality management through interactive dashboards and user interfaces. This lets you monitor the progress of training datasets across multiple projects, track project metrics such as daily throughput, inspect labels for quality, and provide feedback on the labeled data.

How Does It Work?
A SageMaker Ground Truth Plus screenshot showing the request formFirst, let’s head to the new Ground Truth Plus console and fill out a form outlining the requirements for the data labeling project. Following that, our team of AWS Experts will schedule a call to discuss your data labeling project.

After the call, you simply upload data in an Amazon Simple Storage Service (Amazon S3) bucket for labeling.

Once the data has been uploaded, our experts will set-up the data labeling workflow per your requirements and create a team of labelers with the expertise necessary to label your data effectively. This helps make sure that you have the best people possible working on your projects.

These expert labelers use the Ground Truth Plus tools we’ve built to label these datasets quickly and effectively.

Initially, labelers will annotate the data you’ve uploaded, much like the following example image that we’ve uploaded from the CBCL StreetScenes dataset. However, as the labelers start to submit examples of labeled data, something cool begins happening: our ML systems kick in and start to pre-label the images on behalf of the expert workforce!

An example of the raw dataset used to demonstrate Amazon SageMaker Ground Truth Plus functionality

As more and more data is labeled by the expert workforce, the ML model becomes better at pre-labeling those images. This means that there’s less need for a human to spend as much time creating each individual label for every object of interest in a dataset. Less time spent on labeling means lower costs for you, and it also means a quicker turnaround in creating a dataset that can be used for training a model – all without sacrificing quality.

A screenshot showing one of the labelling interfaces for SageMaker Ground Truth Plus

As the process continues, these ML models will also start to highlight potential areas of interest that the labeling workforce may have missed or incorrectly labeled through machine validation (indicated below by the purple box). Once an area of interest has been highlighted, a human labeler can view and either confirm or delete the suggestion that the model has made. This iteratively improves the pre-labeling and machine validation stages, further reducing the time needed by a labeler to manually label the data, and ensures a high-quality output throughout the process.

A screenshot showing one of the labelling interfaces filed in my a machine learning model for SageMaker Ground Truth Plus

While this is all going on, you can monitor the progress and output of the project using the Ground Truth Plus Project Portal. Within this portal, you can track the amount of data labeled on a day-by-day basis, and make sure that the project is progressing at an acceptable rate.

A screenshot showing the metrics dashboard enabling users to track the progress of their labelling jobs in SageMaker Ground Truth Plus

With each batch of images uploaded and labeled, you can decide whether to accept them or send them back for relabeling if something has been missed.

Finally, when the labeling process has completed, you can retrieve the labeled data from a secure S3 bucket and get to the business of training models.

Find out more
Today, Amazon SageMaker Ground Truth Plus is available in the N. Virginia (us-east-1) region.

To learn more:

New – Amazon DevOps Guru for RDS to Detect, Diagnose, and Resolve Amazon Aurora-Related Issues using ML

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/new-amazon-devops-guru-for-rds-to-detect-diagnose-and-resolve-amazon-aurora-related-issues-using-ml/

Today we are announcing Amazon DevOps Guru for RDS, a new capability for Amazon DevOps Guru. It allows developers to easily detect, diagnose, and resolve performance and operational issues in Amazon Aurora.

Hundreds of thousands of customers nowadays are using Amazon Aurora because it is highly available, scalable, and durable. But as applications grow in size and complexity, it becomes more challenging for these customers to detect and resolve operational and performance issues quickly.

During last year’s re:Invent, we announced DevOps Guru, a service that uses machine learning (ML) to automatically detect and alert customers of application issues, including database problems. Today we are announcing DevOps Guru for RDS to help developers using Amazon Aurora databases to detect, diagnose, and resolve database performance issues fast and at scale. Now developers will have enough information to determine the exact cause for a database performance issue. This launch will save developers and engineers many hours of work trying to uncover and remediate the performance-related database issues.

DevOps Guru for RDS uses ML to automatically identify and analyze a wide range of performance-related database issues, such as over-utilization of host resources, database bottlenecks, or misbehavior of SQL queries. It also recommends solutions to remediate the issues it finds. To use this capability, you don’t need to be a database or ML expert.

When an issue is detected, DevOps Guru for RDS displays the finding in the DevOps Guru console and sends notifications using Amazon EventBridge or Amazon Simple Notification Service (SNS). This allows developers to automatically manage and take real-time action on the issues.

How DevOps Guru for RDS Works
DevOps Guru for RDS uses anomaly detection on the database load (DB load) performance metric to detect issues. DB load is measured in units of Average Active Sessions (AAS). DB load measures the level of activity in your database, making it a great metric to understand the health of your database. If the DB load is high, this can result in performance issues. This metric can be compared to the number of virtual CPUs (vCPUs), and if the DB load is higher than that number, issues can arise.

The most useful dimensions for this metric are the wait events and the top SQL. The wait event describes what the system conditions that are currently running SQLs are waiting on. The most common reasons why a statement is waiting is that it is waiting for the CPU, waiting for a read or write, or waiting for a locked resource. The top SQL dimension shows which queries are contributing the most to DB load.

The following image is an example of a finding that DevOps Guru for RDS reported. The graph shows that from the AAS, most of them were waiting for access to a table or for CPU.

Example of anomaly detection
If you continue scrolling on the DevOps Guru for RDS analysis page, you can discover the cause for the problem and some recommendations to fix it. In this particular example, two problems were detected: high-load wait events and CPU capacity exceeded.

DevOps Guru for RDS looks more in-depth into these problems. First, it looks at the high-load wait events, where there were 27 AAS for the IO and CPU wait types, which is 99 percent of the total DB load.

Second, it tells us that the running tasks exceeded six processes. This database only has two vCPUs, and the recommended number of running processes should be a maximum of four (2x vCPUs). DevOps Guru for RDS also makes recommendations to fix these issues.

Recommendations

In another anomaly, the graph shows that there was a high load of wait events, and one SQL query was found to require further investigation. You can even see the exact SQL query if you click on the SQL digest IDs. The insight’s analysis and recommendation section is full of information on how to investigate further and fix the issue. You can get a lot of detailed information by clicking on the wait event, for example, on the wait event wait/io/table/sql/handler or in the View troubleshooting doc link.

Analysis and recommendations

Get started with DevOps Guru for RDS
To get started with this new capability of DevOps Guru, make sure that Performance Insights is enabled for your Amazon Aurora DB instances. It supports Amazon Aurora with MySQL- and PostgreSQL-compatibility. For instructions on how to enable Performance Insights, see Enabling and disabling Performance Insights.

The next step is to enable DevOps Guru to start monitoring your AWS resources. You can specify the resources you want to be covered by DevOps Guru.

If you are already using DevOps Guru, whenever there is a new insight for an Amazon Aurora database resource, you will see it in the console.

To see the detailed database analysis, navigate to the Insight page and select the new View analysis button under the DB load aggregated metric. That button will take you to the detailed analysis by DevOps Guru for RDS.

View analysis

Pricing and Availability
DevOps Guru for RDS is offered to customers at no additional charge, as part of the existing price that DevOps Guru charges customers for RDS resources.

DevOps Guru for RDS is available in all Regions where DevOps Guru is available, US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), and Europe (Stockholm).

Learn more about DevOps Guru for RDS and check out the talk at AWS re:Invent “Automatically detect and resolve performance issues with Amazon DevOps Guru for RDS” (Session Id 15877).

Marcia

Announcing Amazon SageMaker Canvas – a Visual, No Code Machine Learning Capability for Business Analysts

Post Syndicated from Alex Casalboni original https://aws.amazon.com/blogs/aws/announcing-amazon-sagemaker-canvas-a-visual-no-code-machine-learning-capability-for-business-analysts/

As an organization facing business problems and dealing with data on a daily basis, the ability to build systems that can predict business outcomes becomes very important. This ability lets you solve problems and move faster by automating slow processes and embedding intelligence in your IT systems.

But how do you make sure that all teams and individual decision makers in the organization are empowered to create these machine learning (ML) systems at scale, and without depending on other data science and data engineering teams? As a business user or data analyst, you’d like to build and use prediction systems based on the data that you analyze and process every day, without having to learn about hundreds of algorithms, training parameters, evaluation metrics, and deployment best practices.

Today, I’m excited to announce the general availability of Amazon SageMaker Canvas, a new visual, no code capability that allows business analysts to build ML models and generate accurate predictions without writing code or requiring ML expertise. Its intuitive user interface lets you browse and access disparate data sources in the cloud or on-premises, combine datasets with the click of a button, train accurate models, and then generate new predictions once new data is available.

SageMaker Canvas leverages the same technology as Amazon SageMaker to automatically clean and combine your data, create hundreds of models under the hood, select the best performing one, and generate new individual or batch predictions. It supports multiple problem types such as binary classification, multi-class classification, numerical regression, and time series forecasting. These problem types let you address business-critical use cases, such as fraud detection, churn reduction, and inventory optimization, without writing a single line of code.

SageMaker Canvas in Action
Imagine that I’m an e-commerce manager who needs to predict whether or not a product will be shipped on time. The datasets at my disposal consist of a product catalog and the historical shipping dataset, both in CSV format.

First, I enter the SageMaker Canvas application where all of my models and datasets are created and inspected.

I select Import, and upload two CSV files: ProductData.csv and ShippingData.csv. I have 120 products and 10,000 shipping records.

I could also fetch data from Amazon Simple Storage Service (Amazon S3) or connect to other cloud or on-premises data sources, such as Amazon Redshift or Snowflake. For this use case, I prefer to upload 1.6 MB of data directly from my computer.

Before confirming the import, I have a chance to preview the two datasets, their columns, and their respective values. For example, each product has a ComputerBrand, ScreenSize, and PackageWeight. In addition to useful columns such as ShippingOrigin, OrderDate, and ShippingPriority, each record in the shipping dataset also contains OnTimeDelivery, which is either On Time or Late. This column will be used by SageMaker Canvas to generate a prediction model based on historical data.

After a few seconds of processing, the datasets are ready, and I decide to join them to create a single dataset containing both product and shipping information. This is an optional step that often lets you increase the precision of a prediction model.

Now I can simply drag and drop the two datasets: SageMaker Canvas will automatically identify the shared ProductId column and apply an Inner Join transformation.

The join preview lets me visualize the resulting columns, identify missing or invalid values, and optionally deselect unwanted columns.

I select Save joined data and provide a new name for this joined dataset, which now includes 16 columns and 10,000 records.

Next, I want to create a model and start by selecting New model in the Models section on the left menu. I call it On Time Prediction Model.

The first step is selecting a dataset.

I select a target column that my model will predict: OnTimeDelivery.

SageMaker Canvas shows me the value distribution and already recommends the most appropriate model type: two categories classification.

Before proceeding with the model training, I have the option to generate an analysis report. This analysis gives me two very important pieces of information: the estimated accuracy and the impact of each column.

The estimated accuracy of 99.9% gives me confidence, but then I notice that the highest impact is provided by the ActualShippingDays column. Unfortunately, this column is not available in advance and I can’t use it for my predictions. So I deselect it and run the analysis again.

The new estimated accuracy is 94.2%, which is still pretty high. The most impactful columns are ShippingPriority, YShippingDistance, XShippingDistance, and Carrier. This is great because all of this information is available in advance and can be used for a prediction. On the other hand, product-related columns, such PackageWeight and ScreenSize, have very small impacts on the prediction. This means that in the future I could simplify the overall process by feeding only shipping information into the training and prediction phases.

I’m happy with the analysis insights. Therefore, I decide to proceed and build a prediction model by selecting the Standard build option.

Now I can go for a walk, attend a few productive meetings, or simply spend some time with family. SageMaker Canvas is doing all of the work for me, training hundreds of models behind the scenes. It will select the best performing one, so that I can start generating accurate predictions in a couple of hours. Of course, the training duration will vary depending on the dataset size and problem type.

After about an hour and a half, the model is ready and the console lets me analyze its accuracy and the column impacts visually. I’m also happy to see that the model predicts the correct value 95.8% of the time, which is even higher than the estimated accuracy.

Optionally, I could also inspect advanced metrics such as Precision, Recall, F1 Score, and so on. These metrics help me understand how the model is performing and what kind of false positives and false negatives I can expect from this model.

From here, I could share the model into Amazon SageMaker Studio or continue using the Canvas UI to generate new predictions.

I decide to continue with the intuitive UI and select Predict. Now I can work with individual records or with a dataset for batch predictions.

When selecting Single prediction, SageMaker Canvas simplifies my life and lets me start from an existing record. I modify the column values and get immediate feedback on the prediction and the corresponding feature importance.

This quick feedback loop and intuitive UI allows me to use the ML model without having to write custom code. In case I decide to integrate the model into an automated production system, the Amazon SageMaker Studio integration lets me share the model easily with other data scientists in my team.

Generally Available Today
SageMaker Canvas is generally available in US East (Ohio), US East (N. Virginia), US West (Oregon), Europe (Frankfurt), and Europe (Ireland). You can start using it with your local datasets, as well as data already stored on Amazon S3, Amazon Redshift, or Snowflake. With just a few clicks, you’ll prepare and join your datasets, analyze estimated accuracy, verify which columns are impactful, train the best performing model, and generate new individual or batch predictions. We’re excited to hear your feedback and help you solve even more business problems with ML.

Alex

How do we develop AI education in schools? A panel discussion

Post Syndicated from Sue Sentance original https://www.raspberrypi.org/blog/ai-education-schools-panel-uk-policy/

AI is a broad and rapidly developing field of technology. Our goal is to make sure all young people have the skills, knowledge, and confidence to use and create AI systems. So what should AI education in schools look like?

To hear a range of insights into this, we organised a panel discussion as part of our seminar series on AI and data science education, which we co-host with The Alan Turing Institute. Here our panel chair Tabitha Goldstaub, Co-founder of CogX and Chair of the UK government’s AI Council, summarises the event. You can also watch the recording below.

As part of the Raspberry Pi Foundation’s monthly AI education seminar series, I was delighted to chair a special panel session to broaden the range of perspectives on the subject. The members of the panel were:

  • Chris Philp, UK Minister for Tech and the Digital Economy
  • Philip Colligan, CEO of the Raspberry Pi Foundation 
  • Danielle Belgrave, Research Scientist, DeepMind
  • Caitlin Glover, A level student, Sandon School, Chelmsford
  • Alice Ashby, student, University of Brighton

The session explored the UK government’s commitment in the recently published UK National AI Strategy stating that “the [UK] government will continue to ensure programmes that engage children with AI concepts are accessible and reach the widest demographic.” We discussed what it will take to make this a reality, and how we will ensure young people have a seat at the table.

Two teenage girls do coding during a computer science lesson.

Why AI education for young people?

It was clear that the Minister felt it is very important for young people to understand AI. He said, “The government takes the view that AI is going to be one of the foundation stones of our future prosperity and our future growth. It’s an enabling technology that’s going to have almost universal applicability across our entire economy, and that is why it’s so important that the United Kingdom leads the world in this area. Young people are the country’s future, so nothing is complete without them being at the heart of it.”

A teacher watches two female learners code in Code Club session in the classroom.

Our panelist Caitlin Glover, an A level student at Sandon School, reiterated this from her perspective as a young person. She told us that her passion for AI started initially because she wanted to help neurodiverse young people like herself. Her idea was to start a company that would build AI-powered products to help neurodiverse students.

What careers will AI education lead to?

A theme of the Foundation’s seminar series so far has been how learning about AI early may impact young people’s career choices. Our panelist Alice Ashby, who studies Computer Science and AI at Brighton University, told us about her own process of deciding on her course of study. She pointed to the fact that terms such as machine learning, natural language processing, self-driving cars, chatbots, and many others are currently all under the umbrella of artificial intelligence, but they’re all very different. Alice thinks it’s hard for young people to know whether it’s the right decision to study something that’s still so ambiguous.

A young person codes at a Raspberry Pi computer.

When I asked Alice what gave her the courage to take a leap of faith with her university course, she said, “I didn’t know it was the right move for me, honestly. I took a gamble, I knew I wanted to be in computer science, but I wanted to spice it up.” The AI ecosystem is very lucky that people like Alice choose to enter the field even without being taught what precisely it comprises.

We also heard from Danielle Belgrave, a Research Scientist at DeepMind with a remarkable career in AI for healthcare. Danielle explained that she was lucky to have had a Mathematics teacher who encouraged her to work in statistics for healthcare. She said she wanted to ensure she could use her technical skills and her love for math to make an impact on society, and to really help make the world a better place. Danielle works with biologists, mathematicians, philosophers, and ethicists as well as with data scientists and AI researchers at DeepMind. One possibility she suggested for improving young people’s understanding of what roles are available was industry mentorship. Linking people who work in the field of AI with school students was an idea that Caitlin was eager to confirm as very useful for young people her age.

We need investment in AI education in school

The AI Council’s Roadmap stresses how important it is to not only teach the skills needed to foster a pool of people who are able to research and build AI, but also to ensure that every child leaves school with the necessary AI and data literacy to be able to become engaged, informed, and empowered users of the technology. During the panel, the Minister, Chris Philp, spoke about the fact that people don’t have to be technical experts to come up with brilliant ideas, and that we need more people to be able to think creatively and have the confidence to adopt AI, and that this starts in schools. 

A class of primary school students do coding at laptops.

Caitlin is a perfect example of a young person who has been inspired about AI while in school. But sadly, among young people and especially girls, she’s in the minority by choosing to take computer science, which meant she had the chance to hear about AI in the classroom. But even for young people who choose computer science in school, at the moment AI isn’t in the national Computing curriculum or part of GCSE computer science, so much of their learning currently takes place outside of the classroom. Caitlin added that she had had to go out of her way to find information about AI; the majority of her peers are not even aware of opportunities that may be out there. She suggested that we ensure AI is taught across all subjects, so that every learner sees how it can make their favourite subject even more magical and thinks “AI’s cool!”.

A primary school boy codes at a laptop with the help of an educator.

Philip Colligan, the CEO here at the Foundation, also described how AI could be integrated into existing subjects including maths, geography, biology, and citizenship classes. Danielle thoroughly agreed and made the very good point that teaching this way across the school would help prepare young people for the world of work in AI, where cross-disciplinary science is so important. She reminded us that AI is not one single discipline. Instead, many different skill sets are needed, including engineering new AI systems, integrating AI systems into products, researching problems to be addressed through AI, or investigating AI’s societal impacts and how humans interact with AI systems.

On hearing about this multitude of different skills, our discussion turned to the teachers who are responsible for imparting this knowledge, and to the challenges they face. 

The challenge of AI education for teachers

When we shifted the focus of the discussion to teachers, Philip said: “If we really want to equip every young person with the knowledge and skills to thrive in a world that shaped by these technologies, then we have to find ways to evolve the curriculum and support teachers to develop the skills and confidence to teach that curriculum.”

Teenage students and a teacher do coding during a computer science lesson.

I asked the Minister what he thought needed to happen to ensure we achieved data and AI literacy for all young people. He said, “We need to work across government, but also across business and society more widely as well.” He went on to explain how important it was that the Department for Education (DfE) gets the support to make the changes needed, and that he and the Office for AI were ready to help.

Philip explained that the Raspberry Pi Foundation is one of the organisations in the consortium running the National Centre for Computing Education (NCCE), which is funded by the DfE in England. Through the NCCE, the Foundation has already supported thousands of teachers to develop their subject knowledge and pedagogy around computer science.

A recent study recognises that the investment made by the DfE in England is the most comprehensive effort globally to implement the computing curriculum, so we are starting from a good base. But Philip made it clear that now we need to expand this investment to cover AI.

Young people engaging with AI out of school

Philip described how brilliant it is to witness young people who choose to get creative with new technologies. As an example, he shared that the Foundation is seeing more and more young people employ machine learning in the European Astro Pi Challenge, where participants run experiments using Raspberry Pi computers on board the International Space Station. 

Three teenage boys do coding at a shared computer during a computer science lesson.

Philip also explained that, in the Foundation’s non-formal CoderDojo club network and its Coolest Projects tech showcase events, young people build their dream AI products supported by volunteers and mentors. Among these have been autonomous recycling robots and AI anti-collision alarms for bicycles. Like Caitlin with her company idea, this shows that young people are ready and eager to engage and create with AI.

We closed out the panel by going back to a point raised by Mhairi Aitken, who presented at the Foundation’s research seminar in September. Mhairi, an Alan Turing Institute ethics fellow, argues that children don’t just need to learn about AI, but that they should actually shape the direction of AI. All our panelists agreed on this point, and we discussed what it would take for young people to have a seat at the table.

A Black boy uses a Raspberry Pi computer at school.

Alice advised that we start by looking at our existing systems for engaging young people, such as Youth Parliament, student unions, and school groups. She also suggested adding young people to the AI Council, which I’m going to look into right away! Caitlin agreed and added that it would be great to make these forums virtual, so that young people from all over the country could participate.

The panel session was full of insight and felt very positive. Although the challenge of ensuring we have a data- and AI-literate generation of young people is tough, it’s clear that if we include them in finding the solution, we are in for a bright future. 

What’s next for AI education at the Raspberry Pi Foundation?

In the coming months, our goal at the Foundation is to increase our understanding of the concepts underlying AI education and how to teach them in an age-appropriate way. To that end, we will start to conduct a series of small AI education research projects, which will involve gathering the perspectives of a variety of stakeholders, including young people. We’ll make more information available on our research pages soon.

In the meantime, you can sign up for our upcoming research seminars on AI and data science education, and peruse the collection of related resources we’ve put together.

The post How do we develop AI education in schools? A panel discussion appeared first on Raspberry Pi.

New – Amazon EC2 G5g Instances Powered by AWS Graviton2 Processors and NVIDIA T4G Tensor Core GPUs

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/new-amazon-ec2-g5g-instances-powered-by-aws-graviton2-processors-and-nvidia-t4g-tensor-core-gpus/

AWS Graviton2 processors are custom-designed by AWS to enable the best price performance in Amazon EC2. Thousands of customers are realizing significant price performance benefits for a wide variety of workloads with Graviton2-based instances.

Today, we are announcing the general availability of Amazon EC2 G5g instances that extend Graviton2 price-performance benefits to GPU-based workloads including graphics applications and machine learning inference. In addition to Graviton2 processors, G5g instances feature NVIDIA T4G Tensor Core GPUs to provide the best price performance for Android game streaming, with up to 25 Gbps of networking bandwidth and 19 Gbps of EBS bandwidth.

These instances provide up to 30 percent lower cost per stream per hour for Android game streaming than x86-based GPU instances. G5g instances are also ideal for machine learning developers who are looking for cost-effective inference, have ML models that are sensitive to CPU performance, and leverage NVIDIA’s AI libraries.

G5g instances are available in the six sizes as shown below.

Instance Name vCPUs Memory (GB) NVIDIA T4G Tensor Core GPU GPU Memory (GB) EBS Bandwidth (Gbps) Network Bandwidth (Gbps)
g5g.xlarge 4 8 1 16 Up to 3.5 Up to 10
g5g.2xlarge 8 16 1 16 Up to 3.5 Up to 10
g5g.4xlarge 16 32 1 16 Up to 3.5 Up to 10
g5g.8xlarge 32 64 1 16 9 12
g5g.16xlarge 64 128 2 32 19 25
g5g.metal 64 128 2 32 19 25

These instances are a great fit for many interesting types of workloads. Here are a few examples:

  • Streaming Android gaming—With G5g instances, Android game developers can build natively on Arm-based GPU instances without the need for cross-compilation or emulation on x86-based instances. They can encode the rendered graphics and stream the game over the network to a mobile device. This helps simplify development efforts and time and lowers the cost per stream per hour by up to 30 percent.
  • ML Inference —G5g instances are also ideal for machine learning developers who are looking for cost-effective inference, have ML models that are sensitive to CPU performance, and leverage NVIDIA’s AI If you don’t have any dependencies on NVIDIA software, you may use Inf1 instances, which deliver up to 70 percent lower cost-per-inference than G4dn instances.
  • Graphics rendering—G5g instances are the most cost-effective option for customers with rendering workloads and dependencies on NVIDIA libraries. These instances also support rendering applications and use cases that leverage industry-standard APIs such as OpenGL and Vulkan.
  • Autonomous Vehicle Simulations—Several of our customers are designing and simulating autonomous vehicles that include multiple real-time sensors. They can use ray tracing to simulate sensor input in real time.

The instances are compatible with a very long list of graphical and machine learning libraries on Linux, including NVENC, NVDEC, nvJPEG, OpenGL, Vulkan, CUDA, CuDNN, CuBLAS, and TensorRT.

Available Now
The new G5g instances are available now, and you can start using them today in the US East (N. Virginia), US West (Oregon), and Asia-Pacific (Seoul, Singapore and Tokyo) Regions in On-Demand, Spot, Savings Plan, and Reserved Instance form. To learn more, see the EC2 pricing page.

G5g instances are available now in AWS Deep Learning AMIs with NVIDIA drivers and popular ML frameworks, Amazon Elastic Container Service (Amazon ECS), or Amazon Elastic Kubernetes Service (Amazon EKS) clusters for containerized ML applications.

You can send feedback to the AWS forum for Amazon EC2 or through your usual AWS Support contacts.

Channy

Amazon CodeGuru Reviewer Introduces Secrets Detector to Identify Hardcoded Secrets and Secure Them with AWS Secrets Manager

Post Syndicated from Alex Casalboni original https://aws.amazon.com/blogs/aws/codeguru-reviewer-secrets-detector-identify-hardcoded-secrets/

Amazon CodeGuru helps you improve code quality and automate code reviews by scanning and profiling your Java and Python applications. CodeGuru Reviewer can detect potential defects and bugs in your code. For example, it suggests improvements regarding security vulnerabilities, resource leaks, concurrency issues, incorrect input validation, and deviation from AWS best practices.

One of the most well-known security practices is the centralization and governance of secrets, such as passwords, API keys, and credentials in general. As many other developers facing a strict deadline, I’ve often taken shortcuts when managing and consuming secrets in my code, using plaintext environment variables or hard-coding static secrets during local development, and then inadvertently commit them. Of course, I’ve always regretted it and wished there was an automated way to detect and secure these secrets across all my repositories.

Today, I’m happy to announce the new Amazon CodeGuru Reviewer Secrets Detector, an automated tool that helps developers detect secrets in source code or configuration files, such as passwords, API keys, SSH keys, and access tokens.

These new detectors use machine learning (ML) to identify hardcoded secrets as part of your code review process, ultimately helping you to ensure that all new code doesn’t contain hardcoded secrets before being merged and deployed. In addition to Java and Python code, secrets detectors also scan configuration and documentation files. CodeGuru Reviewer suggests remediation steps to secure your secrets with AWS Secrets Manager, a managed service that lets you securely and automatically store, rotate, manage, and retrieve credentials, API keys, and all sorts of secrets.

This new functionality is included as part of the CodeGuru Reviewer service at no additional cost and supports the most common API providers, such as AWS, Atlassian, Datadog, Databricks, GitHub, Hubspot, Mailchimp, Salesforce, SendGrid, Shopify, Slack, Stripe, Tableau, Telegram, and Twilio. Check out the full list here.

Secrets Detectors in Action
First, I select CodeGuru from the AWS Secrets Manager console. This new flow lets me associate a new repository and run a full repository analysis with the goal of identifying hardcoded secrets.

Associating a new repository only takes a few seconds. I connect my GitHub account, and then select a repository named hawkcd, which contains a few Java, C#, JavaScript, and configuration files.

A few minutes later, my full repository is successfully associated and the full scan is completed. I could also have a look at a demo repository analysis called DemoFullRepositoryAnalysisSecrets. You’ll find this demo in the CodeGuru console, under Full repository analysis, in your AWS Account.

I select the repository analysis and find 42 recommendations, including one recommendation for a hardcoded secret (you can filter recommendations by Type=Secrets). CodeGuru Reviewer identified a hardcoded AWS Access Key ID in a .travis.yml file.

The recommendation highlights the importance of storing these secrets securely, provides a link to learn more about the issue, and suggests rotating the identified secret to make sure that it can’t be reused by malicious actors in the future.

CodeGuru Reviewer lets me jump to the exact file and line of code where the secret appears, so that I can dive deeper, understand the context, verify the file history, and take action quickly.

Last but not least, the recommendation includes a Protect your credential button that lets me jump quickly to the AWS Secrets Manager console and create a new secret with the proper name and value.

I’m going to remove the plaintext secret from my source code and update my application to fetch the secret value from AWS Secrets Manager. In many cases, you can keep the current configuration structure and use existing parameters to store the secret’s name instead of the secret’s value.

Once the secret is securely stored, AWS Secrets Manager also provides me with code snippets that fetch my new secret in many programming languages using the AWS SDKs. These snippets let me save time and include the necessary SDK call, as well as the error handling, decryption, and decoding logic.

I’ve showed you how to run a full repository analysis, and of course the same analysis can be performed continuously on every new pull request to help you prevent hardcoded secrets and other issues from being introduced in the future.

Available Today with CodeGuru Reviewer
CodeGuru Reviewer Secrets Detector is available in all regions where CodeGuru Reviewer is available, at no additional cost.

If you’re new to CodeGuru Reviewer, you can try it for free for 90 days with repositories up to 100,000 lines of code. Connecting your repositories and starting a full scan takes only a couple of minutes, whether your code is hosted on AWS CodeCommit, BitBucket, or GitHub. If you’re using GitHub, check out the GitHub Actions integration as well.

You can learn more about Secrets Detector in the technical documentation.

Alex

The machine learning effect: Magic boxes and computational thinking 2.0

Post Syndicated from Jane Waite original https://www.raspberrypi.org/blog/machine-learning-education-school-computational-thinking-2-0-research-seminar/

How does teaching children and young people about machine learning (ML) differ from teaching them about other aspects of computing? Professor Matti Tedre and Dr Henriikka Vartiainen from the University of Eastern Finland shared some answers at our latest research seminar.

Three smiling young learners in a computing classroom.
We need to determine how to teach young people about machine learning, and what teachers need to know to help their learners form correct mental models.

Their presentation, titled ‘ML education for K-12: emerging trajectories’, had a profound impact on my thinking about how we teach computational thinking and programming. For this blog post, I have simplified some of the complexity associated with machine learning for the benefit of readers who are new to the topic.

a 3D-rendered grey box.
Machine learning is not magic — what needs to change in computing education to make sure learners don’t see ML systems as magic boxes?

Our seminars on teaching AI, ML, and data science

We’re currently partnering with The Alan Turing Institute to host a series of free research seminars about how to teach artificial intelligence (AI) and data science to young people.

The seminar with Matti and Henriikka, the third one of the series, was very well attended. Over 100 participants from San Francisco to Rajasthan, including teachers, researchers, and industry professionals, contributed to a lively and thought-provoking discussion.

Representing a large interdisciplinary team of researchers, Matti and Henriikka have been working on how to teach AI and machine learning for more than three years, which in this new area of study is a long time. So far, the Finnish team has written over a dozen academic papers based on their pilot studies with kindergarten-, primary-, and secondary-aged learners.

Current teaching in schools: classical rule-driven programming

Matti and Henriikka started by giving an overview of classical programming and how it is currently taught in schools. Classical programming can be described as rule-driven. Example features of classical computer programs and programming languages are:

  • A classical language has a strict syntax, and a limited set of commands that can only be used in a predetermined way
  • A classical language is deterministic, meaning we can guarantee what will happen when each line of code is run
  • A classical program is executed in a strict, step-wise order following a known set of rules

When we teach this type of programming, we show learners how to use a deductive problem solving approach or workflow: defining the task, designing a possible solution, and implementing the solution by writing a stepwise program that is then run on a computer. We encourage learners to avoid using trial and error to write programs. Instead, as they develop and test a program, we ask them to trace it line by line in order to predict what will happen when each line is run (glass-box testing).

A list of features of rule-driven computer programming, also included in the text.
The features of classical (rule-driven) programming approaches as taught in computer science education (CSE) (Tedre & Vartiainen, 2021).

Classical programming underpins the current view of computational thinking (CT). Our speakers called this version of CT ‘CT 1.0’. So what’s the alternative Matti and Henriikka presented, and how does it affect what computational thinking is or may become?

Machine learning (data-driven) models and new computational thinking (CT 2.0) 

Rule-based programming languages are not being eradicated. Instead, software systems are being augmented through the addition of machine learning (data-driven) elements. Many of today’s successful software products, such as search engines, image classifiers, and speech recognition programs, combine rule-driven software and data-driven models. However, the workflows for these two approaches to solving problems through computing are very different.

A table comparing problem solving workflows using computational thinking 1.0 versus computational thinking 2.0, info also included in the text.
Problem solving is very different depending on whether a rule-driven computational thinking (CT 1.0) approach or a data-driven computational thinking (CT 2.0) approach is used (Tedre & Vartiainen,2021).

Significantly, while in rule-based programming (and CT 1.0), the focus is on solving problems by creating algorithms, in data-driven approaches, the problem solving workflow is all about the data. To highlight the profound impact this shift in focus has on teaching and learning computing, Matti introduced us to a new version of computational thinking for machine learning, CT 2.0, which is detailed in a forthcoming research paper.

Because of the focus on data rather than algorithms, developing a machine learning model is not at all like developing a classical rule-driven program. In classical programming, programs can be traced, and we can predict what will happen when they run. But in data-driven development, there is no flow of rules, and no absolutely right or wrong answer.

A table comparing conceptual differences between computational thinking 1.0 versus computational thinking 2.0, info also included in the text.
There are major differences between rule-driven computational thinking (CT 1.0) and data-driven computational thinking (CT 2.0), which impact what computing education needs to take into account (Tedre & Vartiainen,2021).

Machine learning models are created iteratively using training data and must be cross-validated with test data. A tiny change in the data provided can make a model useless. We rarely know exactly why the output of an ML model is as it is, and we cannot explain each individual decision that the model might have made. When evaluating a machine learning system, we can only say how well it works based on statistical confidence and efficiency. 

Machine learning education must cover ethical and societal implications 

The ethical and societal implications of computer science have always been important for students to understand. But machine learning models open up a whole new set of topics for teachers and students to consider, because of these models’ reliance on large datasets, the difficulty of explaining their decisions, and their usefulness for automating very complex processes. This includes privacy, surveillance, diversity, bias, job losses, misinformation, accountability, democracy, and veracity, to name but a few.

I see the shift in problem solving approach as a chance to strengthen the teaching of computing in general, because it opens up opportunities to teach about systems, uncertainty, data, and society.

Jane Waite

Teaching machine learning: the challenges of magic boxes and new mental models

For teaching classical rule-driven programming, much time and effort has been put into researching learners’ understanding of what a program will do when it is run. This kind of understanding is called a learner’s mental model or notional machine. An approach teachers often use to help students develop a useful mental model of a program is to hide the detail of how the program works and only gradually reveal its complexity. This approach is described with the metaphor of hiding the detail of elements of the program in a box. 

Data-driven models in machine learning systems are highly complex and make little sense to humans. Therefore, they may appear like magic boxes to students. This view needs to be banished. Machine learning is not magic. We have just not figured out yet how to explain the detail of data-driven models in a way that allows learners to form useful mental models.

An example of a representation of a machine learning model in TensorFlow, an online machine learning tool (Tedre & Vartiainen,2021).

Some existing ML tools aim to help learners form mental models of ML, for example through visual representations of how a neural network works (see Figure 2). But these explanations are still very complex. Clearly, we need to find new ways to help learners of all ages form useful mental models of machine learning, so that teachers can explain to them how machine learning systems work and banish the view that machine learning is magic.

Some tools and teaching approaches for ML education

Matti and Henriikka’s team piloted different tools and pedagogical approaches with different age groups of learners. In terms of tools, since large amounts of data are needed for machine learning projects, our presenters suggested that tools that enable lots of data to be easily collected are ideal for teaching activities. Media-rich education tools provide an opportunity to capture still images, movements, sounds, or sense other inputs and then use these as data in machine learning teaching activities. For example, to create a machine learning–based rock-paper-scissors game, students can take photographs of their hands to train a machine learning model using Google Teachable Machine.

Photos of hands are used to train a machine learning model as part of a project to create a rock-paper-scissors game.
Photos of hands are used to train a Teachable Machine machine learning model as part of a project to create a rock-paper-scissors game (Tedre & Vartiainen, 2021).

Similar to tools that teach classic programming to novice students (e.g. Scratch), some of the new classroom tools for teaching machine learning have a drag-and-drop interface (e.g. Cognimates). Using such tools means that in lessons, there can be less focus on one of the more complex aspects of learning to program, learning programming language syntax. However, not all machine learning education products include drag-and-drop interaction, some instead have their own complex languages (e.g. Wolfram Programming Lab), which are less attractive to teachers and learners. In their pilot studies, the Finnish team found that drag-and-drop machine learning tools appeared to work well with students of all ages.

The different pedagogical approaches the Finnish research team used in their pilot studies included an exploratory approach with preschool children, who investigated machine learning recognition of happy or sad faces; and a project-based approach with older students, who co-created machine learning apps with web-based tools such as Teachable Machine and Learn Machine Learning (built by the research team), supported by machine learning experts.

Example of a middle school (age 8 to 11) student’s pen and paper design for a machine learning app that recognises different instruments and chords.
Example of a middle school (age 8 to 11) student’s design for a machine learning app that recognises different instruments and chords (Tedre & Vartiainen, 2021).

What impact these pedagogies have on students’ long-term mental models about machine learning has yet to be researched. If you want to find out more about the classroom pilot studies, the academic paper is a very accessible read.

My take-aways: new opportunities, new research questions

We all learned a tremendous amount from Matti and Henriikka and their perspectives on this important topic. Our seminar participants asked them many questions about the pedagogies and practicalities of teaching machine learning in class, and raised concerns about squeezing more into an already packed computing curriculum.

For me, the most significant take-away from the seminar was the need to shift focus from algorithms to data and from CT 1.0 to CT 2.0. Learning how to best teach classical rule-driven programming has been a long journey that we have not yet completed. We are forming an understanding of what concepts learners need to be taught, the progression of learning, key mental models, pedagogical options, and assessment approaches. For teaching data-driven development, we need to do the same.  

The question of how we make sure teachers have the necessary understanding is key.

Jane Waite

I see the shift in problem solving approach as a chance to strengthen the teaching of computing in general, because it opens up opportunities to teach about systems, uncertainty, data, and society. I think it will help us raise awareness about design, context, creativity, and student agency. But I worry about how we will introduce this shift. In my view, there is a considerable risk that we will be sucked into open-ended, project-based learning, with busy and fun but shallow learning experiences that result in restricted conceptual development for students.

I also worry about how we can best help teachers build up the knowledge and experience to support their students. In the Q&A after the seminar, I asked Matti and Henriikka about the role of their team’s machine learning experts in their pilot studies. It seemed to me that without them, the pilot lessons would not have worked, as the participating teachers and students would not have had the vocabulary to talk about the process and would not have known what was doable given the available time, tools, and student knowledge.

The question of how we make sure teachers have the necessary understanding is key. Many existing professional development resources for teachers wanting to learn about ML seem to imply that teachers will all need a PhD in statistics and neural network optimisation to engage with machine learning education. This is misleading. But teachers do need to understand the machine learning concepts that their students need to learn about, and I think we don’t yet know exactly what these concepts are. 

In summary, clearly more research is needed. There are fundamental questions still to be answered about what, when, and how we teach data-driven approaches to software systems development and how this impacts what we teach about classical, rule-based programming. But to me, that is exciting, and I am very much looking forward to the journey ahead.

Join our next free seminar

To find out what others recommend about teaching AI and ML, catch up on last month’s seminar with Professor Carsten Schulte and colleagues on centring data instead of code in the teaching of AI.

We have another four seminars in our monthly series on AI, machine learning, and data science education. Find out more about them on this page, and catch up on past seminar blogs and recordings here.

At our next seminar on Tuesday 7 December at 17:00–18:30 GMT, we will welcome Professor Rose Luckin from University College London. She will be presenting on what it is about AI that makes it useful for teachers and learners.

We look forward to meeting you there!

PS You can build your understanding of machine learning by joining our latest free online course, where you’ll learn foundational concepts and train your own ML model!

The post The machine learning effect: Magic boxes and computational thinking 2.0 appeared first on Raspberry Pi.

Announcing Fully Managed RStudio on Amazon SageMaker for Data Scientists

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/announcing-fully-managed-rstudio-on-amazon-sagemaker-for-data-scientists/

Two years ago, we introduced Amazon SageMaker Studio, the industry’s first fully integrated development environment (IDE) for machine learning (ML). Amazon SageMaker Studio provides a single, web-based visual interface where you can perform all ML development steps, improving data science team productivity by up to 10 times

Many data scientists love the R project, an open-source ecosystem with more than 18,000 packages that is not just a programming language but is also an interactive environment for doing data science. RStudio is one of the most popular IDE among R developers for ML and data science projects. RStudio provides open-source tools for R and enterprise-ready professional software for data science teams to develop and share their work in the organization. But, building, securing, scaling and maintaining RStudio yourself is tedious and cumbersome.

Today, in collaboration with RStudio PBC, we are excited to announce the general availability of RStudio on Amazon SageMaker, the industry’s first fully managed RStudio Workbench IDE in the cloud. You can now bring your current RStudio license to easily migrate your self-managed RStudio environments to Amazon SageMaker in just a few simple steps. If you’d like to read more about this exciting collaboration, check out this blog from RStudio PBC.

With RStudio on Amazon SageMaker, administrators can have a simple experience to migrate their RStudio environments to integrate into Amazon SageMaker and bring existing RStudio licenses to manage through AWS License Manager. They can onboard both R and Python developers to the same Amazon SageMaker domain using AWS Single Sign-On (SSO) or AWS Identity and Access Management (IAM) and take it as a centralized place to configure both RStudio and Amzon SageMaker Studio.

So, data scientists have a freedom of choice between programming languages and coding interfaces to switch between RStudio and Amazon SageMaker Studio notebooks. All of their work, including code, datasets, repositories, and other artifacts are synchronized between the two environments through the underlying Amazon EFS storage.

Getting Started with RStudio on SageMaker
You now can launch the familiar RStudio Workbench with a simple click from Amazon SageMaker. Before getting started, your administrator needs to buy an appropriate license from RStudio PBC for end-users, set up your granted licenses in AWS License Manager, and create an Amazon SageMaker domain and user profile to launch RStudio on Amazon SageMaker. To learn all the administrator jobs, including managing licenses and monitoring usages, see a blog post of the setting up process, or Manage RStudio on Amazon SageMaker in the AWS documentation.

Once the required setup process is completed, you can open the RStudio Workbench from the new Launch app drop-down list in the created user list and select RStudio.

You will immediately see the RStudio Workbench home page and a list of sessions, projects, and published content on the home page. To create a new session, select the New Session button on the page, select a desired instance in the Instance Type dropdown list, and choose Start Session.

When you choose a compute instance type for a lightweight analysis that can be powered by two vCPU and four GiB memory, you can use a default ml.t3.medium instance. For a complex and large-scale ML modeling, you can choose a large instance with desired compute and memory from a wide array of ML instances available on Amazon SageMaker.

In a few minutes, your session will be ready for development in RStudio Workbench. When you launch your RStudio session, the Base R image serves as the basis of your instance. This Docker image includes R v4.0, AWS tools such as awscli, sagemaker, boto3 Python packages, and reticulate package for the interoperability between Python and R.

Managing R Packages and Publishing your Analysis
Along with the RStudio Workbench, RStudio Connect and RStudio Package Manager are the most used products of RStudio.

RStudio Connect is designed to allow data scientists to publish insights and dashboard and web applications from RStudio Workbench easily. RStudio Package Manager centrally manages the package repository for your organization so that data scientists can securely install packages faster while ensuring project reproducibility and repeatability.

Your administrator, for example, can create a repository and subscribe it to the built-in source named cran in RStudio Package Manager.

$ rspm sync --wait # Initiate a sync
$ rspm create repo --name=prod-cran --description='Access CRAN packages' # Create a repository:
$ rspm subscribe --repo=prod-cran --source=cran # Subscribe the repository to the cran source

When these steps are completed, you can use the prod-cran repository in the web interface of RStudio Package Manager.

Now, you can configure this repository to install and manage your packages in RStudio Workbench. You can also configure RStudio Connect to publish insights, dashboard and web applications from RStudio Workbench via RStudio Connect so that your collaborators can easily consume your work.

For example, you run the analysis inline to create an R Markdown that can be published to your collaborators. You can preview the slides while writing codes with the Preview button and publish it with the Publish icon in your RStudio session.

You can also publish Shiny application easy to create interactive web interfaces, or Python-based content such as Streamlit to the RStudio Connect instance.

To learn more, see Host RStudio Connect and Package Manager for ML development in RStudio on Amazon SageMaker written by my colleagues, Michael Hsieh, Chayan Panda, and Farooq Sabir on the AWS Machine Learning Blog.

Integrating training jobs with Amazon SageMaker
One of the benefits of using RStudio on Amazon SageMaker is the integration of Amazon SageMaker features. Your RStudio and Jupyter Notebook instances of Amazon SageMaker allow you to share the same Amazon EFS file system. You can import R codes written in Jupyter Notebook or use the same files in both Jupyter Notebook and RStudio without having to move your files between the two.

For example, you can run an R sample code including importing libraries, creating an Amazon SageMaker session, getting the IAM role, and importing and visualizing sample data. And then, it stores data on the S3 bucket, and triggers a training task with an XGBoost model by specifying the training container and defining an Amazon SageMaker Estimator. To learn more, see R sample codes in Amazon SageMaker.

# Import reticulate, readr and sagemaker libraries
library(reticulate)
library(readr)
sagemaker <- import('sagemaker')

# Create a sagemaker session
session <- sagemaker$Session()

# Get execution role
role_arn <- sagemaker$get_execution_role()

# Read a csv file from UCI public repository
data_file <- 'http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'

# Copy data to a dataframe, rename columns, and show dataframe head
data_csv <- read_csv(file = data_file, col_names = FALSE, col_types = cols())
names(data_csv) <- c('sex', 'length', 'diameter', 'height', 'whole_weight', 'shucked_weight', 'viscera_weight', 'shell_weight', 'rings')
head(data_csv)

# Visualize data have height equal to 0
library(ggplot2)
options(repr.plot.width = 5, repr.plot.height = 4) 
ggplot(abalone, aes(x = height, y = rings, color = sex, alpha=0.5)) + geom_point() + geom_jitter()

# Upload data to Amazon S3 bucket
s3_train <- session$upload_data(path = data_csv,
                                bucket = my_s3_bucket, 
                                key_prefix = 'r_hello_world_demo/data')
s3_path = paste('s3://',bucket,'/r_hello_world_demo/data/abalone.csv',sep = '')

# Train a XGBoost model, specify the training containers, and define an Amazon SageMaker Estimator
container <- sagemaker$image_uris$retrieve(framework='xgboost', 
                                           region= session$boto_region_name, 
										   version='latest')							
estimator <- sagemaker$estimator$Estimator(image_uri = container,
                                           role = role_arn,
                                           train_instance_count = 1L,
                                           train_instance_type = 'ml.m5.4xlarge',
                                           train_volume_size = 30L,
                                           train_max_run = 3600L,
                                           input_mode = 'File',
                                           output_path = s3_path)

Now Available
RStudio on Amazon SageMaker is available in all AWS Regions where both Amazon SageMaker Studio and AWS License Manager are available. You can bring your own license of RStudio on Amazon SageMaker and pay for the underlying compute and storage resources within Amazon SageMaker or other AWS services, based on your usage.

To get started with RStudio on Amazon SageMaker, you can use AWS Free Tier. You can use 250 hours of ml.t3.medium instance on Amazon SageMaker Studio per month for the first two months. To learn more, see Amazon SageMaker Pricing page.

Give it a try, and please send us feedback either in the AWS forum for Amazon SageMaker or through your usual AWS support contacts.

Channy

Detect Python and Java code security vulnerabilities with Amazon CodeGuru Reviewer

Post Syndicated from Haider Naqvi original https://aws.amazon.com/blogs/devops/detect-python-and-java-code-security-vulnerabilities-with-codeguru-reviewer/

with Aaron Friedman (Principal PM-T for xGuru services)

Amazon CodeGuru is a developer tool that uses machine learning and automated reasoning to catch hard to find defects and security vulnerabilities in application code. The purpose of this blog is to show how new CodeGuru Reviewer features help improve the security posture of your Python applications and highlight some of the specific categories of code vulnerabilities that CodeGuru Reviewer can detect. We will also cover newly expanded security capabilities for Java applications.

Amazon CodeGuru Reviewer can detect code vulnerabilities and provide actionable recommendations across dozens of the most common and impactful categories of code security issues (as classified by industry-recognized standards, Open Web Application Security, OWASP , “top ten” and Common Weakness Enumeration, CWE. The following are some of the most severe code vulnerabilities that CodeGuru Reviewer can now help you detect and prevent:

  • Injection weaknesses typically appears in data-rich applications. Every year, hundreds of web servers are compromised using SQL Injection. An attacker can use this method to bypass access control and read or modify application data.
  • Path Traversal security issues occur when an application does not properly handle special elements within a provided path name. An attacker can use it to overwrite or delete critical files and expose sensitive data.
  • Null Pointer Dereference issues can occur due to simple programming errors, race conditions, and others. Null Pointer Dereference can cause availability issues in rare conditions. Attackers can use it to read and modify memory.
  • Weak or broken cryptography is a risk that may compromise the confidentiality and integrity of sensitive data.

Security vulnerabilities present in source code can result in application downtime, leaked data, lost revenue, and lost customer trust. Best practices and peer code reviews aren’t sufficient to prevent these issues. You need a systematic way of detecting and preventing vulnerabilities from being deployed to production. CodeGuru Reviewer Security Detectors can provide a scalable approach to DevSecOps, a mechanism that employs automation to address security issues early in the software development lifecycle. Security detectors automate the detection of hard-to-find security vulnerabilities in Java and now Python applications, and provide actionable recommendations to developers.

By baking security mechanisms into each step of the process, DevSecOps enables the development of secure software without sacrificing speed. However, false positive issues raised by Static Application Security Testing (SAST) tools often must be manually triaged effectively and work against this value. CodeGuru uses techniques from automated reasoning, and specifically, precise data-flow analysis, to enhance the precision of code analysis. CodeGuru therefore reports fewer false positives.

Many customers are already embracing open-source code analysis tools in their DevSecOps practices. However, integrating such software into a pipeline requires a heavy up front lift, ongoing maintenance, and patience to configure. Furthering the utility of new security detectors in Amazon CodeGuru Reviewer, this update adds integrations with Bandit  and Infer, two widely-adopted open-source code analysis tools. In Java code bases, CodeGuru Reviewer now provide recommendations from Infer that detect null pointer dereferences, thread safety violations and improper use of synchronization locks. And in Python code, the service detects instances of SQL injection, path traversal attacks, weak cryptography, or the use of compromised libraries. Security issues found and recommendations generated by these tools are shown in the console, in pull requests comments, or through CI/CD integrations, alongside code recommendations generated by CodeGuru’s code quality and security detectors. Let’s dive deep and review some examples of code vulnerabilities that CodeGuru Reviewer can help detect.

Injection (Python)

Amazon CodeGuru Reviewer can help detect the most common injection vulnerabilities including SQL, XML, OS command, and LDAP types. For example, SQL injection occurs when SQL queries are constructed through string formatting. An attacker could manipulate the program inputs to modify the intent of the SQL query. The following python statement executes a SQL query constructed through string formatting and can be an attack vector:

import sqlite3
from flask import request

def removing_product():
    productId = request.args.get('productId')
    str = 'DELETE FROM products WHERE productID = ' + productId
    return str

def sql_injection():
    connection = psycopg2.connect("dbname=test user=postgres")
    cur = db.cursor()
    query = removing_product()
    cur.execute(query)

CodeGuru will flag a potential SQL injection using Bandit security detector, will make the following recommendation:

>> We detected a SQL command that might use unsanitized input. 
This can result in an SQL injection. To increase the security of your code, 
sanitize inputs before using them to form a query string.

To avoid this, the user should correct the code to use a parameter sanitization mechanism that guards against SQL injection as done below:

import sqlite3
from flask import request

def removing_product():
    productId = sanitize_input(request.args.get('productId'))
    str = 'DELETE FROM products WHERE productID = ' + productId
    return str

def sql_injection():
    connection = psycopg2.connect("dbname=test user=postgres")
    cur = db.cursor()
    query = removing_product()
    cur.execute(query)

In the above corrected code, user supplied sanitize_input method will take care of sanitizing user inputs.

Path Traversal (Python)

When applications use user input to create a path to read or write local files, an attacker can manipulate the input to overwrite or delete critical files or expose sensitive data. These critical files might include source code, sensitive, or application configuration information.

@app.route('/someurl')
def path_traversal():
    file_name = request.args["file"]
    f = open("./{}".format(file_name))
    f.close()

In above example, file name is directly passed to an open API without checking or filtering its content.

CodeGuru’s recommendation:

>> Potentially untrusted inputs are used to access a file path.
To protect your code from a path traversal attack, verify that your inputs are
sanitized.

In response, the developer should sanitize data before using it for creating/opening file.

@app.route('/someurl')
def path_traversal():
file_name = sanitize_data(request.args["file"])
f = open("./{}".format(file_name))
f.close()

In this modified code, input data file_name has been clean/filtered by sanitized_data api call.

Null Pointer Dereference (Java)

Infer detectors are a new addition that complement CodeGuru Reviewer native Java Security Detectors. Infer detectors, based on the Facebook Infer static analyzer, include rules to detect null pointer dereferences, thread safety violations, and improper use of synchronization locks. In particular, the null-pointer-dereference rule detects paths in the code that lead to null pointer exceptions in Java. Null pointer dereference is a very common pitfall in Java and is considered one of 25 most dangerous software weaknesses.

The Infer null-pointer-dereference rule guards against unexpected null pointer exceptions by detecting locations in the code where pointers that could be null are dereferenced. CodeGuru augments the Infer analyzer with knowledge about the AWS APIs, which allows the security detectors to catch potential null pointer exceptions when using AWS APIs.

For example, the AWS DynamoDBMapper class provides a convenient abstraction for mapping Amazon DynamoDB tables to Java objects. However, developers should be aware that DynamoDB Mapper load operations can return a null pointer if the object was not found in the table. The following code snippet updates a record in a catalog using a DynamoDB Mapper:

DynamoDBMapper mapper = new DynamoDBMapper(client);
// Retrieve the item.
CatalogItem itemRetrieved = mapper.load(
CatalogItem.class, 601);
// Update the item.
itemRetrieved.setISBN("622-2222222222");
itemRetrieved.setBookAuthors(
new HashSet<String>(Arrays.asList(
"Author1", "Author3")));
mapper.save(itemRetrieved);

CodeGuru will protect against a potential null dereference by making the following recommendation:

object `itemRetrieved` last assigned on line 88 could be null
and is dereferenced at line 90.

In response, the developer should add a null check to prevent the null pointer dereference from occurring.

DynamoDBMapper mapper = new DynamoDBMapper(client);
// Retrieve the item.
CatalogItem itemRetrieved = mapper.load(CatalogItem.class, 601);
// Update the item.
if (itemRetrieved != null) {
itemRetrieved.setISBN("622-2222222222");
itemRetrieved.setBookAuthors(
new HashSet<String>(Arrays.asList(
"Author1","Author3")));
mapper.save(itemRetrieved);
} else {
throw new CatalogItemNotFoundException();
}

Weak or broken cryptography  (Python)

Python security detectors support popular frameworks along with built-in APIs such as cryptography, pycryptodome etc. to identify ciphers related vulnerability. As suggested in CWE-327 , the use of a non-standard/inadequate key length algorithm is dangerous because attacker may be able to break the algorithm and compromise whatever data has been protected. In this example, `PBKDF2` is used with a weak algorithm and may lead to cryptographic vulnerabilities.
from Crypto.Protocol.KDF import PBKDF2
from Crypto.Hash import SHA1

def risky_crypto_algorithm(password):
salt = get_random_bytes(16)
keys = PBKDF2(password, salt, 64, count=1000000,
hmac_hash_module=SHA1)

SHA1 is used to create a PBKDF2, however, it is insecure hence not recommended for PBKDF2. CodeGuru’s identifies the issue and makes the following recommendation:

>> The `PBKDF2` function is using a weak algorithm which might
lead to cryptographic vulnerabilities. We recommend that you use the
`SHA224`, `SHA256`, `SHA384`,`SHA512/224`, `SHA512/256`, `BLAKE2s`,
`BLAKE2b`, `SHAKE128`, `SHAKE256` algorithms.

In response, the developer should use the correct SHA algorithm to protect against potential cipher attacks.

from Crypto.Protocol.KDF import PBKDF2
from Crypto.Hash import SHA512

def risky_crypto_algorithm(password):
salt = get_random_bytes(16)
keys = PBKDF2(password, salt, 64, count=1000000,
hmac_hash_module=SHA512)

This modified example uses high strength SHA512 algorithm.

Conclusion

This post reviewed Amazon CodeGuru Reviewer security detectors and how they automatically check your code for vulnerabilities and provide actionable recommendations in code reviews. We covered new capabilities for detecting issues in Python applications, as well as additional security features from Bandit and Infer. Together CodeGuru Reviewer’s security features provide a scalable approach for customers embracing DevSecOps, a mechanism that requires automation to address security issues earlier in the software development lifecycle. CodeGuru automates detection and helps prevent hard-to-find security vulnerabilities, accelerating DevSecOps processes for application development workflow.

You can get started from the CodeGuru console by running a full repository scan or integrating CodeGuru Reviewer with your supported CI/CD pipeline. Code analysis from Infer and Bandit is included as part of the standard CodeGuru Reviewer service.

For more information about automating code reviews and application profiling with Amazon CodeGuru, check out the AWS DevOps Blog. For more details on how to get started, visit the documentation.