Tag Archives: bsa

Security updates for Wednesday

Post Syndicated from ris original https://lwn.net/Articles/748741/rss

Security updates have been issued by Arch Linux (python-django and python2-django), Debian (leptonlib), Fedora (bugzilla, cryptopp, electrum, firefox, freexl, glibc, jhead, libcdio, libsamplerate, libXcursor, libXfont, libXfont2, mingw-wavpack, nx-libs, php, python-crypto, quagga, sharutils, unzip, x2goserver, and xen), Gentoo (exim), openSUSE (cups, go1.8, ImageMagick, jgraphx, leptonica, openexr, tor, and wavpack), Red Hat (389-ds-base, java-1.7.1-ibm, kernel, kernel-rt, libreoffice, and php), SUSE (java-1_7_1-ibm), and Ubuntu (python-django).

Security updates for Wednesday

Post Syndicated from ris original https://lwn.net/Articles/748276/rss

Security updates have been issued by Arch Linux (mbedtls), CentOS (gcab and java-1.7.0-openjdk), Debian (drupal7, lucene-solr, wavpack, and xmltooling), Fedora (dnsmasq, gcab, gimp, golang, knot-resolver, ldns, libsamplerate, mingw-OpenEXR, mingw-poppler, python-crypto, qt5-qtwebengine, sblim-sfcb, systemd, unbound, and wavpack), Mageia (ioquake3, TiMidity++, tomcat, tomcat-native, and wireshark), openSUSE (systemd and zziplib), Red Hat (erlang and openstack-nova and python-novaclient), and SUSE (kernel).

The "Extended Random" Feature in the BSAFE Crypto Library

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2017/12/the_extended_ra.html

Matthew Green wrote a fascinating blog post about the NSA’s efforts to increase the amount of random data exposed in the TLS protocol, and how it interacts with the NSA’s backdoor into the Dual_EC_DRBG random number generator to weaken TLS.

Security updates for Tuesday

Post Syndicated from ris original https://lwn.net/Articles/738378/rss

Security updates have been issued by Debian (apr, apr-util, chromium-browser, libpam4j, and mupdf), Fedora (community-mysql and modulemd), Mageia (git), openSUSE (libsass, libwpd, qemu, sssd, and SuSEfirewall2), Red Hat (Red Hat JBoss Enterprise Application Platform and Red Hat JBoss Enterprise Application Platform 7.0), SUSE (qemu), and Ubuntu (openssl).

Create Multiple Builds from the Same Source Using Different AWS CodeBuild Build Specification Files

Post Syndicated from Prakash Palanisamy original https://aws.amazon.com/blogs/devops/create-multiple-builds-from-the-same-source-using-different-aws-codebuild-build-specification-files/

In June 2017, AWS CodeBuild announced that you can specify an alternate build specification file name or location in an AWS CodeBuild project.

In this post, I’ll show you how to use different build specification files in the same repository to create different builds. You’ll find the source code for this post in our GitHub repo.

Requirements

The AWS CLI must be installed and configured.
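
To sanity-check that credentials are in place before creating any projects, here is a minimal sketch using boto3, the AWS SDK for Python, which reads the same credentials as the CLI (using boto3 here is this sketch's assumption, not a requirement of the post):

import boto3

# Prints the account ID that the configured credentials belong to
print(boto3.client("sts").get_caller_identity()["Account"])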

Solution Overview

I have created a C program (cbsamplelib.c) that will be built into a shared library, and a utility program (cbsampleutil.c) that uses that library. I’ll use a Makefile to compile these files.

I need to package this sample application as RPM and DEB packages so that end users can deploy it easily. I have created a build specification file for RPM. It uses make to compile the code and then creates the RPM package from the RPM specification file (cbsample.rpmspec) configured in the build specification. Similarly, I have created a build specification file for DEB. It creates the DEB package based on the control specification file (cbsample.control) configured in that build specification.

RPM Build Project:

The following build specification file (buildspec-rpm.yml) uses build specification version 0.2. As described in the documentation, this version has different syntax for environment variables. This build specification includes multiple phases:

  • As part of the install phase, the required packages are installed using yum.
  • During the pre_build phase, the required directories are created and the required files, including the RPM build specification file, are copied to the appropriate location.
  • During the build phase, the code is compiled, and then the RPM package is created based on the RPM specification.

As defined in the artifacts section, the RPM file will be uploaded as a build artifact.

version: 0.2

env:
  variables:
    build_version: "0.1"

phases:
  install:
    commands:
      - yum install rpm-build make gcc glibc -y
  pre_build:
    commands:
      - curr_working_dir=`pwd`
      - mkdir -p ./{RPMS,SRPMS,BUILD,SOURCES,SPECS,tmp}
      - filename="cbsample-$build_version"
      - echo $filename
      - mkdir -p $filename
      - cp ./*.c ./*.h Makefile $filename
      - tar -zcvf /root/$filename.tar.gz $filename
      - cp /root/$filename.tar.gz ./SOURCES/
      - cp cbsample.rpmspec ./SPECS/
  build:
    commands:
      - echo "Triggering RPM build"
      - rpmbuild --define "_topdir `pwd`" -ba SPECS/cbsample.rpmspec
      - cd $curr_working_dir

artifacts:
  files:
    - RPMS/x86_64/cbsample*.rpm
  discard-paths: yes

Using cb-centos-project.json as a reference, create the input JSON file for the CLI command. This project uses an AWS CodeCommit repository named codebuild-multispec and a file named buildspec-rpm.yml as the build specification file. To create the RPM package, we need to specify a custom image name. I’m using the latest CentOS 7 image available on Docker Hub. I’m using a role named CodeBuildServiceRole. It contains permissions similar to those defined in CodeBuildServiceRole.json. (You need to change the resource fields in the policy, as appropriate.)

{
    "name": "rpm-build-project",
    "description": "Project which will build RPM from the source.",
    "source": {
        "type": "CODECOMMIT",
        "location": "https://git-codecommit.eu-west-1.amazonaws.com/v1/repos/codebuild-multispec",
        "buildspec": "buildspec-rpm.yml"
    },
    "artifacts": {
        "type": "S3",
        "location": "codebuild-demo-artifact-repository"
    },
    "environment": {
        "type": "LINUX_CONTAINER",
        "image": "centos:7",
        "computeType": "BUILD_GENERAL1_SMALL"
    },
    "serviceRole": "arn:aws:iam::012345678912:role/service-role/CodeBuildServiceRole",
    "timeoutInMinutes": 15,
    "encryptionKey": "arn:aws:kms:eu-west-1:012345678912:alias/aws/s3",
    "tags": [
        {
            "key": "Name",
            "value": "RPM Demo Build"
        }
    ]
}

After the cli-input-json file is ready, execute the following command to create the build project.

$ aws codebuild create-project --name CodeBuild-RPM-Demo --cli-input-json file://cb-centos-project.json

{
    "project": {
        "name": "CodeBuild-RPM-Demo", 
        "serviceRole": "arn:aws:iam::012345678912:role/service-role/CodeBuildServiceRole", 
        "tags": [
            {
                "value": "RPM Demo Build", 
                "key": "Name"
            }
        ], 
        "artifacts": {
            "namespaceType": "NONE", 
            "packaging": "NONE", 
            "type": "S3", 
            "location": "codebuild-demo-artifact-repository", 
            "name": "CodeBuild-RPM-Demo"
        }, 
        "lastModified": 1500559811.13, 
        "timeoutInMinutes": 15, 
        "created": 1500559811.13, 
        "environment": {
            "computeType": "BUILD_GENERAL1_SMALL", 
            "privilegedMode": false, 
            "image": "centos:7", 
            "type": "LINUX_CONTAINER", 
            "environmentVariables": []
        }, 
        "source": {
            "buildspec": "buildspec-rpm.yml", 
            "type": "CODECOMMIT", 
            "location": "https://git-codecommit.eu-west-1.amazonaws.com/v1/repos/codebuild-multispec"
        }, 
        "encryptionKey": "arn:aws:kms:eu-west-1:012345678912:alias/aws/s3", 
        "arn": "arn:aws:codebuild:eu-west-1:012345678912:project/CodeBuild-RPM-Demo", 
        "description": "Project which will build RPM from the source."
    }
}

When the project is created, run the following command to start the build. After the build has started, get the build ID. You can use the build ID to get the status of the build.

$ aws codebuild start-build --project-name CodeBuild-RPM-Demo
{
    "build": {
        "buildComplete": false, 
        "initiator": "prakash", 
        "artifacts": {
            "location": "arn:aws:s3:::codebuild-demo-artifact-repository/CodeBuild-RPM-Demo"
        }, 
        "projectName": "CodeBuild-RPM-Demo", 
        "timeoutInMinutes": 15, 
        "buildStatus": "IN_PROGRESS", 
        "environment": {
            "computeType": "BUILD_GENERAL1_SMALL", 
            "privilegedMode": false, 
            "image": "centos:7", 
            "type": "LINUX_CONTAINER", 
            "environmentVariables": []
        }, 
        "source": {
            "buildspec": "buildspec-rpm.yml", 
            "type": "CODECOMMIT", 
            "location": "https://git-codecommit.eu-west-1.amazonaws.com/v1/repos/codebuild-multispec"
        }, 
        "currentPhase": "SUBMITTED", 
        "startTime": 1500560156.761, 
        "id": "CodeBuild-RPM-Demo:57a36755-4d37-4b08-9c11-1468e1682abc", 
        "arn": "arn:aws:codebuild:eu-west-1: 012345678912:build/CodeBuild-RPM-Demo:57a36755-4d37-4b08-9c11-1468e1682abc"
    }
}

$ aws codebuild list-builds-for-project --project-name CodeBuild-RPM-Demo
{
    "ids": [
        "CodeBuild-RPM-Demo:57a36755-4d37-4b08-9c11-1468e1682abc"
    ]
}

$ aws codebuild batch-get-builds --ids CodeBuild-RPM-Demo:57a36755-4d37-4b08-9c11-1468e1682abc
{
    "buildsNotFound": [], 
    "builds": [
        {
            "buildComplete": true, 
            "phases": [
                {
                    "phaseStatus": "SUCCEEDED", 
                    "endTime": 1500560157.164, 
                    "phaseType": "SUBMITTED", 
                    "durationInSeconds": 0, 
                    "startTime": 1500560156.761
                }, 
                {
                    "contexts": [], 
                    "phaseType": "PROVISIONING", 
                    "phaseStatus": "SUCCEEDED", 
                    "durationInSeconds": 24, 
                    "startTime": 1500560157.164, 
                    "endTime": 1500560182.066
                }, 
                {
                    "contexts": [], 
                    "phaseType": "DOWNLOAD_SOURCE", 
                    "phaseStatus": "SUCCEEDED", 
                    "durationInSeconds": 15, 
                    "startTime": 1500560182.066, 
                    "endTime": 1500560197.906
                }, 
                {
                    "contexts": [], 
                    "phaseType": "INSTALL", 
                    "phaseStatus": "SUCCEEDED", 
                    "durationInSeconds": 19, 
                    "startTime": 1500560197.906, 
                    "endTime": 1500560217.515
                }, 
                {
                    "contexts": [], 
                    "phaseType": "PRE_BUILD", 
                    "phaseStatus": "SUCCEEDED", 
                    "durationInSeconds": 0, 
                    "startTime": 1500560217.515, 
                    "endTime": 1500560217.662
                }, 
                {
                    "contexts": [], 
                    "phaseType": "BUILD", 
                    "phaseStatus": "SUCCEEDED", 
                    "durationInSeconds": 0, 
                    "startTime": 1500560217.662, 
                    "endTime": 1500560217.995
                }, 
                {
                    "contexts": [], 
                    "phaseType": "POST_BUILD", 
                    "phaseStatus": "SUCCEEDED", 
                    "durationInSeconds": 0, 
                    "startTime": 1500560217.995, 
                    "endTime": 1500560218.074
                }, 
                {
                    "contexts": [], 
                    "phaseType": "UPLOAD_ARTIFACTS", 
                    "phaseStatus": "SUCCEEDED", 
                    "durationInSeconds": 0, 
                    "startTime": 1500560218.074, 
                    "endTime": 1500560218.542
                }, 
                {
                    "contexts": [], 
                    "phaseType": "FINALIZING", 
                    "phaseStatus": "SUCCEEDED", 
                    "durationInSeconds": 4, 
                    "startTime": 1500560218.542, 
                    "endTime": 1500560223.128
                }, 
                {
                    "phaseType": "COMPLETED", 
                    "startTime": 1500560223.128
                }
            ], 
            "logs": {
                "groupName": "/aws/codebuild/CodeBuild-RPM-Demo", 
                "deepLink": "https://console.aws.amazon.com/cloudwatch/home?region=eu-west-1#logEvent:group=/aws/codebuild/CodeBuild-RPM-Demo;stream=57a36755-4d37-4b08-9c11-1468e1682abc", 
                "streamName": "57a36755-4d37-4b08-9c11-1468e1682abc"
            }, 
            "artifacts": {
                "location": "arn:aws:s3:::codebuild-demo-artifact-repository/CodeBuild-RPM-Demo"
            }, 
            "projectName": "CodeBuild-RPM-Demo", 
            "timeoutInMinutes": 15, 
            "initiator": "prakash", 
            "buildStatus": "SUCCEEDED", 
            "environment": {
                "computeType": "BUILD_GENERAL1_SMALL", 
                "privilegedMode": false, 
                "image": "centos:7", 
                "type": "LINUX_CONTAINER", 
                "environmentVariables": []
            }, 
            "source": {
                "buildspec": "buildspec-rpm.yml", 
                "type": "CODECOMMIT", 
                "location": "https://git-codecommit.eu-west-1.amazonaws.com/v1/repos/codebuild-multispec"
            }, 
            "currentPhase": "COMPLETED", 
            "startTime": 1500560156.761, 
            "endTime": 1500560223.128, 
            "id": "CodeBuild-RPM-Demo:57a36755-4d37-4b08-9c11-1468e1682abc", 
            "arn": "arn:aws:codebuild:eu-west-1:012345678912:build/CodeBuild-RPM-Demo:57a36755-4d37-4b08-9c11-1468e1682abc"
        }
    ]
}

DEB Build Project:

In this project, we will use the build specification file named buildspec-deb.yml. Like the RPM build project, this specification includes multiple phases. Here I use a Debian control file to create the package in DEB format. After a successful build, the DEB package will be uploaded as a build artifact.

version: 0.2

env:
  variables:
    build_version: "0.1"

phases:
  install:
    commands:
      - apt-get install gcc make -y
  pre_build:
    commands:
      - mkdir -p ./cbsample-$build_version/DEBIAN
      - mkdir -p ./cbsample-$build_version/usr/lib
      - mkdir -p ./cbsample-$build_version/usr/include
      - mkdir -p ./cbsample-$build_version/usr/bin
      - cp -f cbsample.control ./cbsample-$build_version/DEBIAN/control
  build:
    commands:
      - echo "Building the application"
      - make
      - cp libcbsamplelib.so ./cbsample-$build_version/usr/lib
      - cp cbsamplelib.h ./cbsample-$build_version/usr/include
      - cp cbsampleutil ./cbsample-$build_version/usr/bin
      - chmod +x ./cbsample-$build_version/usr/bin/cbsampleutil
      - dpkg-deb --build ./cbsample-$build_version

artifacts:
  files:
    - cbsample-*.deb

Here we use cb-ubuntu-project.json as a reference to create the CLI input JSON file. This project uses the same AWS CodeCommit repository (codebuild-multispec) but a different buildspec file in the same repository (buildspec-deb.yml). We use the default CodeBuild image to create the DEB package. We use the same IAM role (CodeBuildServiceRole).

{
    "name": "deb-build-project",
    "description": "Project which will build DEB from the source.",
    "source": {
        "type": "CODECOMMIT",
        "location": "https://git-codecommit.eu-west-1.amazonaws.com/v1/repos/codebuild-multispec",
        "buildspec": "buildspec-deb.yml"
    },
    "artifacts": {
        "type": "S3",
        "location": "codebuild-demo-artifact-repository"
    },
    "environment": {
        "type": "LINUX_CONTAINER",
        "image": "aws/codebuild/ubuntu-base:14.04",
        "computeType": "BUILD_GENERAL1_SMALL"
    },
    "serviceRole": "arn:aws:iam::012345678912:role/service-role/CodeBuildServiceRole",
    "timeoutInMinutes": 15,
    "encryptionKey": "arn:aws:kms:eu-west-1:012345678912:alias/aws/s3",
    "tags": [
        {
            "key": "Name",
            "value": "Debian Demo Build"
        }
    ]
}

Using the CLI input JSON file, create the project, start the build, and check the status of the project.

$ aws codebuild create-project --name CodeBuild-DEB-Demo --cli-input-json file://cb-ubuntu-project.json

$ aws codebuild start-build --project-name CodeBuild-DEB-Demo

$ aws codebuild list-builds-for-project --project-name CodeBuild-DEB-Demo

$ aws codebuild batch-get-builds --ids CodeBuild-DEB-Demo:e535c4b0-7067-4fbe-8060-9bb9de203789

After successful completion of the RPM and DEB builds, check the S3 bucket configured in the artifacts section for the build packages. Each build project creates a directory named after the build project and copies the artifacts into it.

$ aws s3 ls s3://codebuild-demo-artifact-repository/CodeBuild-RPM-Demo/
2017-07-20 16:16:59       8108 cbsample-0.1-1.el7.centos.x86_64.rpm

$ aws s3 ls s3://codebuild-demo-artifact-repository/CodeBuild-DEB-Demo/
2017-07-20 16:37:22       5420 cbsample-0.1.deb

Override Buildspec During Build Start:

It’s also possible to override the build specification file of an existing project when starting a build. If we want to create only the libs RPM package instead of the whole RPM, we use the build specification file named buildspec-libs-rpm.yml. This build specification file is similar to the earlier RPM build; the only difference is that it uses a different RPM specification file to create the libs RPM.

version: 0.2

env:
  variables:
    build_version: "0.1"

phases:
  install:
    commands:
      - yum install rpm-build make gcc glibc -y
  pre_build:
    commands:
      - curr_working_dir=`pwd`
      - mkdir -p ./{RPMS,SRPMS,BUILD,SOURCES,SPECS,tmp}
      - filename="cbsample-libs-$build_version"
      - echo $filename
      - mkdir -p $filename
      - cp ./*.c ./*.h Makefile $filename
      - tar -zcvf /root/$filename.tar.gz $filename
      - cp /root/$filename.tar.gz ./SOURCES/
      - cp cbsample-libs.rpmspec ./SPECS/
  build:
    commands:
      - echo "Triggering RPM build"
      - rpmbuild --define "_topdir `pwd`" -ba SPECS/cbsample-libs.rpmspec
      - cd $curr_working_dir

artifacts:
  files:
    - RPMS/x86_64/cbsample-libs*.rpm
  discard-paths: yes

Using the same RPM build project that we created earlier, start a new build and set the value of the --buildspec-override parameter to buildspec-libs-rpm.yml.

$ aws codebuild start-build --project-name CodeBuild-RPM-Demo --buildspec-override buildspec-libs-rpm.yml
{
    "build": {
        "buildComplete": false, 
        "initiator": "prakash", 
        "artifacts": {
            "location": "arn:aws:s3:::codebuild-demo-artifact-repository/CodeBuild-RPM-Demo"
        }, 
        "projectName": "CodeBuild-RPM-Demo", 
        "timeoutInMinutes": 15, 
        "buildStatus": "IN_PROGRESS", 
        "environment": {
            "computeType": "BUILD_GENERAL1_SMALL", 
            "privilegedMode": false, 
            "image": "centos:7", 
            "type": "LINUX_CONTAINER", 
            "environmentVariables": []
        }, 
        "source": {
            "buildspec": "buildspec-libs-rpm.yml", 
            "type": "CODECOMMIT", 
            "location": "https://git-codecommit.eu-west-1.amazonaws.com/v1/repos/codebuild-multispec"
        }, 
        "currentPhase": "SUBMITTED", 
        "startTime": 1500562366.239, 
        "id": "CodeBuild-RPM-Demo:82d05f8a-b161-401c-82f0-83cb41eba567", 
        "arn": "arn:aws:codebuild:eu-west-1:012345678912:build/CodeBuild-RPM-Demo:82d05f8a-b161-401c-82f0-83cb41eba567"
    }
}

After the build is completed successfully, check to see if the package appears in the artifact S3 bucket under the CodeBuild-RPM-Demo build project folder.

$ aws s3 ls s3://codebuild-demo-artifact-repository/CodeBuild-RPM-Demo/
2017-07-20 16:16:59       8108 cbsample-0.1-1.el7.centos.x86_64.rpm
2017-07-20 16:53:54       5320 cbsample-libs-0.1-1.el7.centos.x86_64.rpm

Conclusion

In this post, I have shown you how multiple buildspec files in the same source repository can be used to run multiple AWS CodeBuild build projects. I have also shown you how to provide a different buildspec file when starting the build.

For more information about AWS CodeBuild, see the AWS CodeBuild documentation. You can get started with AWS CodeBuild by using this step-by-step guide.


About the author

Prakash Palanisamy is a Solutions Architect for Amazon Web Services. When he is not working on Serverless, DevOps or Alexa, he will be solving problems in Project Euler. He also enjoys watching educational documentaries.

Security updates for Monday

Post Syndicated from ris original https://lwn.net/Articles/722170/rss

Security updates have been issued by Debian (freetype, ghostscript, and roundcube), Fedora (bind99, freetype, ghostscript, icu, thunderbird, and wireshark), Gentoo (chromium, libevent, nss, and oracle-jre-bin), Mageia (audiofile, ettercap, ghostscript, libarchive, and libsamplerate), openSUSE (Chromium and thunderbird), Red Hat (bind and thunderbird), and Scientific Linux (bind and thunderbird).

Announcing C# Support for AWS Lambda

Post Syndicated from Bryan Liston original https://aws.amazon.com/blogs/compute/announcing-c-sharp-support-for-aws-lambda/

Today, we’re excited to announce C# as a supported language for AWS Lambda! Using the new, open source .NET Core 1.0 runtime, you can easily publish C# code to AWS Lambda from a variety of popular .NET tools. .NET developers can now build Lambda functions and serverless applications with the C# language and .NET tools that they know and love. With tooling support in Visual Studio, Yeoman, and the dotnet CLI, you can easily deploy individual Lambda functions or entire serverless applications written in C# to Lambda and Amazon API Gateway.

Lambda is the core of the AWS serverless platform. Originally launched in 2015, Lambda enables customers to deploy Node.js, Python, and Java code to AWS without needing to worry about infrastructure or scaling. This allows developers to focus on the business logic for their application and not spend time maintaining and scaling infrastructure. Until today, .NET developers were not able to take advantage of this model. We’re excited to add C# to the list of supported languages and enable a new category of developers to take advantage of Lambda and API Gateway to create serverless applications.

C# in Lambda

Look at a simple C# Lambda function. If you’ve already used Lambda with Node.js, Python, or Java, this should look familiar:

using System;
using System.IO;
using System.Text;
using Amazon.Lambda.Core;
using Amazon.Lambda.DynamoDBEvents;
using Amazon.Lambda.Serialization.Json;

namespace DynamoDBStreams
{
    public class DdbSample
    {
        private static readonly JsonSerializer _jsonSerializer = new JsonSerializer();

        public void ProcessDynamoEvent(DynamoDBEvent dynamoEvent)
        {
            Console.WriteLine($"Beginning to process {dynamoEvent.Records.Count} records...");

            foreach (var record in dynamoEvent.Records)
            {
                Console.WriteLine($"Event ID: {record.EventID}");
                Console.WriteLine($"Event Name: {record.EventName}");
 
                string streamRecordJson = SerializeObject(record.Dynamodb);
                Console.WriteLine($"DynamoDB Record:");
                Console.WriteLine(streamRecordJson);
            }

            Console.WriteLine("Stream processing complete.");
        }


        private string SerializeObject(object streamRecord)
        {
            using (var ms = new MemoryStream())
            {
                _jsonSerializer.Serialize(streamRecord, ms);
                return Encoding.UTF8.GetString(ms.ToArray());
            }
        }
    }
}

As you can see, this is straightforward code, but there are a few important details to call out. Unlike other languages supported on Lambda, you don’t need to implement a specific interface to mark your code as a Lambda function. Instead, just provide a handler string when uploading your code to tell Lambda where to start execution.
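
For a C# function, the handler string follows the AssemblyName::Namespace.ClassName::MethodName format. As a hedged sketch (not part of the original post), this is how the DdbSample handler above might be registered using boto3; the function name, assembly name, role ARN, and package path are all illustrative placeholders:

import boto3

lambda_client = boto3.client("lambda")

# Handler format for C# functions: <assembly>::<namespace>.<class>::<method>
lambda_client.create_function(
    FunctionName="ProcessDynamoEvent",  # hypothetical function name
    Runtime="dotnetcore1.0",
    Role="arn:aws:iam::012345678912:role/lambda-role",  # placeholder role ARN
    Handler="DynamoDBStreams::DynamoDBStreams.DdbSample::ProcessDynamoEvent",
    Code={"ZipFile": open("deploy-package.zip", "rb").read()},  # hypothetical package
)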

Similar to other languages supported on Lambda, you have a few choices for handling input and return types in your function. The most basic choice is the low-level System.IO.Stream interface. Alternatively, you can apply the default serializer at the assembly or method level of your application, or you can define your own serialization logic by implementing the ILambdaSerializer interface, which is also provided by the Amazon.Lambda.Core library.

Look at the signature of the ProcessDynamoEvent function and notice the DynamoDBEvent type. This comes from the Amazon.Lambda.DynamoDBEvents library, one of several packages that provide classes for AWS event types. Adding a project dependency on the Amazon.Lambda.Core NuGet package gives you access to a static Lambda logger, serialization interfaces, and a C# implementation of the Lambda context object.

For logging, you can use the static Write or WriteLine methods provided by the C# Console class, the Log method on the Amazon.Lambda.Core.LambdaLogger class, or the Logger property in the context object. You can get more information about the C# programming model in the AWS Lambda Developer Guide.

AWS Toolkit for Visual Studio

The AWS Toolkit for Visual Studio supports developing, testing, and deploying .NET Core Lambda functions and serverless applications. The toolkit has two new project templates to help you get started:

  • The AWS Lambda Project template creates a simple project with a single C# Lambda function.
  • The AWS Serverless Application template creates a small AWS serverless application, following the AWS Serverless Application Model (AWS SAM). This template shows how to develop a complete serverless application composed of multiple Lambda functions exposed through an API Gateway REST endpoint. Also, AWS SAM allows you to model the AWS resources that your application uses as part of your project’s template.

(Screenshot: the new AWS Lambda project templates in Visual Studio)

After your code is ready, you can deploy directly from Visual Studio by right-clicking your project and choosing Publish to AWS Lambda… in the Solution Explorer. From there, the deployment wizard guides you through the deployment process.

(Screenshot: the Publish to AWS Lambda deployment wizard)

Cross-platform development using the .NET Core CLI

One of the great features of .NET Core is cross-platform support. With the traditional .NET framework, developers are required to build and run their applications on Windows. However, .NET Core enables you to develop your C# code on any platform of your choice and deploy it to any platform as well.

If you’re not developing on Windows and don’t have access to the AWS Toolkit for Visual Studio, you can still use .NET tools to easily publish your C# Lambda functions and serverless applications to AWS. Even if you are using the AWS Toolkit for Visual Studio, knowing how to use the dotnet CLI can be helpful in automating your build and deployment process.

After you create a .NET Core project (using tools like Yeoman), enable the Lambda tools in the dotnet CLI by adding a tools dependency on the Amazon.Lambda.Tools NuGet package to your new project.

(Screenshot: adding a tools dependency on the Amazon.Lambda.Tools NuGet package)

The Amazon.Lambda.Tools NuGet package adds commands to the new dotnet CLI that allow you to deploy your Lambda functions and serverless applications to AWS, no matter what platform you’re on. Even if you are developing in Visual Studio on Windows, the AWS Lambda tools in the dotnet CLI are helpful for setting up a CI/CD pipeline for your application.

To learn more about the new Lambda commands in the dotnet CLI, type dotnet lambda help in your project directory.

(Screenshot: output of dotnet lambda help)

Summary

We’re excited to open up AWS Lambda for C# applications through the .NET Core runtime. You can find more information on writing C# Lambda functions in the AWS Lambda Developer Guide. Download the AWS Toolkit for Visual Studio to get started or check out the Lambda extensions to the dotnet CLI.

Real-time Clickstream Anomaly Detection with Amazon Kinesis Analytics

Post Syndicated from Chris Marshall original https://blogs.aws.amazon.com/bigdata/post/Tx1XNQPQ2ARGT81/Real-time-Clickstream-Anomaly-Detection-with-Amazon-Kinesis-Analytics

Chris Marshall is a Solutions Architect for Amazon Web Services

Analyzing web log traffic to gain insights that drive business decisions has historically been performed using batch processing.  While effective, this approach results in delayed responses to emerging trends and user activities.  There are solutions for processing data in real time using streaming and micro-batching technologies, but they can be complex to set up and maintain.  Amazon Kinesis Analytics is a managed service that makes it very easy to identify and respond to changes in behavior in real time.

One use case where it’s valuable to have immediate insights is analyzing clickstream data.   In the world of digital advertising, an impression is when an ad is displayed in a browser and a clickthrough represents a user clicking on that ad.  A clickthrough rate (CTR) is one way to monitor the ad’s effectiveness.  CTR is calculated in the form of: CTR = Clicks / Impressions * 100.  Digital marketers are interested in monitoring CTR to know when particular ads perform better than normal, giving them a chance to optimize placements within the ad campaign.  They may also be interested in anomalous low-end CTR that could be a result of a bad image or bidding model.
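
As a quick worked instance of the formula (the numbers are illustrative):

# 3 clicks served against 25 impressions
ctr = 3 / 25 * 100  # = 12.0, i.e. a 12% click through rate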

In this post, I show an analytics pipeline which detects anomalies in real time for a web traffic stream, using the RANDOM_CUT_FOREST function available in Amazon Kinesis Analytics. 

RANDOM_CUT_FOREST 

Amazon Kinesis Analytics includes a powerful set of analytics functions to analyze streams of data.  One such function is RANDOM_CUT_FOREST.  This function detects anomalies by scoring data flowing through a dynamic data stream. This novel approach identifies a normal pattern in streaming data, then compares new data points in reference to it. For more information, see Robust Random Cut Forest Based Anomaly Detection On Streams.
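
The service implements RANDOM_CUT_FOREST for you, so there is nothing to code here. To build intuition for what an anomaly score represents, though, here is a deliberately simplified stand-in, a rolling z-score (not the random cut forest algorithm), that likewise scores each point by how far it sits from the recent pattern:

import statistics

def rolling_score(values, window=12):
    # Score each point against the mean/stdev of the preceding window.
    # Illustration only: NOT RANDOM_CUT_FOREST, which maintains an ensemble
    # of random-cut trees over the stream to score points.
    scores = []
    for i, v in enumerate(values):
        history = values[max(0, i - window):i]
        if len(history) < 2:
            scores.append(0.0)
            continue
        stdev = statistics.pstdev(history) or 1.0
        scores.append(abs(v - statistics.mean(history)) / stdev)
    return scores

# CTRs hovering near 10% score low; the jump toward 50% scores high
print(rolling_score([10, 11, 9, 10, 12, 9, 10, 50, 48, 11]))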

Analytics pipeline components

To demonstrate how the RANDOM_CUT_FOREST function can be used to detect anomalies in real-time click through rates, I will walk you through how to build an analytics pipeline and generate web traffic using a simple Python script.  When your injected anomaly is detected, you get an email or SMS message to alert you to the event. 

This post first walks through the components of the analytics pipeline, then discusses how to build and test it in your own AWS account.  The accompanying AWS CloudFormation script builds the Amazon API Gateway API, Amazon Kinesis streams, AWS Lambda function, Amazon SNS components, and IAM roles required for this walkthrough. Then you manually create the Amazon Kinesis Analytics application that uses the RANDOM_CUT_FOREST function in SQL. 

Web Client

Because Python is a popular scripting language available on many clients, we’ll use it to generate web impressions and click data by making HTTP GET requests with the requests library.  This script mimics beacon traffic generated by web browsers.  The GET request contains a query string value of click or impression to indicate the type of beacon being captured.  The script has a while loop that iterates 2500 times.  For each pass through the loop, an Impression beacon is sent.  Then, the random function is used to potentially send an additional click beacon.  For most passes through the loop, a click request is generated roughly 10% of the time.  However, for some web requests, the CTR jumps to a rate of 50%, representing an anomaly.  These are artificially high CTR rates used for illustration in this post, but the functionality would be the same with small fractional values.

import requests
import random
import sys
import argparse

def getClicked(rate):
	if random.random() <= rate:
		return True
	else:
		return False

def httpGetImpression():
	url = args.target + '?browseraction=Impression' 
	r = requests.get(url)

def httpGetClick():
	url = args.target + '?browseraction=Click' 
	r = requests.get(url)
	sys.stdout.write('+')

parser = argparse.ArgumentParser()
parser.add_argument("target",help=" the http(s) location to send the GET request")
args = parser.parse_args()
i = 0
while (i < 2500):
	httpGetImpression()
	# Iterations 1950-1999 inject the anomaly: clicks at a 50% rate instead of ~10%
	if(i<1950 or i>=2000):
		clicked = getClicked(.1)
		sys.stdout.write('_')
	else:
		clicked = getClicked(.5)
		sys.stdout.write('-')
	if(clicked):
		httpGetClick()
	i = i + 1
	sys.stdout.flush()

Web “front door”

Client web requests are handled by Amazon API Gateway, a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. It gets data into your analytics pipeline by acting as a service proxy to an Amazon Kinesis stream.  API Gateway takes click and impression web traffic, converts the HTTP header and query string data into a JSON message, and places it into your stream.  The CloudFormation script creates an API endpoint with a body mapping template similar to the following:

#set($inputRoot = $input.path('$'))
{
  "Data": "$util.base64Encode("{ ""browseraction"" : ""$input.params('browseraction')"", ""site"" : ""$input.params('Host')"" }")",
  "PartitionKey" : "$input.params('Host')",
  "StreamName" : "CSEKinesisBeaconInputStream"
}

(API Gateway service proxy body mapping template)

This tells API Gateway to take the host header and query string inputs and convert them into a payload to be put into a stream using a service proxy for Amazon Kinesis Streams.
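
To see what ends up in the stream, this is roughly the boto3 equivalent of what the service proxy does for each beacon request (the browseraction and host values are illustrative):

import json

import boto3

kinesis = boto3.client("kinesis")

# One beacon becomes one JSON record, partitioned by the Host header
record = {"browseraction": "Click", "site": "example.com"}  # illustrative values
kinesis.put_record(
    StreamName="CSEKinesisBeaconInputStream",
    Data=json.dumps(record).encode("utf-8"),  # the service base64-encodes data on the wire
    PartitionKey="example.com",
)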

Input stream

Amazon Kinesis Streams can capture terabytes of data per hour from thousands of sources.  In this case, you use it to deliver your JSON messages to Amazon Kinesis Analytics. 

Analytics application

Amazon Kinesis Analytics processes the incoming messages and allows you to perform analytics on the dynamic stream of data.  You use SQL to calculate the CTR and pass it into your RANDOM_CUT_FOREST function to calculate an anomaly score.

First, calculate the counts of impressions and clicks using SQL with a tumbling window in Amazon Kinesis Analytics.  Analytics uses streams and pumps in SQL to process data.  See our earlier post to get an overview of streaming data with Kinesis Analytics. Inside the Analytics SQL, a stream is analogous to a table and a pump is the flow of data into those tables.  Here’s the description of streams for the impression and click data.

CREATE OR REPLACE STREAM "CLICKSTREAM" ( 
   "CLICKCOUNT" DOUBLE
);


CREATE OR REPLACE STREAM "IMPRESSIONSTREAM" ( 
   "IMPRESSIONCOUNT" DOUBLE
);

Later, you calculate the clicks divided by impressions; you are using a DOUBLE data type to contain the count values.  If you left them as integers, the division below would yield only a 0 or 1.  Next, you define the pumps that populate the stream using a tumbling window, a window of time during which all records received are considered as part of the statement. 

CREATE OR REPLACE PUMP "CLICKPUMP" STARTED AS 
INSERT INTO "CLICKSTREAM" ("CLICKCOUNT") 
SELECT STREAM COUNT(*) 
FROM "SOURCE_SQL_STREAM_001"
WHERE "browseraction" = 'Click'
GROUP BY FLOOR(
  ("SOURCE_SQL_STREAM_001".ROWTIME - TIMESTAMP '1970-01-01 00:00:00')
    SECOND / 10 TO SECOND
);
CREATE OR REPLACE PUMP "IMPRESSIONPUMP" STARTED AS 
INSERT INTO "IMPRESSIONSTREAM" ("IMPRESSIONCOUNT") 
SELECT STREAM COUNT(*) 
FROM "SOURCE_SQL_STREAM_001"
WHERE "browseraction" = 'Impression'
GROUP BY FLOOR(
  ("SOURCE_SQL_STREAM_001".ROWTIME - TIMESTAMP '1970-01-01 00:00:00')
    SECOND / 10 TO SECOND
);

The GROUP BY statement uses FLOOR and ROWTIME to create a tumbling window of 10 seconds.  This tumbling window will output one record every ten seconds assuming we are receiving data from the corresponding pump.  In this case, you output the total number of records that were clicks or impressions. 
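
In other words, the FLOOR expression assigns every record to a fixed 10-second bucket. A small sketch of the same bucketing logic outside SQL:

from datetime import datetime, timezone

def window_key(ts: datetime, width_seconds: int = 10) -> int:
    # Timestamps that floor to the same multiple of 10 seconds fall into
    # the same tumbling window, mirroring the GROUP BY above.
    epoch = ts.astimezone(timezone.utc).timestamp()
    return int(epoch // width_seconds) * width_seconds

print(window_key(datetime.now(timezone.utc)))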

Next, use the output of these pumps to calculate the CTR:

CREATE OR REPLACE STREAM "CTRSTREAM" (
  "CTR" DOUBLE
);

CREATE OR REPLACE PUMP "CTRPUMP" STARTED AS 
INSERT INTO "CTRSTREAM" ("CTR")
SELECT STREAM "CLICKCOUNT" / "IMPRESSIONCOUNT" * 100 as "CTR"
FROM "IMPRESSIONSTREAM",
  "CLICKSTREAM"
WHERE "IMPRESSIONSTREAM".ROWTIME = "CLICKSTREAM".ROWTIME;

Finally, these CTR values are used in RANDOM_CUT_FOREST to detect anomalous values.  The DESTINATION_SQL_STREAM is the output stream of data from the Amazon Kinesis Analytics application. 

CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    "CTRPERCENT" DOUBLE,
    "ANOMALY_SCORE" DOUBLE
);

CREATE OR REPLACE PUMP "OUTPUT_PUMP" STARTED AS 
INSERT INTO "DESTINATION_SQL_STREAM" 
SELECT STREAM * FROM
TABLE (RANDOM_CUT_FOREST( 
             CURSOR(SELECT STREAM "CTR" FROM "CTRSTREAM"), --inputStream
             100, --numberOfTrees (default)
             12, --subSampleSize 
             100000, --timeDecay (default)
             1) --shingleSize (default)
)
WHERE ANOMALY_SCORE > 2; 

The RANDOM_CUT_FOREST function greatly simplifies the programming required for anomaly detection.  However, understanding your data domain is paramount when performing data analytics.  The RANDOM_CUT_FOREST function is a tool for data scientists, not a replacement for them.  Knowing whether your data is logarithmic, circadian rhythmic, linear, etc. will provide the insights necessary to select the right parameters for RANDOM_CUT_FOREST.  For more information about parameters, see the RANDOM_CUT_FOREST Function.

Fortunately, the default values work in a wide variety of cases. In this case, use the default values for all but the subSampleSize parameter.  Typically, you would use a larger sample size to increase the pool of random samples used to calculate the anomaly score; for this post, use 12 samples so as to start evaluating the anomaly scores sooner.  

Your SQL query outputs one record every ten seconds from the tumbling window so you’ll have enough evaluation values after two minutes to start calculating the anomaly score.  You are also using a cutoff value where records are only output to “DESTINATION_SQL_STREAM” if the anomaly score from the function is greater than 2 using the WHERE clause. To help visualize the cutoff point, here are the data points from a few runs through the pipeline using the sample Python script:

As you can see, the majority of the CTR data points are grouped around a 10% rate.  As the values move away from the 10% rate, their anomaly scores go up because they are more infrequent.  In the graph, an anomaly score above the cutoff is shaded orange. A cutoff value of 2 works well for this data set.  

Output stream

The Amazon Kinesis Analytics application outputs the values from “DESTINATION_SQL_STREAM” to an Amazon Kinesis stream when the record has an anomaly score greater than 2.

Anomaly execution function

The output stream is an input event for an AWS Lambda function with a batch size of 1, so the function gets called one time for each message put in the output stream.  This function represents a place where you could react to anomalous events.  In a real world scenario, you may want to instead update a real-time bidding system to react to current events.  In this example, you send a message to SNS to see a tangible response to the changing CTR.  The CloudFormation template creates a Lambda function similar to the following:

var AWS = require('aws-sdk');
var sns = new AWS.SNS({ region: "<region>" });

exports.handler = function(event, context) {
	// The batch size is 1 record, so this loop is expected to run only once
	event.Records.forEach(function(record) {
		// Kinesis data is base64 encoded, so decode it here
		var payload = new Buffer(record.kinesis.data, 'base64').toString('ascii');
		// The Analytics output record is CSV: "<CTR percent>,<anomaly score>"
		var rec = payload.split(',');
		var ctr = rec[0];
		var anomaly_score = rec[1];
		var detail = 'Anomaly detected with a click through rate of ' + ctr + '% and an anomaly score of ' + anomaly_score;
		var subject = 'Anomaly Detected';
		var params = {
			Message: detail,
			MessageStructure: 'String',
			Subject: subject,
			TopicArn: 'arn:aws:sns:us-east-1::ClickStreamEvent'
		};
		sns.publish(params, function(err, data) {
			if (err) context.fail(err.stack);
			else     context.succeed('Published Notification');
		});
	});
};

Notification system

Amazon SNS is a fully managed, highly scalable messaging platform.  For this walkthrough, you send messages to SNS with subscribers that send email and text messages to a mobile phone.
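
If you prefer to wire up the subscriptions by hand rather than through CloudFormation, a minimal sketch with boto3 follows; the topic ARN, email address, and phone number are placeholders:

import boto3

sns = boto3.client("sns")

topic_arn = "arn:aws:sns:us-east-1:012345678912:ClickStreamEvent"  # placeholder ARN
sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint="you@example.com")
sns.subscribe(TopicArn=topic_arn, Protocol="sms", Endpoint="+15555550100")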

Analytics pipeline walkthrough

Sometimes it’s best to build out a solution so you can see all the parts working and get a good sense of how it works.   Here are the steps to build out the entire pipeline as described above in your own account and perform real-time clickstream analysis yourself.  Following this walkthrough creates resources in your account that you can use to test the example pipeline.

Prerequisites

To complete the walkthrough, you’ll need an AWS account and a place to execute Python scripts.

Configure the CloudFormation template

Download the CloudFormation template named CSE-cfn.json from the AWS Big Data blog GitHub repository. In the CloudFormation console, set your region to N. Virginia, Oregon, or Ireland and then browse to that file.

When your script executes and an anomaly is detected, SNS sends a notification.  To see these notifications in an email or text message, enter your email address and mobile phone number in the Parameters section and choose Next.

This CloudFormation script creates policies and roles necessary to process the incoming messages.  Acknowledge this by selecting the acknowledgement field, or you will get an error about CAPABILITY_IAM.

If you entered a valid email or SMS number, you will receive a validation message when the SNS subscriptions are created by CloudFormation.  If you don’t wish to receive email or texts from your pipeline, you don’t need to complete the verification process.  

Run the Python script

Copy ClickImpressionGenerator.py to a local folder.

This script uses the requests library to make HTTP requests.  If you don’t have that Python package globally installed, you can install it locally by running the following command in the same folder as the Python script:

pip install requests -t .

When the CloudFormation stack is complete, choose Outputs, ExecutionCommand, and run the command from the folder with the Python script. 

This will start generating web traffic and send it to the API Gateway at the front of your analytics pipeline.  It is important to have data flowing into your Amazon Kinesis Analytics application for the next section so that a schema can be generated based on the incoming data.  If the script completes before you finish with the next section, simply restart the script with the same command.

Create the Analytics pipeline application

Open the Amazon Kinesis console and choose Go to Analytics.

Choose Create new application to create the Kinesis Analytics application, enter an application name and description, and then choose Save and continue.

Choose Connect to a source to see a list of streams.  Select the stream that starts with the name of your CloudFormation stack and contains CSEKinesisBeaconInputStream, in the form of <stack name>-CSEKinesisBeaconInputStream-<random string>.  This is the stream that your click and impression requests will be sent to from API Gateway.

Scroll down and choose Choose an IAM role.  Select the role with the name in the form of <stack name>-CSEKinesisAnalyticsRole-<random string>.  If your ClickImpressionGenerator.py script is still running and generating data, you should see a stream sample.  Scroll to the bottom and choose Save and continue.

Next, add the SQL for your real-time analytics.  Choose Go to SQL editor, choose Yes, start application, and paste the SQL contents from CSE-SQL.sql into the editor. Choose Save and run SQL.  After a minute or so, you should see data showing up every ten seconds in the CLICKSTREAM.  Choose Exit when you are done editing.

Choose Connect to a destination and select the <stack name>-CSEKinesisBeaconOutputStream-<random string> from the list of streams.  Change the output format to CSV. 

Choose Choose an IAM role and select <stack name>-CSEKinesisAnalyticsRole-<random string> again.  Choose Save and continue.  Now your analytics pipeline is complete.  When your script gets to the injected anomaly section, you should get a text message or email notifying you of the anomaly.  To clean up the resources created in this walkthrough, delete the stack in CloudFormation, stop the Kinesis Analytics application, and then delete the application.

Summary

A pipeline like this can be used for many use cases where anomaly detection is valuable. What solutions have you enabled with this architecture? If you have questions or suggestions, please comment below.


Real-time Clickstream Anomaly Detection with Amazon Kinesis Analytics

Post Syndicated from Chris Marshall original https://aws.amazon.com/blogs/big-data/real-time-clickstream-anomaly-detection-with-amazon-kinesis-analytics/

Chris Marshall is a Solutions Architect for Amazon Web Services

Analyzing web log traffic to gain insights that drive business decisions has historically been performed using batch processing.  While effective, this approach results in delayed responses to emerging trends and user activities.  There are solutions to deal with processing data in real time using streaming and micro-batching technologies, but they can be complex to set up and maintain.  Amazon Kinesis Analytics is a managed service that makes it very easy to identify and respond to changes in behavior in real-time.

One use case where it’s valuable to have immediate insights is analyzing clickstream data.   In the world of digital advertising, an impression is when an ad is displayed in a browser and a clickthrough represents a user clicking on that ad.  A clickthrough rate (CTR) is one way to monitor the ad’s effectiveness.  CTR is calculated in the form of: CTR = Clicks / Impressions * 100.  Digital marketers are interested in monitoring CTR to know when particular ads perform better than normal, giving them a chance to optimize placements within the ad campaign.  They may also be interested in anomalous low-end CTR that could be a result of a bad image or bidding model.

In this post, I show an analytics pipeline which detects anomalies in real time for a web traffic stream, using the RANDOM_CUT_FOREST function available in Amazon Kinesis Analytics.

RANDOM_CUT_FOREST 

Amazon Kinesis Analytics includes a powerful set of analytics functions to analyze streams of data.  One such function is RANDOM_CUT_FOREST.  This function detects anomalies by scoring data flowing through a dynamic data stream. This novel approach identifies a normal pattern in streaming data, then compares new data points in reference to it. For more information, see Robust Random Cut Forest Based Anomaly Detection On Streams.

Analytics pipeline components

To demonstrate how the RANDOM_CUT_FOREST function can be used to detect anomalies in real-time click through rates, I will walk you through how to build an analytics pipeline and generate web traffic using a simple Python script.  When your injected anomaly is detected, you get an email or SMS message to alert you to the event.

This post first walks through the components of the analytics pipeline, then discusses how to build and test it in your own AWS account.  The accompanying AWS CloudFormation script builds the Amazon API Gateway API, Amazon Kinesis streams, AWS Lambda function, Amazon SNS components, and IAM roles required for this walkthrough. Then you manually create the Amazon Kinesis Analytics application that uses the RANDOM_CUT_FOREST function in SQL.

Web Client

Because Python is a popular scripting language available on many clients, we’ll use it to generate web impressions and click data by making HTTP GET requests with the requests library.  This script mimics beacon traffic generated by web browsers.  The GET request contains a query string value of click or impression to indicate the type of beacon being captured.  The script has a while loop that iterates 2500 times.  For each pass through the loop, an Impression beacon is sent.  Then, the random function is used to potentially send an additional click beacon.  For most of the passes through the loop, the click request is generated at the rate of roughly 10% of the time.  However, for some web requests, the CTR jumps to a rate of 50% representing an anomaly.  These are artificially high CTR rates used for the purpose of illustration for this post, but the functionality would be the same using small fractional values

import requests
import random
import sys
import argparse

def getClicked(rate):
	if random.random() <= rate:
		return True
	else:
		return False

def httpGetImpression():
	url = args.target + '?browseraction=Impression' 
	r = requests.get(url)

def httpGetClick():
	url = args.target + '?browseraction=Click' 
	r = requests.get(url)
	sys.stdout.write('+')

parser = argparse.ArgumentParser()
parser.add_argument("target",help=" the http(s) location to send the GET request")
args = parser.parse_args()
i = 0
while (i < 2500):
	httpGetImpression()
	if(i<1950 or i>=2000):
		clicked = getClicked(.1)
		sys.stdout.write('_')
	else:
		clicked = getClicked(.5)
		sys.stdout.write('-')
	if(clicked):
		httpGetClick()
	i = i + 1
	sys.stdout.flush()

Web “front door”

Client web requests are handled by Amazon API Gateway, a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. This handles the web request traffic to get data into your analytics pipeline by being a service proxy to an Amazon Kinesis stream.  API Gateway takes click and impression web traffic and converts the HTTP header and query string data into a JSON message and place it into your stream.  The CloudFormation script creates an API endpoint with a body mapping template similar to the following:

#set($inputRoot = $input.path('$'))
{
  "Data": "$util.base64Encode("{ ""browseraction"" : ""$input.params('browseraction')"", ""site"" : ""$input.params('Host')"" }")",
  "PartitionKey" : "$input.params('Host')",
  "StreamName" : "CSEKinesisBeaconInputStream"
}

(API Gateway service proxy body mapping template)

This tells API Gateway to take the host header and query string inputs and convert them into a payload to be put into a stream using a service proxy for Amazon Kinesis Streams.

Input stream

Amazon Kinesis Streams can capture terabytes of data per hour from thousands of sources.  In this case, you use it to deliver your JSON messages to Amazon Kinesis Analytics.

Analytics application

Amazon Kinesis Analytics processes the incoming messages and allow you to perform analytics on the dynamic stream of data.  You use SQL to calculate the CTR and pass that into your RANDOM_CUT_FOREST function to calculate an anomaly score.

First, calculate the counts of impressions and clicks using SQL with a tumbling window in Amazon Kinesis Analytics.  Analytics uses streams and pumps in SQL to process data.  See our earlier post to get an overview of streaming data with Kinesis Analytics. Inside the Analytics SQL, a stream is analogous to a table and a pump is the flow of data into those tables.  Here’s the description of streams for the impression and click data.

CREATE OR REPLACE STREAM "CLICKSTREAM" ( 
   "CLICKCOUNT" DOUBLE
);


CREATE OR REPLACE STREAM "IMPRESSIONSTREAM" ( 
   "IMPRESSIONCOUNT" DOUBLE
);

Later, you calculate the clicks divided by impressions; you are using a DOUBLE data type to contain the count values.  If you left them as integers, the division below would yield only a 0 or 1.  Next, you define the pumps that populate the stream using a tumbling window, a window of time during which all records received are considered as part of the statement.

CREATE OR REPLACE PUMP "CLICKPUMP" STARTED AS 
INSERT INTO "CLICKSTREAM" ("CLICKCOUNT") 
SELECT STREAM COUNT(*) 
FROM "SOURCE_SQL_STREAM_001"
WHERE "browseraction" = 'Click'
GROUP BY FLOOR(
  ("SOURCE_SQL_STREAM_001".ROWTIME - TIMESTAMP '1970-01-01 00:00:00')
    SECOND / 10 TO SECOND
);
CREATE OR REPLACE PUMP "IMPRESSIONPUMP" STARTED AS 
INSERT INTO "IMPRESSIONSTREAM" ("IMPRESSIONCOUNT") 
SELECT STREAM COUNT(*) 
FROM "SOURCE_SQL_STREAM_001"
WHERE "browseraction" = 'Impression'
GROUP BY FLOOR(
  ("SOURCE_SQL_STREAM_001".ROWTIME - TIMESTAMP '1970-01-01 00:00:00')
    SECOND / 10 TO SECOND
);

The GROUP BY statement uses FLOOR and ROWTIME to create a tumbling window of 10 seconds.  This tumbling window will output one record every ten seconds assuming we are receiving data from the corresponding pump.  In this case, you output the total number of records that were clicks or impressions.

Next, use the output of these pumps to calculate the CTR:

CREATE OR REPLACE STREAM "CTRSTREAM" (
  "CTR" DOUBLE
);

CREATE OR REPLACE PUMP "CTRPUMP" STARTED AS 
INSERT INTO "CTRSTREAM" ("CTR")
SELECT STREAM "CLICKCOUNT" / "IMPRESSIONCOUNT" * 100 as "CTR"
FROM "IMPRESSIONSTREAM",
  "CLICKSTREAM"
WHERE "IMPRESSIONSTREAM".ROWTIME = "CLICKSTREAM".ROWTIME;

Finally, these CTR values are used in RANDOM_CUT_FOREST to detect anomalous values.  The DESTINATION_SQL_STREAM is the output stream of data from the Amazon Kinesis Analytics application.

CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    "CTRPERCENT" DOUBLE,
    "ANOMALY_SCORE" DOUBLE
);

CREATE OR REPLACE PUMP "OUTPUT_PUMP" STARTED AS 
INSERT INTO "DESTINATION_SQL_STREAM" 
SELECT STREAM * FROM
TABLE (RANDOM_CUT_FOREST( 
             CURSOR(SELECT STREAM "CTR" FROM "CTRSTREAM"), --inputStream
             100, --numberOfTrees (default)
             12, --subSampleSize 
             100000, --timeDecay (default)
             1) --shingleSize (default)
)
WHERE ANOMALY_SCORE > 2; 

The RANDOM_CUT_FOREST function greatly simplifies the programming required for anomaly detection.  However, understanding your data domain is paramount when performing data analytics.  The RANDOM_CUT_FOREST function is a tool for data scientists, not a replacement for them.  Knowing whether your data is logarithmic, circadian rhythmic, linear, etc. will provide the insights necessary to select the right parameters for RANDOM_CUT_FOREST.  For more information about parameters, see the RANDOM_CUT_FOREST Function.

Fortunately, the default values work in a wide variety of cases. In this case, use the default values for all but the subSampleSize parameter.  Typically, you would use a larger sample size to increase the pool of random samples used to calculate the anomaly score; for this post, use 12 samples so as to start evaluating the anomaly scores sooner.

Your SQL query outputs one record every ten seconds from the tumbling window so you’ll have enough evaluation values after two minutes to start calculating the anomaly score.  You are also using a cutoff value where records are only output to “DESTINATION_SQL_STREAM” if the anomaly score from the function is greater than 2 using the WHERE clause. To help visualize the cutoff point, here are the data points from a few runs through the pipeline using the sample Python script:

As you can see, the majority of the CTR data points are grouped around a 10% rate.  As the values move away from the 10% rate, their anomaly scores go up because they are more infrequent.  In the graph, an anomaly score above the cutoff is shaded orange. A cutoff value of 2 works well for this data set.

Output stream

The Amazon Kinesis Analytics application outputs the values from “DESTINATION_SQL_STREAM” to an Amazon Kinesis stream when the record has an anomaly score greater than 2.

Anomaly execution function

The output stream is an input event for an AWS Lambda function with a batch size of 1, so the function gets called one time for each message put in the output stream.  This function represents a place where you could react to anomalous events.  In a real world scenario, you may want to instead update a real-time bidding system to react to current events.  In this example, you send a message to SNS to see a tangible response to the changing CTR.  The CloudFormation template creates a Lambda function similar to the following:

var AWS = require('aws-sdk');
var sns = new AWS.SNS({ region: '<region>' });

exports.handler = function(event, context) {
	// The batch size is 1 record, so this loop is expected to run only once.
	event.Records.forEach(function(record) {
		// Kinesis data is base64 encoded, so decode it here.
		var payload = new Buffer(record.kinesis.data, 'base64').toString('ascii');
		// The application's CSV output has the form "<ctr>,<anomaly score>".
		var rec = payload.split(',');
		var ctr = rec[0];
		var anomaly_score = rec[1];
		var detail = 'Anomaly detected with a click through rate of ' + ctr + '% and an anomaly score of ' + anomaly_score;
		var subject = 'Anomaly Detected';
		var params = {
			Message: detail,
			// MessageStructure is omitted: its only valid value is 'json',
			// and this is a plain-text message.
			Subject: subject,
			TopicArn: 'arn:aws:sns:us-east-1::ClickStreamEvent'
		};
		sns.publish(params, function(err, data) {
			if (err) context.fail(err.stack);
			else     context.succeed('Published Notification');
		});
	});
};

Notification system

Amazon SNS is a fully managed, highly scalable messaging platform.  For this walkthrough, you send messages to SNS with subscribers that send email and text messages to a mobile phone.

Analytics pipeline walkthrough

Sometimes it’s best to build out a solution so you can see all the parts working and get a good sense of how it works.   Here are the steps to build out the entire pipeline described above in your own account and perform real-time clickstream analysis yourself.  Following this walkthrough creates resources in your account that you can use to test the example pipeline.

Prerequisites

To complete the walkthrough, you’ll need an AWS account and a place to execute Python scripts.

Configure the CloudFormation template

Download the CloudFormation template named CSE-cfn.json from the AWS Big Data blog Github repository. In the CloudFormation console, set your region to N. Virginia, Oregon, or Ireland and then browse to that file.

When your script executes and an anomaly is detected, SNS sends a notification.  To see these in an email or text message, enter your email address and mobile phone number in the Parameters section and choose Next.

This CloudFormation script creates the policies and roles necessary to process the incoming messages.  Acknowledge this by selecting the acknowledgement checkbox; otherwise, stack creation will fail with an error about the CAPABILITY_IAM capability.

If you entered a valid email or SMS number, you will receive a validation message when the SNS subscriptions are created by CloudFormation.  If you don’t wish to receive email or texts from your pipeline, you don’t need to complete the verification process.

Run the Python script

Copy ClickImpressionGenerator.py to a local folder.

This script uses the requests library to make HTTP requests.  If you don’t have that Python package globally installed, you can install it locally by running the following command in the same folder as the Python script:

pip install requests -t .

When the CloudFormation stack is complete, choose Outputs, ExecutionCommand, and run the command from the folder with the Python script.

This will start generating web traffic and sending it to the API Gateway at the front of your analytics pipeline.  It is important to have data flowing into your Amazon Kinesis Analytics application for the next section so that a schema can be generated based on the incoming data.  If the script completes before you finish with the next section, simply restart it with the same command.
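In case you’re curious what the generator is doing, the core of such a script boils down to something like the following sketch.  The endpoint URL and the payload field name here are hypothetical stand-ins; the real values come from the script and the CloudFormation stack outputs:

import random
import time

import requests

# Hypothetical endpoint; the real URL is in the CloudFormation outputs.
API_URL = 'https://example.execute-api.us-east-1.amazonaws.com/prod/beacon'

def send_event(action):
    # API Gateway forwards the request on to the Kinesis input stream.
    requests.post(API_URL, json={'browseraction': action})

while True:
    send_event('Impression')
    if random.random() < 0.10:  # roughly a 10% click-through rate
        send_event('Click')
    # (The real script also injects an anomalous stretch of extra clicks.)
    time.sleep(0.1)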

Create the Analytics pipeline application

Open the Amazon Kinesis console and choose Go to Analytics.

Choose Create new application to create the Kinesis Analytics application, enter an application name and description, and then choose Save and continue.

Choose Connect to a source to see a list of streams.  Select the stream that starts with the name of your CloudFormation stack and contains CSEKinesisBeaconInputStream, in the form <stack name>-CSEKinesisBeaconInputStream-<random string>.  This is the stream that your click and impression requests will be sent to from API Gateway.

Scroll down and choose Choose an IAM role.  Select the role with the name in the form of <stack name>-CSEKinesisAnalyticsRole-<random string>.  If your ClickImpressionGenerator.py script is still running and generating data, you should see a stream sample.  Scroll to the bottom and choose Save and continue.

Next, add the SQL for your real-time analytics.  Choose Go to SQL editor, choose Yes, start application, and paste the SQL contents from CSE-SQL.sql into the editor. Choose Save and run SQL.  After a minute or so, you should see data showing up every ten seconds in the CLICKSTREAM.  Choose Exit when you are done editing.

Choose Connect to a destination and select the <stack name>-CSEKinesisBeaconOutputStream-<random string> from the list of streams.  Change the output format to CSV.

Choose Choose an IAM role and select <stack name>-CSEKinesisAnalyticsRole-<random string> again.  Choose Save and continue.  Now your analytics pipeline is complete.  When your script gets to the injected anomaly section, you should get a text message or email notifying you of the anomaly.  To clean up the resources created here, delete the stack in CloudFormation, stop the Kinesis Analytics application, and then delete it.

Summary

A pipeline like this can be used for many use cases where anomaly detection is valuable. What solutions have you enabled with this architecture? If you have questions or suggestions, please comment below.

 



Storing Pokémon without SQL

Post Syndicated from Eevee original https://eev.ee/blog/2016/08/05/storing-pok%C3%A9mon-without-sql/

I run veekun, a little niche Pokédex website that mostly focuses on (a) very accurate data for every version, derived directly from the games and (b) a bunch of nerdy nerd tools.

It’s been languishing for a few years. (Sorry.) Part of it is that the team has never been very big, and all of us have either drifted away or gotten tied up in other things.

And part of it is that the schema absolutely sucks to work with. I’ve been planning to fix it for a year or two now, and with Sun/Moon on the horizon, it’s time I actually got around to doing that.

Alas! I’m still unsure on some of the details. I’m hoping if I talk them out, a clear best answer will present itself. It’s like advanced rubber duck debugging, with the added bonus that maybe a bunch of strangers will validate my thinking.

(Spoilers: I think I figured some stuff out by the end, so you don’t actually need to read any of this.)

The data

Pokémon has a lot of stuff going on under the hood.

  • The Pokémon themselves have one or two types; a set of abilities; moves they might learn at a given level or from a certain “tutor” NPC or via a specific item; evolution via one of at least twelve different mechanisms and which may branch; items they may be holding in the wild; six stats, plus effort for those six stats; flavor text; and a variety of other little data.

  • A number of Pokémon also have multiple forms, which can mean any number of differences that still “count” as the same Pokémon. Some forms are purely cosmetic (Unown); some affect the Pokémon’s type (Arceus); some affect stats (Pumpkaboo); some affect learned moves (Meowstic); some swap out a signature move (Rotom); some disable evolution (Pichu). Some forms can be switched at will; some switch automatically; some cannot be switched between at all. There aren’t really any hard and fast rules here. They’re effectively different Pokémon with the same name, except most of the properties are the same.

  • Moves are fairly straightforward, except that their effects vary wildly and it would be mighty convenient to be able to categorize them in a way that’s useful to a computer. After 17 years of trying, I’ve still not managed this.

  • Places connect to each other in various directions. They also may have some number of wild Pokémon, which appear at given levels with given probability. Oh, but certain conditions can change some — but not all! — of the possible encounters in an area, making for a UI nightmare. It gets particularly bad in Heart Gold and Soul Silver, where encounters (and their rates) are affected by time of day (morning, midday, night) and the music you’re playing (Sinnoh, Hoenn, none) and whether there’s an active swarm. Try to make sense of Rattata on Route 3.

  • Event Pokémon — those received from giveaways — may be given in several different ways, to several different regions, and may “lock” any of the Pokémon’s attributes either to a specific value or a choice of values.

  • And of course, all of this exists in at least eight different languages, plus a few languages with their own fanon vernacular, plus romanization for katakana and Hangul.

Even that would be all well and good, but the biggest problem of all is that any and all of this can change between games. Pairs of games — say, Red and Blue — tend to be mostly identical except for the encounters, since they come out at the same time. Spiky-Eared Pichu exists only in HGSS, and never appears again. The move Hypnosis has 60% accuracy in every game, except in Diamond and Pearl, where it has 70% accuracy. Sand Attack is ground-type, except in the first generation of games, where it was normal. Several Pokémon change how they evolve in later games, because they relied on a mechanic that was dropped. The type strength/weakness chart has been updated a couple times. And so on.

Oh, and there are several spin-off series, which often reuse the names of moves but completely change how they work. The entire Mystery Dungeon series, for example. Or even Pokémon Go.

This is awful.

The current approach

Since time immemorial, veekun has used a relational database. (Except for that one time I tried a single massive XML file, but let’s not talk about that.) It’s already straining the limits of this format, and it doesn’t even include half the stuff I just mentioned, like event Pokémon or where the move tutors are or Spiky-Eared Pichu’s disabled evolution.

Just the basic information about the Pokémon themselves is spread across three tables: pokemon_species, pokemon, and pokemon_forms. “Species” is supposed to be the pure essence of the name, so it contains stuff like “is this a baby” or “what does this evolve from/into” (which, in the case of Pichu, is already wrong!). pokemon_forms contains every form imaginable, including all 28 Unown, and tries to loosely categorize them — but it also treats Pokémon without forms as having a single “default” form. And then pokemon contains a somewhat arbitrary subset of forms and tacks other data onto them. Other tables arbitrarily join to whichever of these is most appropriate.

Tables may also be segmented by “version” (Red), “version group” (Red and Blue), or “generation” (Red, Blue, and Yellow), depending on when the data tends to vary. Oh, but there are also a number of conquest_* tables for Pokémon Conquest, which doesn’t have a row in versions since it’s not a mainline version. And I think there’s a goofy hack for Stadium in there somewhere.

For data that virtually never varies, except that one time it did, we… don’t really do anything. Base EXP was completely overhauled in X and Y, for example, and we only have a single base_experience column in the pokemon table, so it just contains the new X and Y values. What if you want to know about experience for an older game? Well, oops. Similarly, the type chart is the one from X and Y, which is no longer correct for previous games.

Aligning entities across games can be a little tricky, too. Earlier games had the Itemfinder, gen 5 had the Dowsing MCHN, and now we have the Dowsing Machine. These are all clearly the same item, but only the name Dowsing Machine appears anywhere in veekun, because there’s no support for changing names across games. The last few games also technically “renamed” every move and Pokémon from all-caps to title case, but this isn’t reflected anywhere. In fact, the all-caps names have never appeared on veekun.

All canonical textual data, including the names of fundamental entities like Pokémon and moves, are in separate tables so they can be separated by language as well. Numerous combinations of languages/games are missing, and I don’t think we actually have a list of which games were even released in which languages.

The result is a massive spread of tables, many of them very narrow but very tall, with joins that are not obvious if you’re not a DBA. I forget how half of it works if I haven’t looked at it in at least a month. I make this stuff available for anyone to use, too, so I would greatly prefer if it were (a) understandable by mortals and (b) not comically incomplete in poorly-documented ways.

I think a lot of this is a fear of massively duplicating the pile of data we’ve already got. Fixing the Dowsing Machine thing, for example, would require duplicating the name of every single item for every single game, just to fix this one item that was renamed twice. Fixing the base EXP problem would require yet another new table just for base experience, solely because it changed once.

It’s long past time to fix this.

SQL is bad, actually

(Let me cut you off right now: NoSQL is worse.)

I like the idea of a relational database. You have a schema describing your data, and you can link it together in myriad different ways, and it’s all built around set operations, and wow that’s pretty cool.

The actual implementation leaves a little to be desired. You can really only describe anything as flat tuples. You want to have things that can contain several other things, perhaps in order? Great! Make another flat tuple describing that, and make sure you remember to ask for the order explicitly, every single time you query.

Oh boy, querying. Querying is so, so tedious. You can’t even use all your carefully-constructed foreign key constraints as a shortcut; you have to write out foo.bar_id = bar.id in full every single time.

There are GUIs and whatnot, but the focus is all wrong. It’s on tables. Of course it’s on tables, but a single table is frequently not a useful thing to see on its own. For any given kind of entity (as defined however you think about your application), a table probably only contains a slice of what the entity is about, but it contains that slice for every single instance. Meanwhile, you can’t actually see a single entity on its own.

I’ll repeat that: you cannot.

Consider, for example, a Pokémon. A Pokémon has up to two types, which are rather fundamental properties. How do you view or fetch the Pokémon and its types?

Fuck you, that’s how. If you join pokemon to pokemon_types, you get this goofy result where everything about the Pokémon is potentially duplicated, but each row contains a distinct type.

Want to see abilities as well? There can be up to three of those! Join to both pokemon_abilities and pokemon_types, and now you get up to six rows, which looks increasingly not at all like what you actually wanted. Want moves as well? Good luck.

I don’t understand how this is still the case. SQL is 42 years old! How has it not evolved to have even the slightest nod towards the existence of nested data? This isn’t some niche use case; it’s responsible for at least a third of veekun’s tables!

This die-hard focus on data-as-spreadsheets is probably why we’ve tried so hard to avoid “duplication”, even when it’s the correct thing to do. The fundamental unit of a relational database is the table, and seeing a table full of the same information copied over and over just feels wrong.

But it’s really the focus on tables that’s wrong. The important point isn’t that Bulbasaur is named “BULBASAUR” in ten different games; it’s that each of those games has a name for Bulbasaur, and it happens to be the same much of the time.

NoSQL exists, yes, but I don’t trust anyone who looked at SQL and decided that the real problem was that it has too much schema.

I know the structure of my data, and I’m happy to have it be enforced. The problem isn’t that writing a schema is hard. The problem is that any schema that doesn’t look like a bank ledger maps fairly poorly to SQL primitives. It works, and it’s correct (if you can figure out how to express what you want), but the ergonomics are atrocious.

We’ve papered over some of this with SQLAlchemy’s excellent ORM, but you have to be very good at SQLAlchemy to make the mapping natural, which is the whole goal of using an ORM. I’m pretty good, and it’s still fairly clumsy.

A new idea

So. How about YAML?

See, despite our hesitation to duplicate everything, the dataset really isn’t that big. All of the data combined are a paltry 17MB, which could fit in RAM without much trouble; then we could search and wrangle it with regular Python operations. I could still have a schema, remember, because I wrote a thing for that. And other people could probably make more sense of some YAML files than CSV dumps (!) of a tangled relational database.

The idea is to re-dump every game into its own set of YAML files, describing just the raw data in a form generic enough that it can handle every (main series) game. I did a proof of concept of this for Pokémon earlier this year, and it looks like:

%TAG !dex! tag:veekun.com,2005:pokedex/
--- !!omap
- bulbasaur: !dex!pokemon
    name: BULBASAUR
    types:
    - grass
    - poison
    base-stats:
      attack: 49
      defense: 49
      hp: 45
      special: 65
      speed: 45
    growth-rate: medium-slow
    base-experience: 64
    pokedex-numbers:
      kanto: 1
    evolutions:
    - into: ivysaur
      minimum-level: 16
      trigger: level-up
    species: SEED
    flavor-text: "A strange seed was\nplanted on its\nback at birth.\fThe plant sprouts\nand
      grows with\nthis POKéMON."
    height: 28
    weight: 150
    moves:
      level-up:
      - 1: tackle
      # ...
    game-index: 153

This is all just regular ol’ YAML syntax. This is for English Red; there’d also be one for French Red, Spanish Red, etc. Ultimately, there’d be a lot of files, with a separate set for every game in every language.
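(And when I say “regular Python operations”, I mean it. Here’s roughly what consuming one of these files with PyYAML would look like; the path follows the proposed layout, which doesn’t exist yet, and the query at the end is just example code:)

import yaml

# The %TAG directive expands !dex!pokemon to this full tag URI, which the
# SafeLoader needs to be taught about before it will load the file.
def construct_pokemon(loader, node):
    return loader.construct_mapping(node, deep=True)

yaml.SafeLoader.add_constructor(
    'tag:veekun.com,2005:pokedex/pokemon', construct_pokemon)

with open('red/en/pokemon.yaml') as f:
    # !!omap loads as an ordered list of (identifier, pokemon) pairs.
    pokedex = yaml.safe_load(f)

# From here on out it's plain Python:
grass_types = [ident for ident, mon in pokedex if 'grass' in mon['types']]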

The UI will have to figure out when some datum was the same in every game, but it frequently does that even now, so that’s not a significant new burden. If anything, it’s an improvement, since now it’ll be happening only in one place; right now there are a lot of ad-hoc “deduplication” steps done behind the scenes when we add new data.

I like this idea, but I still feel very uneasy about it for unclear reasons. It is a wee bit out there. I could just take this same approach of “fuck it, store everything” and still use a relational database. But look at this little chunk of data; it already tells you plenty of interesting facts about Bulbasaur and only Bulbasaur, yet it would need at least half a dozen tables to express in a relational database. And you couldn’t inspect just Bulbasaur, and you’d have to do multiple queries to actually get everything, and there’d be no useful way to work with the data independently of the app, and so on. Worst of all, the structure is often not remotely obvious from looking at the tables, whereas you can literally see it in YAML syntax.

There are other advantages, as well:

  • A schema can still be enforced Python-side, using the camel loader, which by the way will produce objects rather than plain dicts. (That’s what the !dex!pokemon tag is for.)
  • If you don’t care about veekun at all and just want data, you have it in a straightforward format, for any version you like.
  • YAML libraries are fairly common, and even someone with very limited programming experience can make sense of the above structure. Currently we store CSV database dumps and offer a tool to load into an RDBMS, which has led to a number of bug reports about obscure compatibility issues with various databases, as well as numerous emails from people who are confused about how to load the data or even about what a database is.
  • It’s much more obvious what’s missing. If there’s no directory for Pokémon Yellow, surprise! That means we don’t have Pokémon Yellow. If the directory exists but there’s no places.yaml, guess what we’re missing! Figuring out what’s there and what’s not in a relational system is much more difficult; I only recently realized that we don’t have flavor text for any game before Black/White.
  • I’ll never again have to rearchitect the schema because a new game changed something I didn’t expect could ever change. Similarly, the UI can drop a lot of special cases for “this changes between games”, “this changes between generations”, etc. and treat it all consistently.
  • Pokémon forms can just be two Pokémon with the same species name. Fuck it, store everything. YAML even has “merge” syntax built right in that can elide the common parts. (This isn’t shown above, and I don’t know exactly what the syntax looks like yet.)

Good idea? Sure, maybe? Okay let’s look at some details, where the devil’s in.

Problems

There are several, and they are blocking my progress on this, and I only have three months to go.

Speed

There will be a lot of YAML, and loading a lot of YAML is not particularly quick, even with pyyaml’s C loader. YAML is a complicated format and this is a lot of text to chew through. I won’t know for sure how slow this is until I actually have more than a handful of games in this format, though.

I have a similar concern about memory use, since I’ll suddenly be storing a whole lot of identical data. I do have an idea for reducing memory use for strings, which is basically manual interning:

string_datum = big_ol_string_dict.setdefault(string_datum, string_datum)

If I load two YAML files that contain the same string, I can reuse the first one instead of keeping two copies around for no reason. (Strings are immutable in Python, so this is fine.)

Alas, I’ve seen this done before, and it does have a teeny bit of overhead, which might make the speed issue even worse.

So I think what I’m going to do is load everything into objects, resolve duplicate strings, and then… store it all in a pickle! Then the next time the app goes to load the data, if the pickle is newer than any of the files, just load the pickle instead. Pickle is a well-specified binary format (much faster to parse) and should be able to remember that strings have already been de-duplicated.
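In outline, that loader would be something like this; a sketch, with invented paths, and handwaving about where the string de-duplication pass goes:

import os
import pickle

import yaml

def load_pokedex(yaml_paths, cache_path='pokedex.pickle'):
    # Use the pickle only if it's newer than every source file.
    try:
        cache_mtime = os.path.getmtime(cache_path)
        if all(os.path.getmtime(p) < cache_mtime for p in yaml_paths):
            with open(cache_path, 'rb') as f:
                return pickle.load(f)
    except Exception:
        # If the cache fails to load for any reason at all, silently
        # trash it and load the data directly.
        pass

    data = []
    for path in yaml_paths:
        with open(path) as f:
            data.append(yaml.safe_load(f))
    # ... resolve duplicate strings here ...

    with open(cache_path, 'wb') as f:
        pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
    return data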

I know, I know: I said don’t use pickle. This is the one case where pickle is actually useful: as a disposable cache. It doesn’t leave the machine, so there are no security concerns; it’s not shared between multiple copies of the app at the same time; and if it fails to load for any reason at all, the app can silently trash it and load the data directly.

I just hope that pickle will be quick enough, or this whole idea falls apart. Trouble is, I can’t know for sure until I’m halfway done.

Languages versus games

Earlier I implied that every single game would get its own set of data: English Red has a set of files, French Red has the same set of files, etc.

For the very early games, this directly reflects their structure: each region got its own cartridge with the game in a single language. Different languages might have different character sets, different UI, different encounters (Phanpy and Teddiursa were swapped in Gold and Silver’s Western releases), different mechanics (leech moves fail against a Substitute in gen 1, but only in Japanese), and different graphics (several Gold and Silver trainer classes were slightly censored outside of Japan). You could very well argue that they’re distinct games.

The increased storage space of the Nintendo DS changed things. The games were still released regionally, but every game contains every language’s flavor text and “genus” (the stuff you see in the Pokédex). This was an actual feature of the game: if you received a Pokémon caught in another language — made much easier by the introduction of online trading — then you’d get the flavor text for that language in your Pokédex.

The DS versions also use a filesystem rather than baking everything into the binary, so very little code needed to change between languages; everything of interest was in text files.

As of X and Y, there are no localizations. Every game contains the full names and descriptions of everything, plus the entire game script, in every language. In fact, you can choose which language to play the game in — in an almost unprecedented move for a Nintendo game, an American player with the American copy of the game can play the entire thing in Japanese.

(If this weren’t the case, you’d need an entire separate 3DS to do that, since the 3DS is region-locked. Thanks, Nintendo.)

The question, then, is how to sensibly store all this.


With the example YAML above, human-language details like names and flavor text are baked right into the Pokémon. This makes sense in the context of a single game, where those are properties of a Pokémon. If you take that to be the schema, then the obvious thing to do is to have a separate file for every game in every language: /red/en/pokemon.yaml, /red/fr/pokemon.yaml, and so on.

This isn’t ideal, since most of the other data is going to be the same. But those games are also the smallest, and anyway this captures the rare oddball difference like Phanpy and Teddiursa (though hell if I know how to express that in the UI).

With X and Y, everything goes out the window. There are effectively no separate games any more, so /x/en versus /x/fr makes no sense. It’s very clear now that flavor text — and even names — aren’t direct properties of the Pokémon, but of some combination of the Pokémon and the player.


One option is to put some flexibility in the directory structure.

/red
  /en
    pokemon.yaml
    pokemon-text.yaml
  /ja
    pokemon.yaml
    pokemon-text.yaml
...
/x
  pokemon.yaml
  /en
    pokemon-text.yaml
  /ja
    pokemon-text.yaml

A pokemon-text.yaml file would be a very simple mapping.

bulbasaur:
    name: BULBASAUR
    species: SEED
    flavor-text: "A strange seed was\nplanted on its\nback at birth.\fThe plant sprouts\nand
      grows with\nthis POKéMON."
ivysaur:
    ...

(Note that the lower-case keys like bulbasaur are identifiers, not names — they’re human-readable and obviously based on the English names, but they’re supposed to be treated as opaque dev-only keys. In fact I might try to obfuscate them further, to discourage anyone from title-casing them and calling them names.)

Something about this doesn’t sit well. I think part of it is that the structure in pokemon-text.yaml doesn’t represent a meaningful thing, which is somewhat at odds with the idea of loading each file directly into a set of objects. With this approach, I have to patchwork update existing objects as I go.

It’s kind of a philosophical quibble, granted.
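(The patchwork itself wouldn’t be hard, mind. A sketch, with plain dicts standing in for the real objects and the directory layout from above:)

import yaml

def load_game(game_dir, languages):
    with open(game_dir + '/pokemon.yaml') as f:
        pokemon = yaml.safe_load(f)  # identifier -> per-game data

    for lang in languages:
        with open('{}/{}/pokemon-text.yaml'.format(game_dir, lang)) as f:
            texts = yaml.safe_load(f)
        for identifier, strings in texts.items():
            # Patch the language-specific strings onto the existing record.
            pokemon[identifier].setdefault('text', {})[lang] = strings

    return pokemon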


An extreme solution would be to pretend that X and Y are several different games: have /x/en and /x/fr, even though they contain mostly the same information taken from the same source.

I don’t think that’s a great idea, especially since the merged approach will surely be how all future games work as well.


At the other extreme, I could treat the older games as though they were separate versions themselves. Add a grouping called “cartridge” or something that’s a subset of “version”. Many of the oddball differences are between the Japanese version and everyone else, too.

There’s even a little justification for this in the way the first few games were released. Japan first got Red and Green, which had goofy art and were very buggy; they were later polished and released as the single version Japanese Blue, which became the basis for worldwide releases of Red and Blue. Japanese Red is a fairly different game from American Red; Japanese Blue is closer to American Blue but still not really the same. veekun already has a couple of nods towards this, such as having separate Red/Green and Red/Blue sprite art.

That would lead to a list of games like jp-red, jp-green, jp-blue, ww-red, ww-blue, yellow (I think they were similar across the board), jp-gold, jp-silver, ww-gold, ww-silver, crystal (again, I don’t think there were any differences), and so on. The schema would look like:

bulbasaur:
    name:
        en: BULBASAUR
        fr: BULBIZARRE
        es: BULBASAUR
        ...
    flavor-text: 
        en: "A strange seed was\nplanted on its\nback at birth.\fThe plant sprouts\nand
          grows with\nthis POKéMON."
        ...

The Japanese games, of course, would only have Japanese entries. A huge advantage of this approach is that it also works perfectly with the newer games, where this is effectively the structure of the original data anyway.

This does raise the question of exactly how I generate such a file without constantly reloading and redumping it. I guess I could dump every language game at the same time. That would also let me verify that there are no differences besides text.

The downside is mostly that the UI would have to consolidate this, and the results might be a little funky. Merging jp-gold with ww-gold and just calling it “Gold” when the information is the same, okay, sure, that’s easy and makes sense. jp-red versus ww-red is a bit weirder of a case. On the other hand, veekun currently pretends Red and Green didn’t even exist, which is certainly wrong.

I’d have to look more into the precise differences to be sure this would actually work, but the more I think about it, the more reasonable this sounds. Probably the biggest benefit is that non-text data would only differ across games, not potentially across games and languages.

Wow, this might be a really good idea. And it had never occurred to me before writing this section. This rubber duck thing really works, thanks!

Forms

As mentioned above, rather than try to group forms into various different tiers based on how much they differ, I might as well just go whole hog and have every form act as a completely distinct Pokémon.

Doing this with YAML’s merge syntax would even make the differences crystal clear:

plant-wormadam:
    &plant-wormadam
    types: [bug, grass]
    abilities:
        1: anticipation
        2: anticipation
        hidden: overcoat
    moves:
        ...
    # etc
trash-wormadam:
    <<: *plant-wormadam  # means "merge in everything from this other node"
    types: [bug, ground]
    moves:
        ...
# Even better:
unown-a:
    &unown-a
    types: [psychic]
    name: ...
    # whatever else
unown-c:
    <<: *unown-a
unown-d:
    <<: *unown-a
unown-e:
    <<: *unown-a

One catch is that I don’t know how to convince PyYAML to output merge nodes, though it’s perfectly happy to read them.
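Reading them back is trivial, at least; PyYAML flattens merges at load time, so they’re already gone by the time you have Python objects:

import yaml

doc = """
unown-a: &unown-a
    types: [psychic]
unown-c:
    <<: *unown-a
"""

print(yaml.safe_load(doc)['unown-c'])   # {'types': ['psychic']}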

But wait, hang on. This is a list of Pokémon, not forms. Wormadam is a Pokémon. Plant Wormadam is a form.

Right?

This distinction has haunted us rather thoroughly since we first tried to support it with Diamond and Pearl. The website is still a little weird about this: it acts as though “Plant Wormadam” is the name of a distinct Pokémon (because it affects type) and has distinct pages for Sandy and Trash Wormadam, but “Burmy” is a single page, even though Wormadam evolves from Burmy and they have the same forms. (In Burmy’s case, though, form only affects the sprite and nothing else.) You can also get distinct forms in search results, which may or may not be what you want — but it also may or may not make sense to “ignore” forms when searching. In many cases we’ve arbitrarily chosen a form as the “default” even when there isn’t a clear one, just so you get something useful when you type in “wormadam”.

Either way, there needs to be something connecting them. Merge keys are only a shorthand for writing YAML; they’re completely invisible to app code and don’t exist in the deserialized data.

YAML does have a nice shorthand syntax for a list of mappings:

bulbasaur:
-   name: ...
    types: ...
unown:
-   &unown-a
    name: ...
    types: ...
    form: a
-   <<: *unown-a
    form: b
-   <<: *unown-a
    form: c
...

Hm, now we lose the unown-a that functions as the actual identifier for the form.

Alternatively, there could be an entire separate type for sets of forms, since we do have tags here.

bulbasaur: !dex!pokemon
    name: ...
unown: !dex!pokemon-form-set
    unown-a: !dex!pokemon
        name: ...
    unown-b: !dex!pokemon
        ...

An unadorned Pokemon could act as a set of 1, then? I guess?

Come to think of it, this knits with another question: where does data specific to a set of forms go? Data like “can you switch between forms” and “is this purely cosmetic”. We can’t readily get that from the games, since it’s code rather than data.

It’s also extremely unlikely to ever change, since it’s a fundamental part of each multi-form Pokémon’s in-universe lore. So it makes sense to store that stuff in some separate manually-curated place, right? In which case, we could do the same for storing which sets of forms “count” as the same Pokémon. That is, the data files could contain plant-wormadam and sandy-wormadam as though they were completely distinct, and then we’d have our own bits on top (which we need anyway) to say that, hey, those are both forms of the same thing, wormadam.

That mirrors how the actual games handle this, too — the three Wormadam forms have completely separate stat/etc. structs.

Ah, but the games don’t store the Burmy or Unown forms separately, because they’re cosmetic. How does our code handle that? I guess there’s only one unown, and then we also know that there are 28 possible sprites?

But Arceus’s forms have different types, and they’re not stored separately either. (I think you could argue that Arceus is cosmetic-only, the cosmetic form is changed by Arceus’s type, and Arceus’s type is really just changed by Arceus’s ability. I’m pretty sure the ability doesn’t work if you hack it onto any other Pokémon, but I can’t remember whether Arceus still changes type if hacked to have a different ability.)

Relying too much on outside information also makes the data a teeny bit harder for anyone else to use; suddenly they have three Wormadams, none of which are quite called “Wormadam”, but all of which share the same Pokédex number. (Oh, right, we could just link them by Pokédex number.) That feels goofy, but if what you’re after is something with a definitive set of types, there is nothing called simply “Wormadam”.

Oh, and there’s a minigame that only exists in Heart Gold and Soul Silver, but that has different stats even for cosmetic forms. Christ.

I don’t think there’s any perfect answer here. I have a list of all the forms if you’d like to see more of this madness.

The Python API

So you want to load all this data and do stuff with it. Cool. There’ll be a class like this:

class Pokemon(Locus):
    types = List(Type, min=1, max=2)
    growth_rate = Scalar(GrowthRate)
    game_index = Scalar(int)
    ...

You know, a little declarative schema that matches the YAML structure. I love declarative classes.
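None of this machinery exists yet, but the skeleton is simple enough. A bare-bones sketch of how those field descriptors might hang together — all invented, and the real thing would build objects straight from the YAML loader:

class Field:
    def __init__(self, type_):
        self.type = type_

    def check(self, value):
        assert isinstance(value, self.type), value

class Scalar(Field):
    pass

class List(Field):
    def __init__(self, type_, min=0, max=None):
        super().__init__(type_)
        self.min, self.max = min, max

    def check(self, value):
        upper = self.max if self.max is not None else len(value)
        assert self.min <= len(value) <= upper
        for item in value:
            super().check(item)

class Locus:
    def __init__(self, **data):
        for key, value in data.items():
            field = getattr(type(self), key)  # unknown keys blow up here
            field.check(value)
            setattr(self, key, value)

class Pokemon(Locus):
    types = List(str, min=1, max=2)
    game_index = Scalar(int)

bulbasaur = Pokemon(types=['grass', 'poison'], game_index=153)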

The big question here is what a Pokemon is. (Besides whether it’s a form or not.) Is it a wrapper around all the possible data from every possible game, or just the data from one particular game? Probably the former, since the latter would mean having some twenty different Pokemon all called bulbasaur and that’s weird.

(Arguably, the former would be wrong because much of this stuff only applies to the main games and not Mystery Dungeon or Ranger or whatever else. That’s a very different problem that I’ll worry about later.)

I guess then a Pokemon would wrap all its attributes in a kind of amalgamation object:

print(pokemon)                          # <Pokemon: bulbasaur>
print(pokemon.growth_rate)              # <MultiValue: bulbasaur.growth_rate>
current = Game.latest
print(current)                          # <Game: alpha-sapphire>
print(pokemon.growth_rate[current])     # <GrowthRate: medium-slow>
pokemonv = pokemon.for_version(current)
print(pokemonv)                         # <Pokemon: bulbasaur/alpha-sapphire>
print(pokemonv.growth_rate)             # <GrowthRate: medium-slow>

There’s one more level you might want: a wrapper that slices by language solely for your own convenience, so you can say print(some_pokemon.name) and get a single string rather than a thing that contains all of them.

Should you be able to slice by language but not by version, so pokemon.name is a thing containing all English names across all the games? I guess that sounds reasonable to want, right? It would also let you treat text like any other property, which could be handy.

print(pokemon)                          # <Pokemon: bulbasaur>
print(pokemon.growth_rate)              # <MultiValue: bulbasaur.growth_rate>
# I'm making up method names on the fly here, so.
# Also there will probably be a few ways to group together changed properties,
# depending entirely on what the UI needs.
print(pokemon.growth_rate.meld())       # [((...every game...), <GrowthRate: medium-slow>)]
print(pokemon.growth_rate.unify())      # <GrowthRate: medium-slow>
pokemonl = pokemon.for_language(Language['en'])
print(pokemonl.name)                    # <MultiValue: bulbasaur.name>
print(pokemonl.name.meld())             # [((<Game: ww-red>, ...), 'BULBASAUR'), ((<Game: x>, ...), 'Bulbasaur')]
print(pokemonl.name.unify())            # None, maybe ValueError?

(Having written all of this, I suddenly realize that I’m targeting Python 3, where I can use é in class names. Which I am probably going to do a lot.)

I think… this all… seems reasonable and doable. It’ll require some clever wrapper types, but that’s okay.
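For instance, a rough sketch of one such wrapper, using the meld/unify names I made up above (so: doubly hypothetical):

class MultiValue:
    """One property's values across every game."""
    def __init__(self, values):
        # values: a dict mapping game -> that game's value
        self.values = values

    def __getitem__(self, game):
        return self.values[game]

    def meld(self):
        # Group games that share a value: [((game, ...), value), ...]
        groups = {}
        for game, value in self.values.items():
            groups.setdefault(value, []).append(game)
        return [(tuple(games), value) for value, games in groups.items()]

    def unify(self):
        # The single value shared by every game, if there is one.
        melded = self.meld()
        if len(melded) == 1:
            return melded[0][1]
        raise ValueError('value differs between games')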

Hmm

I know these are relatively minor problems in the grand scheme of things. People handle hundreds of millions of rows in poorly-designed MySQL tables all the time and manage just fine. I’m mostly waffling because this is a lot of (hobby!) work and I’ve already been through several of these rearchitecturings and I’m tired of discovering the dozens of drawbacks only after all the work is done.

Writing this out has provided some clarity, though, and I think I have a better idea of what I want to do. So, thanks.

I’d like to have a proof of concept of this, covering some arbitrary but representative subset of games, by the end of the month. Keep your eyes peeled.

Weekly roundup: doodling

Post Syndicated from Eevee original https://eev.ee/dev/2016/06/13/weekly-roundup-doodling/

June’s theme is clearing my plate! I will finally stop spreading these vegetables around to make it look like I ate some of them, and just sneak them to the dog under the table.

  • art: I drew a beautiful manga, which is based on true events. I drew a new header image for here, replacing the old poor cutout of a JPEG exported from Pokémon Art Academy, and adjusted the colorscheme of the whole site to match it. And I did a lot of doodling. A loooot of doodling.

  • doom: I did a lot of work on my factory map for DUMP 3, but I don’t think I’m going to make the deadline. I’m okay with that. I did a mad scramble for DUMP 2 in April, then a longer mad scramble for Under Construction in May; I’m alright with not doing a mad scramble for a third month in a row. I can still make maps outside of a particular challenge.

  • twitter: I made @perlin_noise, a Twitter bot that tweets Perlin noise in various forms.

  • pyscss: I’d had a ton of GitHub mail marked unread for ages, so I finally spent a day fixing a lot of easy bugs and released 1.3.5. I don’t plan to spend significant time on this in the future — especially when there’s now a libsass in C and Python bindings to it — but it’s nice to see some obvious problems fixed.

  • blog: It’s been a year since I quit now!

It doesn’t sound like a lot, but honestly most of the time went into the Doom map and artwork, both of which are things I wanted to do this month. I’m not sure how I’ll schedule the mapping if I’m giving up on DUMP 3; maybe I’ll hop on Runed Awakening for a bit and look at the map again later with fresh eyes.

Perceptual Image Compression at Flickr

Post Syndicated from yahoo original https://yahooeng.tumblr.com/post/130574301641

Archie Russell, Peter Norby, Saeideh Bakhshi

At Flickr our users really care about image quality.  They also care a lot about how responsive our apps are.  Addressing both of these concerns simultaneously is challenging; higher-quality images have larger file sizes and are slower to transfer.  Slow transfers are especially noticeable on mobile devices.  Flickr had historically targeted high image quality, but in late 2014 we implemented a method to both maintain image quality and decrease file size.  As image appearance is very important to our users, we performed an extensive user test before rolling this change out.  Here’s how we did it.

Background: JPEG Quality Settings

JPEG compression has several tunable knobs.  The q-value is the best known of these; it adjusts the level of spatial detail stored for fine details: a higher q-value typically keeps more detail.  However, as the q-value gets very close to 100, file size increases dramatically, usually without improving image appearance.

If file size and app performance aren’t an issue, dialing up the q-value is an easy way to get really nice-looking images; this is what Flickr has done in the past.  And if appearance isn’t very important, dialing down the q-value is a viable option.  But if you want both, you’re kind of stuck.  Additionally, q-value is not one size fits all; some images look great at q-value 80 while others don’t.

Another commonly adjusted setting is chroma-subsampling, which alters the amount of color information stored in a JPEG file.  With a setting of 4:4:4, the two chroma (color) channels in a JPEG have as much information as the luminance channel.  In an image with a setting of 4:2:0, each chroma channel has only a quarter as much information as in a 4:4:4 image.

[Table 1: the same JPEG stored at four quality/chroma combinations: q=96, chroma=4:4:4 (125KB); q=70, chroma=4:4:4 (67KB); q=96, chroma=4:2:0 (62KB); q=70, chroma=4:2:0 (62KB).  The top image is saved at high quality and chroma level (notice the color and detail in the folds of the red flag); the bottom image has the lowest quality (notice artifacts along the right edges of the red flag).]
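(If you want to reproduce variants like those in Table 1 yourself, Pillow exposes both of these knobs.  This is just an illustration, not our production pipeline, and the file names are stand-ins.)

from PIL import Image

im = Image.open('flag.jpg')  # stand-in source file

# subsampling: 0 = 4:4:4, 1 = 4:2:2, 2 = 4:2:0
im.save('high.jpg', 'JPEG', quality=95, subsampling=0)   # large, detailed
im.save('small.jpg', 'JPEG', quality=70, subsampling=2)  # much smaller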
Perceptual JPEG Compression

Ideally, we’d have an algorithm which automatically tuned all JPEG parameters to make a file smaller, but which would limit perceptible changes to the image.  Technology exists that attempts to do this and can decrease image file size by over 30%.  This compression ratio is highly dependent on image content and dimensions.

[Fig 2: compressed (left) and uncompressed (right) versions of the same image.  The compressed image is 36% smaller.]

We were pleased with perceptually compressed images in cursory examinations; compressed images were smaller and nearly indistinguishable from their sources.  But we wanted to really quantify how well it worked before rolling it out.  The standard computational tools for evaluating compression, such as SSIM, are fairly simplistic and don’t do a great job at modeling how a user sees things.  To really evaluate this technology we had to use a better measure of perceptibility: human minds.

To test whether our image compression would impact user perception of image quality, we put together a “taste test”.  The taste test was constructed as a game with multiple rounds where users looked at both compressed and uncompressed images.  Users accumulated points the longer they played, and got more points for doing well at the game.  We maintained a leaderboard to encourage participation and used only internal testers.

The game’s test images came from a diverse collection of 250 images contributed by Flickr staff.  The images came from a variety of cameras and included a number of subjects from photographers with varying skill levels.

In each round, our test code randomly selected a test image and presented two variants of this image side by side.  50% of the time we presented the user two identical images; the rest of the time we presented one compressed image and one uncompressed image.  We asked the tester if the two images looked the same or different.  We expected that a user choosing randomly, or a user unable to distinguish the two cases, would answer correctly about half the time.  We randomly swapped the location of the compressed images to compensate for user bias to the left or the right.  If testers chose correctly, they were presented with a second question: “Which image did you prefer, and why?”

[Fig 4: screenshot of the taste test.]

Our test displayed images simultaneously to prevent testers noticing a longer load time for the larger, non-compressed image.  The images were presented with either 320, 640, or 1600 pixels on their longest side.  The 320px and 640px images were shown for 12 seconds before being dimmed out.  The intent behind this detail was to represent how real users interact with our images.  The 1600px images stayed on screen for 20 seconds, as we expected larger images to be viewed for longer periods of time by real users.

Taste Test Outcome and Deployment

We ran our taste test for two weeks and analyzed our results.  Although we let users play as long as they liked, we skipped the first result per user as a “warm-up” and considered only the subsequent ten results, which limited the potential for users training themselves to spot compression artifacts.  We disregarded users that had fewer than eleven results.

[Table 2: taste test results.  Testers selected “identical” at nearly the same rate whether the input pair was identical or not.]

When our testers were presented with two identical images, they thought the images were identical only 68.8% of the time(!), and when presented with a compressed image next to a non-compressed image, our testers thought the images were identical slightly less often: 67.6% of the time.  This difference was small enough for us, and our statisticians told us it was statistically insignificant.  Our image pairs were so similar that multiple testers thought all images were identical and reported that the test system was buggy.  We inspected the images most often labeled different, and found no significant artifacts in the compressed versions.

So even in this side-by-side test, perceptual image compression was only barely noticeable.  As the Flickr website wouldn’t ever show compressed and uncompressed images at the same time, and the use of compression had large benefits in storage footprint and site performance, we elected to go forward.

At the beginning of 2014 we silently rolled out perceptual-based compression on our image thumbnails (we don’t alter the “original” images uploaded by our users).  The slight changes to image appearance went unnoticed by users, but user interactions with Flickr became much faster, especially for users with slow connections, while our storage footprint became much smaller.  This was a best-case scenario for us.

Evaluating perceptual compression was a considerable task, but it gave us the confidence we needed to apply this compression in production to our users.  This marked the first time Flickr had adjusted image settings in years, and, it was fun.

[Fig 5: taste test high score list.]

Epilogue

After eighteen months of perceptual compression at Flickr, we adjusted our settings slightly to shrink images an additional 15%.  For our users on mobile devices, 15% fewer bytes per image makes for a much more responsive experience.  We ran a taste test on this newer setting, and users were able to spot our compression slightly more often than with our original settings.  When presented with a pair of identical images, our testers declared them identical 65.2% of the time; when presented with different images, our testers declared them identical 62% of the time.  It wasn’t as imperceptible as our original approach, but we decided it was close enough to roll out.

Boy were we wrong!  A few very vocal users spotted the compression and didn’t like it at all.  The Flickr Help Forum had a very lively thread, which Petapixel picked up.  We considered our options and came up with a middle path between our initial and follow-on approaches, giving us smaller, faster-to-load files while still maintaining the appearance our users expect.

Through our use of perceptual compression, combined with our use of on-the-fly resize and COS, we were able to decrease our storage footprint dramatically while simultaneously improving user experience.  It’s a win all around but we’re not done yet — we still have a few tricks up our sleeves.

FOMS/LCA Recap

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/foms-lca-recap.html

Finally, here’s my linux.conf.au 2007 and FOMS 2007
recap. Maybe a little bit late, but better late than never.

FOMS was a very well organized conference with a packed schedule
and a lot of high-profile attendees. To my surprise PulseAudio has been accepted by the
attendees without any opposition (at least none was expressed
aloud). After a few “discussions” on a few mailing lists (including
GNOME MLs) and some personal emails I got, I had thought that more
people were in opposition of the idea of having a userspace sound
daemon for the desktop. Apparently, I was overly pessimistic. Good
news, that!

During the FOMS conference we discussed the problems audio on Linux
currently has. One of the major issues still is that we’re lacking a
cross-platform PCM audio API everyone agrees on. ALSA is Linux-specific
and complicated to use. The only real contender is PortAudio. However,
PortAudio has its share of problems and hasn’t reach wide adoption
yet. Right now most larger software projects implement an audio
abstraction layer of some kind, and mostly in a very dirty, simplistic
and limited fasion. MPlayer does, Xine does it, Flash does
it. Everyone does it, and it sucks. (Note: this is only a very short
overview why audio on Linux sucks right now. For a longer one, please
have a look on the first 15mins of my PulseAudio talk at LCA, linked
below.)

Several people were asking why not to make the PulseAudio API the
new “standard” PCM API for Linux. Due to several reasons that would be a
bad idea. First of all, the PulseAudio API cannot be used on anything
else but PulseAudio. While PulseAudio has been ported to Win32, Vista
already has a userspace desktop sound server, hence running PulseAudio
on top of that doesn’t make much sense. Thus the API is not exactly
cross-platform. Secondly, I – as the guy who designed it – am not
happy with the current PulseAudio API. While it is very powerful it is
also very difficult to use and easy to misuse, mostly due to its fully
asynchronous nature. In addition it is also not exactly the smallest
API around.

So, what could be done about this? We agreed on a – maybe –
controversial solution: defining yet another abstracted PCM audio
API. Yes, fixing the problem that we have too many conflicting,
competing sound systems by defining yet another API sounds like a
paradox, but I do believe this is the right path to follow. Why?
Because none of the currently available solutions is suitable for all
application areas we have on Linux. Either the current APIs are not
portable, or they are horribly difficult to use properly, or have a
strange license, or are too simple in their functionality. MacOSX
managed to establish a single audio API (CoreAudio) that makes almost
everyone happy on that system – and we should be able to do the same for
Linux. Secondly, none of the current APIs has been designed with
network sound servers in mind. However, proper networking support
reflects back into the API, and in a non-trivial way. An API which
works fine in networked environment needs to eliminate roundtrips
where possible, be open for time interpolation and have a flexible
buffering (besides other minor things). Thirdly none of the current
APIs offers enough functionality to properly support all the needs of
modern desktop sound systems, such as per-stream volumes, stream names
and notifications about external state changes.

During FOMS and LCA, Mikko Leppanen (from Nokia), Jean-Marc Valin
(from Xiph) and I sat down and designed a draft API for the
functionality we would like to see in this API. For the time being we
dubbed it libsydney, after the city where we started this
project. I plan to make this the only supported audio API for
PulseAudio, eventually. Thus, if you will code against PulseAudio you
will get cross-platform support for free. In addition, because
PulseAudio is now being integrated into the major distributions (at
least Ubuntu and Fedora), this library will be made available on most
systems through the backdoor.

So, what will this new API offer? Firstly, the buffering model is
much more powerful than of any current sound API. The buffering model
mostly follows PulseAudio’s internal buffering model which
(theoretically) can offer zero-latency streaming and has been
pioneered by Jim Gettys’ AF sound server. It allows you to seek around
in the playback buffer very flexibly. This is very useful to allow
very fast reaction to the user’s playback control commands while still
allowing large buffers, which are good to deal with high network
lag. In addition it is very handy for the programmer, such as when
implementing streaming clients where packets may arrive
out-of-order. The API will emulate this buffering model on top of
traditional audio devices, and when used on top of PulseAudio it will use
its native implementation. The API will also clearly define which
sound formats are guaranteed to be available, thus making it a lot
easier to code without thinking of different hardware supporting
different formats all the time. Of course, the API will be easier to
use than PulseAudio’s current API. It will be very portable, scaling
from FPU-less architectures to pro-audio machines with a massive
number of synchronised channels. There are several modes available to
deal with XRUNs semi-automatically, one of them guaranteeing that the
time axis stays linear and monotonic in all events.

The list of features of this new API is much longer, however,
enough of these grand plans! We didn’t write any real code for this
yet. To make sure that this project is not another one of those which
are announced grandiosely without ever producing any code I will stop
listing features here now. We will eventually publish a first draft of
our C API for public discussion. Stay tuned.

Side-by-side with libsydney I discussed an abstract API
for desktop event sounds with Mikko (i.e. those annoying “bing” sounds
when you click a button and the like). Dubbed libcanberra
(named after the city which one of the developers visited after
Sydney), this will hopefully be for the PulseAudio sample cache API
what libsydney is for the PulseAudio streaming API: a total
replacement.

As a by-product of the libsydney discussion Jean-Marc
coded a
fast C resampling library
supporting both floating point and fixed
point and being licensed under BSD. (In contrast to
libsamplerate which is GPL and floating-point-only, but which
probably has better quality). PulseAudio will make use of this new
library, as will libsydney. And I sincerely hope that ALSA,
GStreamer and other projects replace their crappy home-grown
resamplers with this one!

For PulseAudio I was looking for a CODEC which we could use to
encode audio if we have to transfer it over the network. Such a CODEC
would need to have low CPU requirements and allow low-latency
operation, while providing hifi audio. Compression ratio is not such a
high requirement. Unfortunately, as it seems no such CODEC exists,
especially not a “Free” one. However, the Xiph people recommended to
hack up a special version of FLAC for this task. FLAC is fast, has
(obviously) good quality and if hacked up could provide low-latency
encoding. However, FLAC doesn’t compress that well. Current PulseAudio
thin-client installations require 170kB network bandwidth for each
client if hifi audio is used. Encoding this in FLAC could cut
that in half. Not perfect, but better than nothing.

So, that was FOMS! FOMS is a definitely highly recommended
conference. If you have the chance to attend next year, don’t miss it!
I’ve never been to a more productive, packed conference in my life!

At LCA I met fellow Avahi coder Trent Lloyd for the first time. Our
talk about Avahi went very well. During my flights to and back from
.au I hacked up avahi-ui
which I also announced during that talk. Also, in related news,
tedp started to work on an implementation of NAT-PMP
(aka “reverse firewall piercing”; both client and server) for
inclusion in Avahi. This will hopefully make the upcoming Wide-Area
DNS support in Avahi much more useful.
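
For the curious, NAT-PMP is a very small UDP protocol spoken with the
gateway; here is a sketch of the mapping request from the protocol
draft, purely illustrative and not tedp’s actual code:

    /* Sketch of the 12-byte NAT-PMP mapping request, as described in
     * the protocol draft; all multi-byte fields are big-endian. The
     * request is sent over UDP to the gateway on port 5351. */
    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>

    static size_t natpmp_map_request(uint8_t buf[12],
                                     uint16_t internal_port,
                                     uint16_t suggested_external_port,
                                     uint32_t lifetime_seconds) {
        uint16_t ip = htons(internal_port);
        uint16_t ep = htons(suggested_external_port);
        uint32_t lt = htonl(lifetime_seconds);

        buf[0] = 0;              /* version: always 0 */
        buf[1] = 1;              /* opcode: 1 = map UDP, 2 = map TCP */
        buf[2] = buf[3] = 0;     /* reserved */
        memcpy(buf + 4, &ip, 2); /* internal (private) port */
        memcpy(buf + 6, &ep, 2); /* suggested external port */
        memcpy(buf + 8, &lt, 4); /* mapping lifetime in seconds */
        return 12;
    }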

linux.conf.au was a very exciting conference. As a speaker
you’re treated like a rock star, with stuff like the speakers’ dinner,
the speakers’ adventure (climbing on top of Sydney’s AMP tower) and
the penguin dinner. Heck, the organizers even picked me up at the
airport, something I really didn’t expect when I landed in Sydney,
which however is quite nice after a 27h flight.

Two talks I particularly enjoyed at LCA:

  • nouveau – reverse engineered nvidia drivers (Ogg Theora)
  • burning cpu and battery on the gnome desktop (Ogg Theora)

And just for the sake of completeness, here are the links to my presentations:

  • The PulseAudio Sound Server (Ogg Theora; Slides)
  • Using Avahi the “Right Way” (Ogg Theora; Slides)

Ok, that’s it for now. Thanks go to Silvia Pfeiffer, the rest of
the FOMS team and the Seven Team for organizing these two amazing
conferences!

A few updates on PulseAudio

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/pulse-news.html

Thanks to Marc-Andre Lureau there’s now a jhbuild file for
PulseAudio. And there is this (little bit chaotic) Wiki page in
GNOME Live! about the relation of PulseAudio and GNOME.

A few weeks ago I wrote a new page for our Wiki where I tried to
describe the steps necessary to get the most out of PulseAudio. It’s
called the Perfect Setup.

A few minutes ago I released PulseAudio 0.9.5 and new versions of the auxiliary tools. The changelog:

  • Add module-hal-detect, a module that detects all local sound hardware using HAL and loads the necessary modules. Handles hot-plug and hot-removal of audio devices. (Contributed by Shahms E. King)
  • Add shared memory transfer method for local clients
  • Update module-volume-restore to automatically restore the output device last used by an application in addition to the volume it last used
  • Add a new module module-rescue-streams for automatically moving streams to another sink/source if the sink/source they are connected to dies
  • Add support for moving streams “hot” between sinks/sources (see the client-side sketch after this list)
  • Reduce memory consumption and CPU load as a result of Valgrind/Massif profiling
  • Add a new module module-gconf for reading additional configuration statements from GConf
  • Fix module-tunnel to work with the latest protocol
  • Miscellaneous fixes
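
To make the “hot” moving concrete, here is a minimal client-side
sketch using the introspection API (names as in the PulseAudio
headers; context setup and the lookup of the sink input index are
assumed to have happened elsewhere, and the sink name is
hypothetical):

    #include <pulse/pulseaudio.h>

    /* Reroute one playing stream to the sink named "usb_headset"
     * without interrupting playback. */
    static void move_stream(pa_context *c, uint32_t sink_input_idx) {
        pa_operation *o = pa_context_move_sink_input_by_name(
            c, sink_input_idx, "usb_headset", NULL, NULL);
        if (o)
            pa_operation_unref(o);
    }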

One of the nicest new features of PulseAudio 0.9.5 is HAL
integration (which has been contributed by Shahms King). PulseAudio will
now automatically detect all available sound devices and will make
use of them. It supports both hot-plug and hot-remove.

Another nice feature is the GConf integration, which allowed us to add a new application to the PulseAudio toolset: the PulseAudio Preferences utility:

[Screenshot: the paprefs PulseAudio Preferences utility]

The idea is to have a simple, nice configuration dialog that allows
configuration of the more exotic features of PulseAudio which we do
not enable by default due to security considerations or to not
confuse the user. Right now a lot of features are hidden behind
non-trivial configuration file statements. This preferences tool shall
make them available to users who are not so keen on editing
configuration files.

Playing around with Valgrind’s
Massif tool and KCachegrind I did a little bit of memory and
performance profiling of the PulseAudio daemon. The 0.9.5 release
contains a lot of optimizations which are a result of this work.

Before:

[Massif memory profile: before the optimizations]

After:

[Massif memory profile: after the optimizations]

These plots show memory consumption over time, from starting the
server, to playing a stream, to stopping the stream and shutting down
the server again. The major improvement was actually an update to
libsamplerate done by its maintainer to improve the memory handling of
that library. (He hasn’t released an updated version of his library
containing the changes shown in the plots yet.)

PulseAudio has had the nice feature of remembering the playback volume of
every application for quite a while. Starting with 0.9.5 it also remembers
the output device for every application. Together with the updated Volume
Control tool, which now allows moving streams between sinks while they are
being played, this can be used to configure a ruleset like “Ekiga always on
the USB headset, Rhythmbox always on the external speakers” very intuitively
and easily:

[Screenshot: the pavucontrol Volume Control tool]

And here’s a final screenshot showing all the tools we currently have for PulseAudio 0.9.5.

[Screenshot: the full PulseAudio 0.9.5 toolset]
