Tag Archives: hive

Eevee mugshot set for Doom

Post Syndicated from Eevee original https://eev.ee/release/2017/11/23/eevee-mugshot-set-for-doom/

Screenshot of Industrial Zone from Doom II, with an Eevee face replacing the usual Doom marine in the status bar

A full replacement of Doomguy’s vast array of 42 expressions.

You can get it yourself if you want to play Doom as me, for some reason? It does nothing but replace a few sprites, so it works with any Doom flavor (including vanilla) on 1, 2, or Final. Just run Doom with -file eeveemug.wad. With GZDoom, you can load it automatically.

I don’t entirely know why I did this. I drew the first one on a whim, then realized there was nothing really stopping me from making a full set, so I spent a day doing that.

The funny thing is that I usually play Doom with ZDoom’s “alternate” HUD. It’s a full-screen overlay rather than a huge bar, and — crucially — it does not show the mugshot. It can’t even be configured to show the mugshot. As far as I’m aware, it can’t even be modded to show the mugshot. So I have to play with the OG status bar if I want to actually use the thing I made.

Preview of the Eevee mugshot sprites arranged in a grid, where the Eevee becomes more beaten up in each subsequent column

I’m pretty happy with the results overall! I think I did a decent job emulating the Doom “surreal grit” style. I did the shading with Aseprite’s shading mode — instead of laying down a solid color, it shifts pixels along a ramp of colors you select every time you draw over them. Doom’s palette has a lot of browns, so I made a ramp out of all of them and kept going over furry areas, nudging pixels lighter or darker, until I liked the texture. It was a lot like building up texture in a sketch with scratchy pencil strokes.

I also gleaned some interesting things about smoothness and how the eye interprets contours? I tried to explain this on Twitter and had a hell of a time putting it into words, but the short version is that it’s amazing to see the difference a single misplaced pixel can make, especially as you slide that pixel between dark and light.

Doom's palette of 256 colors, many of which are very long gradients of reds and browns

Speaking of which, Doom’s palette is incredibly weird to work with. Thank goodness Eevees are brown! The game does have to draw arbitrary levels of darkness all with the same palette, which partly explains the number of dark colors and gradients — but I believe a number of the colors are exact duplicates, so close they might as well be duplicates, or completely unused in stock Doom assets. I guess they had no reason to optimize for people trying to add arbitrary art to the game 25 years later, though. (And nowadays, GZDoom includes a truecolor software renderer, so the palette is becoming less and less important.)

I originally wanted the god mode sprite to be a Sylveon, but Sylveon is made of pink and azure and blurple, and I don’t think I could’ve pulled it off with this set of colors. I even struggled with the color of the mane a bit — I usually color it with pretty pale colors, but Doom only has a couple of those, and they’re very saturated. I ended up using a lot more dark yellows than I would normally, and thankfully it worked out pretty well.

The most significant change I made between the original sprite and the final set was the eye color:

A comparison between an original Doom mugshot sprite, the first sprite I drew, and how it ended up

(This is STFST20, a frame from the default three-frame “glancing around” animation that plays when the player has between 40 and 59 health. The Doom Wiki has a whole article on the mugshot if you’re interested.)

The blue eyes in my original just do not work at all. The Doom palette doesn’t have a lot of subtle colors, and its blues in particular are incredibly bad. In the end, I made the eyes basically black, though with a couple pixels of very dark blue in them.

After I decided to make the full set, I started by making a neutral and completely healthy front pose, then derived the others from that (with a very complicated system of layers). You can see some of the side effects of that here: the face doesn’t actually turn when glancing around, because hoo boy that would’ve been a lot of work, and so the cheek fluff is visible on both sides.

I also notice that there are two columns of identical pixels in each eye! I fixed that in the glance to the right, but must’ve forgotten about it here. Oh, well; I didn’t even notice until I zoomed in just now.

A general comparison between the Doom mugshots and my Eevee ones, showing each pose in its healthy state plus the neutral pose in every state of deterioration

The original sprites might not be quite aligned correctly in the above image. The available space in the status bar is 35×31, of which a couple pixels go to an inset border, leaving 33×30. I drew all of my sprites at that size, but the originals are all cropped and have varying offsets (part of the Doom sprite format). I extremely can’t be assed to check all of those offsets for over a dozen sprites, so I just told ImageMagick to center them. (I only notice right now that some of the original sprites are even a full 31 pixels tall and draw over the top border that I was so careful to stay out of!)

Anyway, this is a representative sample of the Doom mugshot poses.

The top row shows all eight frames at full health. The first three are the “idle” state, drawn when nothing else is going on; the sprite usually faces forwards, but glances around every so often at random. The forward-facing sprite is the one I finalized first.

I tried to take a lot of cues from the original sprite, seeing as I wanted to match the style. I’d never tried drawing a sprite with a large palette and a small resolution before, and the first thing that struck me was Doomguy’s lips — the upper lip, lips themselves, and shadow under the lower lip are all created with only one row of pixels each. I thought that was amazing. Now I even kinda wish I’d exaggerated that effect a bit more, but I was wary of going too dark when there’s a shadow only a couple pixels away. I suppose Doomguy has the advantage of having, ah, a chin.

I did much the same for the eyebrows, which was especially necessary because Doomguy has more of a forehead than my Eevee does. I probably could’ve exaggerated those a bit more, as well! Still, I love how they came out — especially in the simple looking-around frames, where even a two-pixel eyebrow raise is almost comically smug.

The fourth frame is a wild-ass grin (even named STFEVL0), which shows for a short time after picking up a new weapon. Come to think of it, that’s a pretty rare occurrence when playing straight through one of the Doom games; you keep your weapons between levels.

The fifth through seventh are also a set. If the player takes damage, the status bar will briefly show one of these frames to indicate where the damage is coming from. You may notice that where Doomguy bravely faces the source of the pain, I drew myself wincing and recoiling away from it.

The middle frame of that set also appears while the player is firing continuously (regardless of damage), so I couldn’t really make it match the left and right ones. I like the result anyway. It was also great fun figuring out the expressions with the mouth — that’s another place where individual pixels make a huge difference.

Finally, the eighth column is the legendary “ouch” face, which is supposed to appear when the player takes more than 20 damage at once. It may look completely alien to you, because vanilla Doom has a bug that only shows this face when the player gains 20 or more health while taking damage. This is vanishingly rare (though possible!), so the frame virtually never appears in vanilla Doom. Lots of source ports have fixed this bug, making the ouch face a bit better known, but I usually play without the mugshot visible, so it still looks super weird to me. I think my own spin on it is a bit less, ah, body horror?

The second row shows deterioration. It is pretty weird drawing yourself getting beaten up.

A lot of Doomguy’s deterioration is in the form of blood dripping from under his hair, which I didn’t think would translate terribly well to a character without hair. Instead, I went a little cartoony with it, adding bandages here and there. I had a little bit of a hard time with the bloodshot eyes at this resolution, which I realize as I type it is a very poor excuse when I had eyes three times bigger than Doomguy’s. I do love the drooping ears, with the possible exception of the fifth state, which I’m not sure is how that would actually look…? Oh well. I also like the bow becoming gradually unravelled, eventually falling off entirely when you die.

Oh, yes, the sixth frame there (before the gap) is actually for a dead player. Doomguy’s bleeding becomes markedly more extreme here, but again that didn’t really work for me, so I went a little sillier with it. A little. It’s still pretty weird drawing yourself dead.

That leaves only god mode, which is incredible. I love that glow. I love the faux whisker shapes it makes. I love how it fades into the background. I love that 100% pure “oh this is pretty good” smile. It all makes me want to just play Doom in god mode forever.

Now that I’ve looked closely at these sprites again, I spy a good half dozen little inconsistencies and nitpicks, which I’m going to refrain from spelling out. I did do this in only a day, and I think it came out pretty dang well considering.

Maybe I’ll try something else like this in the future. Not quite sure what, though; there aren’t many small and self-contained sets of sprites like this in Doom. Monsters are several times bigger and have a zillion different angles. Maybe some pickups, which only have one frame?

Hmm. Parting thought: I’m not quite sure where I should host this sort of one-off thing. It arguably belongs on Itch, but seems really out of place alongside entire released games. It also arguably belongs on the idgames archive, but I’m hesitant to put it there because it’s such an obscure thing of little interest to a general audience. At the moment it’s just a file I’ve uploaded to wherever on my own space, but I now have three little Doom experiments with no real permanent home.

How to Enable Caching for AWS CodeBuild

Post Syndicated from Karthik Thirugnanasambandam original https://aws.amazon.com/blogs/devops/how-to-enable-caching-for-aws-codebuild/

AWS CodeBuild is a fully managed build service. There are no servers to provision and scale, or software to install, configure, and operate. You just specify the location of your source code, choose your build settings, and CodeBuild runs build scripts for compiling, testing, and packaging your code.

A typical application build process includes phases like preparing the environment, updating the configuration, downloading dependencies, running unit tests, and finally, packaging the built artifact.

Downloading dependencies is a critical phase in the build process. These dependent files can range in size from a few KBs to multiple MBs. Because most of the dependent files do not change frequently between builds, you can noticeably reduce your build time by caching dependencies.

In this post, I will show you how to enable caching for AWS CodeBuild.


Prerequisites:

  • Create an Amazon S3 bucket for storing cache archives (you can use an existing S3 bucket as well).
  • Create a GitHub account (if you don’t have one).

Create a sample build project:

1. Open the AWS CodeBuild console at https://console.aws.amazon.com/codebuild/.

2. If a welcome page is displayed, choose Get started.

If a welcome page is not displayed, on the navigation pane, choose Build projects, and then choose Create project.

3. On the Configure your project page, for Project name, type a name for this build project. Build project names must be unique across each AWS account.

4. In Source: What to build, for Source provider, choose GitHub.

5. In Environment: How to build, for Environment image, select Use an image managed by AWS CodeBuild.

  • For Operating system, choose Ubuntu.
  • For Runtime, choose Java.
  • For Version,  choose aws/codebuild/java:openjdk-8.
  • For Build specification, select Insert build commands.

Note: The build specification file (buildspec.yml) can be configured in two ways. You can package it along with your source root directory, or you can override it by using a project environment configuration. In this example, I will use the override option and will use the console editor to specify the build specification.

6. Under Build commands, click Switch to editor to enter the build specification.

Copy the following text.

version: 0.2

phases:
  build:
    commands:
      - mvn install

cache:
  paths:
    - '/root/.m2/**/*'

Note: The cache section in the build specification instructs AWS CodeBuild about the paths to be cached. Like the artifacts section, the cache paths are relative to $CODEBUILD_SRC_DIR and specify the directories to be cached. In this example, Maven stores the downloaded dependencies to the /root/.m2/ folder, but other tools use different folders. For example, pip uses the /root/.cache/pip folder, and Gradle uses the /root/.gradle/caches folder. You might need to configure the cache paths based on your language platform.

7. In Artifacts: Where to put the artifacts from this build project:

  • For Type, choose No artifacts.

8. In Cache:

  • For Type, choose Amazon S3.
  • For Bucket, choose your S3 bucket.
  • For Path prefix, type cache/archives/

9. In Service role, the Create a service role in your account option will display a default role name.  You can accept the default name or type your own.

If you already have an AWS CodeBuild service role, choose Choose an existing service role from your account.

10. Choose Continue.

11. On the Review page, to run a build, choose Save and build.

Review build and cache behavior:

Let us review our first build for the project.

In the first run, where no cache exists, the build takes noticeably longer; the times for the DOWNLOAD_SOURCE, BUILD, and POST_BUILD phases show where the work goes.

If you check the build logs, you will see log entries for dependency downloads. The dependencies are downloaded directly from configured external repositories. At the end of the log, you will see an entry for the cache uploaded to your S3 bucket.

Let’s review the S3 bucket for the cached archive. You’ll see the cache from our first successful build is uploaded to the configured S3 path.

Let’s try another build with the same CodeBuild project. This time the build should pick up the dependencies from the cache.

In the second run, there was a cache hit: the cache generated by the first run was reused.

You’ll notice a few things:

  1. DOWNLOAD_SOURCE took slightly longer, because in addition to the source code, the build also downloaded the cache from your S3 bucket.
  2. BUILD time was faster, because the dependencies didn’t need to be downloaded; they were reused from the cache.
  3. POST_BUILD took slightly longer, but remained roughly the same.

Overall, the build duration improved with the cache.

Best practices for cache

  • By default, the cache archive is encrypted on the server side with the customer’s artifact KMS key.
  • You can expire the cache by manually removing the cache archive from S3. Alternatively, you can expire the cache by using an S3 lifecycle policy.
  • You can override cache behavior by updating the project. You can use the AWS CodeBuild console, AWS CLI, or AWS SDKs to update the project. You can also invalidate the cache by using the new InvalidateProjectCache API. This API forces a new InvalidationKey to be generated, ensuring that future builds receive an empty cache. This API does not remove the existing cache, because that could cause inconsistencies with builds currently in flight. (A sketch of calling this API follows this list.)
  • The cache can be enabled for any folders in the build environment, but we recommend you only cache dependencies/files that will not change frequently between builds. Also, to avoid unexpected application behavior, don’t cache configuration and sensitive information.
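
Picking up the InvalidateProjectCache point above, here is a minimal sketch of calling it with the AWS SDK for Python (boto3); the project name is made up:

import boto3

codebuild = boto3.client("codebuild")

# Forces a new InvalidationKey, so future builds start with an empty cache.
# Existing cache archives stay in S3, so in-flight builds are unaffected.
codebuild.invalidate_project_cache(projectName="my-maven-project")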


In this blog post, I showed you how to enable and configure cache settings for AWS CodeBuild. As you can see, this can save considerable build time. It also improves resiliency by avoiding external network connections to an artifact repository.

I hope you found this post useful. Feel free to leave your feedback or suggestions in the comments.

What We’re Thankful For

Post Syndicated from Roderick Bauer original https://www.backblaze.com/blog/what-were-thankful-for/

All of us at Backblaze hope you have a wonderful Thanksgiving, and that you can enjoy it with family and friends. We asked everyone at Backblaze to express what they are thankful for. Here are their responses.

Fall leaves

What We’re Thankful For

Aside from friends, family, hobbies, health, etc. I’m thankful for my home. It’s not much, but it’s mine, and allows me to indulge in everything listed above. Or not, if I so choose. And coffee.

— Tony

I’m thankful for my wife Jen, and my other friends. I’m thankful that I like my coworkers and can call them friends too. I’m thankful for my health. I’m thankful that I was born into a middle class family in the US and that I have been very, very lucky because of that.

— Adam

Besides the most important things which are being thankful for my family, my health and my friends, I am very thankful for Backblaze. This is the first job I’ve ever had where I truly feel like I have a great work/life balance. With having 3 kids ages 8, 6 and 4, a husband that works crazy hours and my tennis career on the rise (kidding but I am on 4 teams) it’s really nice to feel like I have balance in my life. So cheers to Backblaze – where a girl can have it all!

— Shelby

I am thankful to work at a high-tech company that recognizes the contributions of engineers in their 40s and 50s.

— Jeannine

I am thankful for the music, the songs I’m singing. Thankful for all the joy they’re bringing. Who can live without it, I ask in all honesty? What would life be? Without a song or a dance what are we? So I say thank you for the music. For giving it to me!

— Yev

I’m thankful that I don’t look anything like the portrait my son draws of me…seriously.

— Natalie

I am thankful to work for a company that puts its people and product ahead of profits.

— James

I am thankful that even in the middle of disasters, turmoil, and violence, there are always people who commit amazing acts of generosity, courage, and kindness that restore my faith in mankind.

— Roderick

The future.

— Ahin

The Future

I am thankful for the current state of modern inexpensive broadband networking that allows me to stay in touch with friends and family that are far away, allows Backblaze to exist and pay my salary so I can live comfortably, and allows me to watch cat videos for free. The internet makes this an amazing time to be alive.

— Brian

Other than being thankful for family & good health, I’m quite thankful through the years I’ve avoided losing any of my 12+TB photo archive. 20 years of photoshoots, family photos and cell phone photos kept safe through changing storage media (floppy drives, flopticals, ZIP, JAZ, DVD-RAM, CD, DVD and hard drives), not to mention various technology/software solutions. It’s a data minefield out there, especially in the long run with changing media formats.

— Jim

I am thankful for non-profit organizations and their volunteers, such as IMAlive. Possibly the greatest gift you can give someone is empowerment, and an opportunity for them to recognize their own resilience and strength.

— Emily

I am thankful for my loving family, friends who make me laugh, a cool company to work for, talented co-workers who make me a better engineer, and beautiful Fall days in Wisconsin!

— Marjorie


I’m thankful for preschool drawings about thankfulness.

— Adam

I am thankful for new friends and working for a company that allows us to be ourselves.

— Annalisa

I’m thankful for my dog as I always find a reason to smile at him everyday. Yes, he still smells from his skunkin’ last week and he tracks mud in my house, but he came from the San Quentin puppy-prisoner program and I’m thankful I found him and that he found me! My vet is thankful as well.

— Terry

I’m thankful that my colleagues are also my friends outside of the office and that the rainy season has started in California.

— Aaron

I’m thankful for family, friends, and beer. Mostly for family and friends, but beer is really nice too!

— Ken

There are so many amazing blessings that make up my daily life that I thank God for, so here I go – my basic needs of food, water and shelter, my husband and 2 daughters and the rest of the family (here and abroad) — their love, support, health, and safety, waking up to a new day every day, friends, music, my job, funny things, hugs and more hugs (who does not like hugs?).

— Cecilia

I am thankful to be blessed with a close-knit extended family, and for everything they do for my new, growing family. With a toddler and a second child on the way, it helps having so many extra sets of hands around to help with the kids!

— Zack

I’m thankful for family and friends, the opportunities my parents gave me by moving to the U.S., and that all of us together at Backblaze have built a place to be proud of.

— Gleb

Aside from being thankful for family and friends, I am also thankful I live in a place with such natural beauty. Being so close to mountains and the ocean, and everything in between, is something that I don’t take for granted!

— Sona

I’m thankful for my wonderful wife, family, friends, and co-workers. I’m thankful for having a happy and healthy son, and the chance to watch him grow on a daily basis.

— Ariel

I am thankful for a dog-friendly workplace.

— LeAnn

I’m thankful for my amazing new wife and that she’s as much of a nerd as I am.

— Troy

I am thankful for every reunion with my siblings and families.

— Cecilia

I am thankful for my funny, strong-willed, happy daughter, my awesome husband, my family, and amazing friends. I am also thankful for the USA and all the opportunities that come with living here. Finally, I am thankful for Backblaze, a truly great place to work and for all of my co-workers/friends here.

— Natasha

I am thankful that I do not need to hunt and gather every day to put food on the table, but at the same time I feel that I don’t appreciate the food that sits before me as much as I should. So I use Thanksgiving to think about the people and the animals that put food on my family’s table.

— KC

I am thankful for my cat, Catnip. She’s been with me for 18 years and seen me through so many ups and downs. She’s been by my side through two long-term relationships, several moves, and one marriage. I know we don’t have much time together and feel blessed every day she’s here.

— JC

I am thankful for imperfection and misshapen candies. The imperceptible romance of sunsets through bus windows. The dream that family, friends, co-workers, and strangers are connected by love. I am thankful to my ancestors for enduring so much hardship so that I could be here enjoying Bay Area burritos.

— Damon

Autumn leaves

The post What We’re Thankful For appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

HiveMQ 3.2.8 released

Post Syndicated from The HiveMQ Team original https://www.hivemq.com/blog/hivemq-3-2-8-released/

The HiveMQ team is pleased to announce the availability of HiveMQ 3.2.8. This is a maintenance release for the 3.2 series and brings the following improvements:

  • Improved performance for payload disk persistence
  • Improved performance for subscription disk persistence
  • Improved exception handling in OnSubscribeCallback when an Exception is not caught by a plugin
  • Fixed an issue where the metric for discarded messages “QoS 0 Queue not empty” was increased when a client is offline
  • Fixed an issue where the convenience methods for an SslCertificate might return null for certain extensions
  • Fixed an issue which could lead to the OnPubackReceivedCallback being executed when the inflight queue is full
  • Fixed an issue where a scheduled background cleanup job could cause an error in the logs
  • Fixed an issue which could lead to an IllegalArgumentException when sending a QoS 0 message in a rare edge-case
  • Fixed an issue where an error “Exception while handling batched publish request” was logged without reason

You can download the new HiveMQ version here.

We recommend upgrading if you are a HiveMQ 3.2.x user.

Have a great day,
The HiveMQ Team

Event-Driven Computing with Amazon SNS and AWS Compute, Storage, Database, and Networking Services

Post Syndicated from Christie Gifrin original https://aws.amazon.com/blogs/compute/event-driven-computing-with-amazon-sns-compute-storage-database-and-networking-services/

Contributed by Otavio Ferreira, Manager, Software Development, AWS Messaging

Like other developers around the world, you may be tackling increasingly complex business problems. A key success factor, in that case, is the ability to break down a large project scope into smaller, more manageable components. A service-oriented architecture guides you toward designing systems as a collection of loosely coupled, independently scaled, and highly reusable services. Microservices take this even further. To improve performance and scalability, they promote fine-grained interfaces and lightweight protocols.

However, the communication among isolated microservices can be challenging. Services are often deployed onto independent servers and don’t share any compute or storage resources. Also, you should avoid hard dependencies among microservices, to preserve maintainability and reusability.

If you apply the pub/sub design pattern, you can effortlessly decouple and independently scale out your microservices and serverless architectures. A pub/sub messaging service, such as Amazon SNS, promotes event-driven computing that statically decouples event publishers from subscribers, while dynamically allowing for the exchange of messages between them. An event-driven architecture also introduces the responsiveness needed to deal with complex problems, which are often unpredictable and asynchronous.

What is event-driven computing?

Given the context of microservices, event-driven computing is a model in which subscriber services automatically perform work in response to events triggered by publisher services. This paradigm can be applied to automate workflows while decoupling the services that collectively and independently work to fulfil these workflows. Amazon SNS is an event-driven computing hub, in the AWS Cloud, that has native integration with several AWS publisher and subscriber services.
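
As a minimal sketch of the publisher half of this model, assuming the AWS SDK for Python (boto3) and a made-up topic ARN and payload:

import json
import boto3

sns = boto3.client("sns")

# Publish a domain event; SNS pushes a copy to every subscriber
# (Lambda functions, SQS queues, HTTP endpoints, and so on).
sns.publish(
    TopicArn="arn:aws:sns:us-east-1:123456789012:order-events",  # hypothetical
    Subject="OrderCreated",
    Message=json.dumps({"orderId": "12345", "status": "created"}),
)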

Which AWS services publish events to SNS natively?

Several AWS services have been integrated as SNS publishers and, therefore, can natively trigger event-driven computing for a variety of use cases. In this post, I specifically cover AWS compute, storage, database, and networking services, as depicted below.

Compute services

  • Auto Scaling: Helps you ensure that you have the correct number of Amazon EC2 instances available to handle the load for your application. You can configure Auto Scaling lifecycle hooks to trigger events, as Auto Scaling resizes your EC2 cluster. As an example, you may want to warm up the local cache store on newly launched EC2 instances, and also download log files from other EC2 instances that are about to be terminated. To make this happen, set an SNS topic as your Auto Scaling group’s notification target, then subscribe two Lambda functions to this SNS topic. The first function is responsible for handling scale-out events (to warm up cache upon provisioning), whereas the second is in charge of handling scale-in events (to download logs upon termination). (A sketch of such a handler follows this list.)

  • AWS Elastic Beanstalk: An easy-to-use service for deploying and scaling web applications and web services developed in a number of programming languages. You can configure event notifications for your Elastic Beanstalk environment so that notable events can be automatically published to an SNS topic, then pushed to topic subscribers. As an example, you may use this event-driven architecture to coordinate your continuous integration pipeline (such as Jenkins CI). That way, whenever an environment is created, Elastic Beanstalk publishes this event to an SNS topic, which triggers a subscribing Lambda function, which then kicks off a CI job against your newly created Elastic Beanstalk environment.

  • Elastic Load Balancing: Automatically distributes incoming application traffic across Amazon EC2 instances, containers, or other resources identified by IP addresses. You can configure CloudWatch alarms on Elastic Load Balancing metrics, to automate the handling of events derived from Classic Load Balancers. As an example, you may leverage this event-driven design to automate latency profiling in an Amazon ECS cluster behind a Classic Load Balancer. In this example, whenever your ECS cluster breaches your load balancer latency threshold, an event is posted by CloudWatch to an SNS topic, which then triggers a subscribing Lambda function. This function runs a task on your ECS cluster to trigger a latency profiling tool, hosted on the cluster itself. This can enhance your latency troubleshooting exercise by making it timely.
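
To make the Auto Scaling example concrete, here is a hedged sketch of a Lambda function subscribed to the lifecycle-hook SNS topic; the cache-warming helper is a made-up stub, and the hook details are read from the notification itself:

import json
import boto3

autoscaling = boto3.client("autoscaling")

def handler(event, context):
    # SNS delivers the lifecycle notification as a JSON string.
    message = json.loads(event["Records"][0]["Sns"]["Message"])

    if message.get("LifecycleTransition") == "autoscaling:EC2_INSTANCE_LAUNCHING":
        warm_up_cache(message["EC2InstanceId"])  # hypothetical helper

    # Tell Auto Scaling it may proceed with the launch or termination.
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=message["LifecycleHookName"],
        AutoScalingGroupName=message["AutoScalingGroupName"],
        LifecycleActionToken=message["LifecycleActionToken"],
        LifecycleActionResult="CONTINUE",
    )

def warm_up_cache(instance_id):
    pass  # stub: pre-populate the local cache store on the new instance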

Storage services

  • Amazon S3: Object storage built to store and retrieve any amount of data. You can enable S3 event notifications, and automatically get them posted to SNS topics, to automate a variety of workflows. For instance, imagine that you have an S3 bucket to store incoming resumes from candidates, and a fleet of EC2 instances to encode these resumes from their original format (such as Word or text) into a portable format (such as PDF). In this example, whenever new files are uploaded to your input bucket, S3 publishes these events to an SNS topic, which in turn pushes these messages into subscribing SQS queues. Then, encoding workers running on EC2 instances poll these messages from the SQS queues; retrieve the original files from the input S3 bucket; encode them into PDF; and finally store them in an output S3 bucket. (A sketch of such a worker follows this list.)

  • Amazon EFS: Provides simple and scalable file storage, for use with Amazon EC2 instances, in the AWS Cloud. You can configure CloudWatch alarms on EFS metrics, to automate the management of your EFS systems. For example, consider a highly parallelized genomics analysis application that runs against an EFS system. By default, this file system is instantiated on the “General Purpose” performance mode. Although this performance mode allows for lower latency, it might eventually impose a scaling bottleneck. Therefore, you may leverage an event-driven design to handle it automatically. Basically, as soon as the EFS metric “Percent I/O Limit” breaches 95%, CloudWatch could post this event to an SNS topic, which in turn would push this message into a subscribing Lambda function. This function automatically creates a new file system, this time on the “Max I/O” performance mode, then switches the genomics analysis application to this new file system. As a result, your application starts experiencing higher I/O throughput rates.

  • Amazon Glacier: A secure, durable, and low-cost cloud storage service for data archiving and long-term backup. You can set a notification configuration on an Amazon Glacier vault so that when a job completes, a message is published to an SNS topic. Retrieving an archive from Amazon Glacier is a two-step asynchronous operation, in which you first initiate a job, and then download the output after the job completes. Therefore, SNS helps you eliminate polling your Amazon Glacier vault to check whether your job has been completed, or not. As usual, you may subscribe SQS queues, Lambda functions, and HTTP endpoints to your SNS topic, to be notified when your Amazon Glacier job is done.

  • AWS Snowball: A petabyte-scale data transport solution that uses secure appliances to transfer large amounts of data. You can leverage Snowball notifications to automate workflows related to importing data into and exporting data from AWS. More specifically, whenever your Snowball job status changes, Snowball can publish this event to an SNS topic, which in turn can broadcast the event to all its subscribers. As an example, imagine a Geographic Information System (GIS) that distributes high-resolution satellite images to users via Web browser. In this example, the GIS vendor could capture up to 80 TB of satellite images; create a Snowball job to import these files from an on-premises system to an S3 bucket; and provide an SNS topic ARN to be notified upon job status changes in Snowball. After Snowball changes the job status from “Importing” to “Completed”, Snowball publishes this event to the specified SNS topic, which delivers this message to a subscribing Lambda function, which finally creates a CloudFront web distribution for the target S3 bucket, to serve the images to end users.
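
Here is a hedged sketch of the S3 resume-encoding worker described above; the queue URL, output bucket, and PDF converter are all made up:

import json
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/resume-encoding"  # hypothetical

def poll_once():
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for msg in response.get("Messages", []):
        envelope = json.loads(msg["Body"])          # the SNS envelope
        s3_event = json.loads(envelope["Message"])  # the original S3 event
        for record in s3_event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            s3.download_file(bucket, key, "/tmp/input")
            encode_to_pdf("/tmp/input", "/tmp/output.pdf")  # hypothetical helper
            s3.upload_file("/tmp/output.pdf", "resume-output-bucket", key + ".pdf")
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

def encode_to_pdf(src, dst):
    pass  # stub: convert the original document into PDF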

Database services

  • Amazon RDS: Makes it easy to set up, operate, and scale a relational database in the cloud. RDS leverages SNS to broadcast notifications when RDS events occur. As usual, these notifications can be delivered via any protocol supported by SNS, including SQS queues, Lambda functions, and HTTP endpoints. As an example, imagine that you own a social network website that has experienced organic growth, and needs to scale its compute and database resources on demand. In this case, you could provide an SNS topic to listen to RDS DB instance events. When the “Low Storage” event is published to the topic, SNS pushes this event to a subscribing Lambda function, which in turn leverages the RDS API to increase the storage capacity allocated to your DB instance. The provisioning itself takes place within the specified DB maintenance window. (A sketch of such a function follows this list.)

  • Amazon ElastiCache: A web service that makes it easy to deploy, operate, and scale an in-memory data store or cache in the cloud. ElastiCache can publish messages using Amazon SNS when significant events happen on your cache cluster. This feature can be used to refresh the list of servers on client machines connected to individual cache node endpoints of a cache cluster. For instance, an ecommerce website fetches product details from a cache cluster, with the goal of offloading a relational database and speeding up page load times. Ideally, you want to make sure that each web server always has an updated list of cache servers to which to connect. To automate this node discovery process, you can get your ElastiCache cluster to publish events to an SNS topic. Thus, when ElastiCache event “AddCacheNodeComplete” is published, your topic then pushes this event to all subscribing HTTP endpoints that serve your ecommerce website, so that these HTTP servers can update their list of cache nodes.

  • Amazon Redshift: A fully managed data warehouse that makes it simple to analyze data using standard SQL and BI (Business Intelligence) tools. Amazon Redshift uses SNS to broadcast relevant events so that data warehouse workflows can be automated. As an example, imagine a news website that sends clickstream data to a Kinesis Firehose stream, which then loads the data into Amazon Redshift, so that popular news and reading preferences might be surfaced on a BI tool. At some point though, this Amazon Redshift cluster might need to be resized, and the cluster enters a read-only mode. Hence, this Amazon Redshift event is published to an SNS topic, which delivers this event to a subscribing Lambda function, which finally deletes the corresponding Kinesis Firehose delivery stream, so that clickstream data uploads can be put on hold. At a later point, after Amazon Redshift publishes the event that the maintenance window has been closed, SNS notifies a subscribing Lambda function accordingly, so that this function can re-create the Kinesis Firehose delivery stream, and resume clickstream data uploads to Amazon Redshift.

  • AWS DMS: Helps you migrate databases to AWS quickly and securely. The source database remains fully operational during the migration, minimizing downtime to applications that rely on the database. DMS also uses SNS to provide notifications when DMS events occur, which can automate database migration workflows. As an example, you might create data replication tasks to migrate an on-premises MS SQL database, composed of multiple tables, to MySQL. Thus, if replication tasks fail due to incompatible data encoding in the source tables, these events can be published to an SNS topic, which can push these messages into a subscribing SQS queue. Then, encoders running on EC2 can poll these messages from the SQS queue, encode the source tables into a compatible character set, and restart the corresponding replication tasks in DMS. This is an event-driven approach to a self-healing database migration process.
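
And a hedged sketch of the RDS “Low Storage” handler from the example above; the 25% growth policy is an arbitrary choice, not anything RDS prescribes:

import json
import boto3

rds = boto3.client("rds")

def handler(event, context):
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    # Assumption: RDS event notifications carry the instance name in "Source ID".
    instance_id = message["Source ID"]

    current = rds.describe_db_instances(
        DBInstanceIdentifier=instance_id)["DBInstances"][0]["AllocatedStorage"]

    # Grow allocated storage by 25%; with ApplyImmediately=False, RDS applies
    # the change during the configured maintenance window.
    rds.modify_db_instance(
        DBInstanceIdentifier=instance_id,
        AllocatedStorage=int(current * 1.25),
        ApplyImmediately=False,
    )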

Networking services

  • Amazon Route 53: A highly available and scalable cloud-based DNS (Domain Name System). Route 53 health checks monitor the health and performance of your web applications, web servers, and other resources. You can set CloudWatch alarms and get automated Amazon SNS notifications when the status of your Route 53 health check changes. As an example, imagine an online payment gateway that reports the health of its platform to merchants worldwide, via a status page. This page is hosted on EC2 and fetches platform health data from DynamoDB. In this case, you could configure a CloudWatch alarm for your Route 53 health check, so that when the alarm threshold is breached, and the payment gateway is no longer considered healthy, then CloudWatch publishes this event to an SNS topic, which pushes this message to a subscribing Lambda function, which finally updates the DynamoDB table that populates the status page. This event-driven approach avoids any kind of manual update to the status page visited by merchants.

  • AWS Direct Connect (AWS DX): Makes it easy to establish a dedicated network connection from your premises to AWS, which can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections. You can monitor physical DX connections using CloudWatch alarms, and send SNS messages when alarms change their status. As an example, when a DX connection state shifts to 0 (zero), indicating that the connection is down, this event can be published to an SNS topic, which can fan out this message to impacted servers through HTTP endpoints, so that they might reroute their traffic through a different connection instead. This is an event-driven approach to connectivity resilience.

More event-driven computing on AWS

In addition to SNS, event-driven computing is also addressed by Amazon CloudWatch Events, which delivers a near real-time stream of system events that describe changes in AWS resources. With CloudWatch Events, you can route each event type to one or more targets, such as AWS Lambda functions, Amazon Kinesis streams, Amazon SQS queues, and Amazon SNS topics.

Many AWS services publish events to CloudWatch. As an example, you can get CloudWatch Events to capture events on your ETL (Extract, Transform, Load) jobs running on AWS Glue and push failed ones to an SQS queue, so that you can retry them later.


Amazon SNS is a pub/sub messaging service that can be used as an event-driven computing hub to AWS customers worldwide. By capturing events natively triggered by AWS services, such as EC2, S3 and RDS, you can automate and optimize all kinds of workflows, namely scaling, testing, encoding, profiling, broadcasting, discovery, failover, and much more. Business use cases presented in this post ranged from recruiting websites, to scientific research, geographic systems, social networks, retail websites, and news portals.

Start now by visiting Amazon SNS in the AWS Management Console, or by trying the AWS 10-Minute Tutorial, Send Fan-out Event Notifications with Amazon SNS and Amazon SQS.


B2 Cloud Storage Roundup

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/b2-cloud-storage-roundup/

B2 Integrations
Over the past several months, B2 Cloud Storage has continued to grow like we planted magic beans. During that time we have added a B2 Java SDK, and certified integrations with GoodSync, Arq, Panic, UpdraftPlus, Morro Data, QNAP, Archiware, Restic, and more. In addition, B2 customers like Panna Cooking, Sermon Audio, and Fellowship Church are happy they chose B2 as their cloud storage provider. If any of that sounds interesting, read on.

The B2 Java SDK

While the Backblaze B2 API is well documented and straightforward to implement, we were asked by a few of our Integration Partners if we had an SDK they could use. So we developed one as an open-source project on GitHub, where we hope interested parties will not only use our Java SDK, but make it better for everyone else.

There are different reasons one might use the Java SDK, but a couple of areas where the SDK can simplify the coding process are:

Expiring Authorization — B2 requires that an application key for a given account be reissued once a day when using the API. If the application key expires while you are in the middle of transferring files or some other B2 activity (bucket list, etc.), the SDK can be used to detect and then update the application key on the fly. Your B2-related activities will continue without incident and without you having to capture and code your own exception case.
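
For illustration, here is a hedged sketch of that detect-and-refresh pattern written against the raw B2 HTTP API in Python (not the Java SDK itself); the credentials are placeholders:

import requests

ACCOUNT_ID = "your-account-id"      # placeholder
APPLICATION_KEY = "your-app-key"    # placeholder

def authorize():
    # b2_authorize_account returns a fresh authorization token and API URL.
    r = requests.get(
        "https://api.backblazeb2.com/b2api/v1/b2_authorize_account",
        auth=(ACCOUNT_ID, APPLICATION_KEY))
    r.raise_for_status()
    return r.json()  # includes 'authorizationToken' and 'apiUrl'

def call_b2(session, path, payload, retried=False):
    r = requests.post(session["apiUrl"] + path, json=payload,
                      headers={"Authorization": session["authorizationToken"]})
    if r.status_code == 401 and not retried and r.json().get("code") == "expired_auth_token":
        session.update(authorize())  # token expired mid-run: refresh once
        return call_b2(session, path, payload, retried=True)
    r.raise_for_status()
    return r.json()

This is exactly the kind of bookkeeping the SDK is meant to take off your hands.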

Error Handling — There are different types of error codes B2 will return, from expired application keys to detecting malformed requests to command time-outs. The SDK can dramatically simplify the coding needed to capture and account for the various things that can happen.

While Backblaze has created the Java SDK, developers in the GitHub community have also created other SDKs for B2, for example, for PHP (https://github.com/cwhite92/b2-sdk-php) and Go (https://github.com/kurin/blazer). Let us know in the comments about other SDKs you’d like to see or perhaps start your own GitHub project. We will publish any updates in our next B2 roundup.

What You Can Do with Affordable and Available Cloud Storage

You’re probably aware that B2 is up to 75% less expensive than other similar cloud storage services like Amazon S3 and Microsoft Azure. Businesses and organizations are finding that projects that previously weren’t economically feasible with other Cloud Storage services are now not only possible, but a reality with B2. Here are a few recent examples:

SermonAudio wanted their media files to be readily available, but didn’t want to build and manage their own internal storage farm. Until B2, cloud storage was just too expensive to use. Now they use B2 to store their audio and video files, and also as the primary source of downloads and streaming requests from their subscribers.
Fellowship Church wanted to escape from the ever increasing amount of time they were spending saving their data to their LTO-based system. Using B2 saved countless hours of personnel time versus LTO, fit easily into their video processing workflow, and provided instant access at any time to their media library.
Panna Cooking replaced their closet full of archive hard drives with a cost-efficient hybrid-storage solution combining 45Drives and Backblaze B2 Cloud Storage. Archived media files that used to take hours to locate are now readily available regardless of whether they reside in local storage or in the B2 Cloud.

B2 Integrations

Leading companies in backup, archive, and sync continue to add B2 Cloud Storage as a storage destination for their customers. These companies realize that by offering B2 as an option, they can dramatically lower the total cost of ownership for their customers — and that’s always a good thing.

If your favorite application is not integrated to B2, you can do something about it. One integration partner told us they received over 200 customer requests for a B2 integration. The partner got the message and the integration is currently in beta test.

Below are some of the partner integrations completed in the past few months. You can check the B2 Partner Integrations page for a complete list.

Archiware — Both P5 Archive and P5 Backup can now store data in the B2 Cloud making your offsite media files readily available while keeping your off-site storage costs predictable and affordable.

Arq — Combine Arq and B2 for amazingly affordable backup of external drives, network drives, NAS devices, Windows PCs, Windows Servers, and Macs to the cloud.

GoodSync — Automatically synchronize and back up all your photos, music, email, and other important files between all your desktops, laptops, servers, external drives, and sync, or back up to B2 Cloud Storage for off-site storage.

QNAP — QNAP Hybrid Backup Sync consolidates backup, restoration, and synchronization functions into a single QTS application to easily transfer your data to local, remote, and cloud storage.

Morro Data — Their CloudNAS solution stores files in the cloud, caches them locally as needed, and syncs files globally among other CloudNAS systems in an organization.

Restic – Restic is a fast, secure, multi-platform command line backup program. Files are uploaded to a B2 bucket as de-duplicated, encrypted chunks. Each backup is a snapshot of only the data that has changed, making restores of a specific date or time easy.

Transmit 5 by Panic — Transmit 5, the gold standard for macOS file transfer apps, now supports B2. Upload, download, and manage files on tons of servers with an easy, familiar, and powerful UI.

UpdraftPlus — WordPress developers and admins can now use the UpdraftPlus Premium WordPress plugin to affordably back up their data to the B2 Cloud.

Getting Started with B2 Cloud Storage

If you’re using B2 today, thank you. If you’d like to try B2, but don’t know where to start, here’s a guide to getting started with the B2 Web Interface — no programming or scripting is required. You get 10 gigabytes of free storage and 1 gigabyte a day in free downloads. Give it a try.

The post B2 Cloud Storage Roundup appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

piwheels: making “pip install” fast

Post Syndicated from Ben Nuttall original https://www.raspberrypi.org/blog/piwheels/

TL;DR pip install numpy used to take ages, and now it’s super fast thanks to piwheels.

The Python Package Index (PyPI) is a package repository for Python modules. Members of the Python community publish software and libraries in it as an easy method of distribution. If you’ve ever used pip install, PyPI is the service that hosts the software you installed. You may have noticed that some installations can take a long time on the Raspberry Pi. That usually happens when modules have been implemented in C and require compilation.

XKCD comic of two people sword-fighting on office chairs while their code is compiling

No more slacking off! pip install numpy takes just a few seconds now \o/

Wheels for Python packages

A general solution to this problem exists: Python wheels are a standard for distributing pre-built versions of packages, saving users from having to build from source. However, when C code is compiled, it’s compiled for a particular architecture, so package maintainers usually publish wheels for 32-bit and 64-bit Windows, macOS, and Linux. Although Raspberry Pi runs Linux, its architecture is ARM, so Linux wheels are not compatible.

A comic of snakes biting their own tails to roll down a sand dune like wheels

What Python wheels are not

Pip works by browsing PyPI for a wheel matching the user’s architecture — and if it doesn’t find one, it falls back to the source distribution (usually a tarball or zip of the source code). Then the user has to build it themselves, which can take a long time, or may require certain dependencies. And if pip can’t find a source distribution, the installation fails.
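
To show what “matching” means here, this small sketch (mine, not pip’s actual code) pulls the compatibility tags out of a wheel filename, following the PEP 427 naming convention:

def wheel_tags(filename):
    # PEP 427: {name}-{version}(-{build})?-{python}-{abi}-{platform}.whl
    stem = filename[:-len(".whl")]
    python_tag, abi_tag, platform_tag = stem.split("-")[-3:]
    return python_tag, abi_tag, platform_tag

# An x86-64 Linux wheel won't match a Raspberry Pi, whose platform
# tag is linux_armv7l:
print(wheel_tags("numpy-1.13.3-cp35-cp35m-manylinux1_x86_64.whl"))
# -> ('cp35', 'cp35m', 'manylinux1_x86_64')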

Developing piwheels

In order to solve this problem, I decided to build wheels of every package on PyPI. I wrote some tooling for automating the process and used a postgres database to monitor the status of builds and log the output. Using a Pi 3 in my living room, I attempted to build wheels of the latest version of all 100 000 packages on PyPI and to host them on a web server on the Pi. This took a total of ten days, and my proof-of-concept seemed to show that it generally worked and was likely to be useful! You could install packages directly from the server, and installations were really fast.

A Raspberry Pi 3 sitting atop a Pi 2 on cloth

This Pi 3 was the piwheels beta server, sitting atop my SSH gateway Pi 2 at home

I proceeded to plan for version 2, which would attempt to build every version of every package — about 750 000 versions in total. I estimated this would take 75 days for one Pi, but I intended to scale it up to use multiple Pis. Our web hosts Mythic Beasts provide dedicated Pi 3 hosting, so I fired up 20 of them to share the load. With some help from Dave Jones, who created an efficient queuing system for the builders, we were able to make this run like clockwork. In under two weeks, it was complete! Read ALL about the first build run drama on my blog.

A list of the mythic beasts cloud Pis

ALL the cloud Pis

Improving piwheels

We analysed the failures, made some tweaks, installed some key dependencies, and ran the build again to raise our success rate from 76% to 83%. We also rebuilt packages for Python 3.5 (the new default in Raspbian Stretch). The wheels we build are tagged ‘armv7l’ because they are built on Pi 3 hardware, but since our Raspbian image targets ARMv6 across all Pi models, the binaries are really ARMv6-compatible, which means they work on Pi 3, Pi 2, Pi 1 and Pi Zero alike. So the ‘armv6l’-tagged wheels we provide are just the ‘armv7l’ wheels renamed.
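
That renaming is literally just a copy under a different platform tag: something like this sketch (the filename is a made-up example):

from pathlib import Path
import shutil

src = Path("numpy-1.13.3-cp35-cp35m-linux_armv7l.whl")
dst = src.with_name(src.name.replace("linux_armv7l", "linux_armv6l"))

# Same bits, different tag: the contents are already ARMv6-compatible.
shutil.copyfile(src, dst)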

The piwheels monitor interface created by Dave Jones

The wonderful piwheels monitor interface created by Dave

Now, you might be thinking “Why didn’t you just cross-compile?” I really wanted to have full compatibility, and building natively on Pis seemed to be the best way to achieve that. I had easy access to the Pis, and it really didn’t take all that long. Plus, you know, I wanted to eat my own dog food.

You might also be thinking “Why don’t you just apt install python3-numpy?” It’s true that some Python packages are distributed via the Raspbian/Debian archives too. However, if you’re in a virtual environment, or you need a more recent version than the one packaged for Debian, you need pip.

How it works

The piwheels package repository now runs as a service, hosted on a Pi 3 in the Mythic Beasts data centre in London. The pip package in Raspbian Stretch is configured to use piwheels as an additional index, so it falls back to PyPI if we’re missing a package. Just run sudo apt upgrade to get the configuration change. You’ll find that pip installs are much faster now! If you want to use piwheels on Raspbian Jessie, that’s possible too — find the instructions in our FAQs. And now, every time you pip install something, your files come from a web server running on a Raspberry Pi (that capable little machine)!
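
Under the hood, that configuration change amounts to a system-wide pip setting along these lines (the exact file Raspbian ships may differ, but the index URL matches the one pip reports below):

[global]
extra-index-url=https://www.piwheels.hostedpi.com/simple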

Try it for yourself in a virtual environment:

sudo apt install virtualenv python3-virtualenv -y
virtualenv -p /usr/bin/python3 testpip
source testpip/bin/activate
pip install numpy

This takes about 20 minutes on a Pi 3, 2.5 hours on a Pi 1, or just a few seconds on either if you use piwheels.

If you’re interested to see the details, try pip install numpy -v for verbose output. You’ll see that pip discovers two indexes to search:

2 location(s) to search for versions of numpy:
  * https://pypi.python.org/simple/numpy/
  * https://www.piwheels.hostedpi.com/simple/numpy/

Then it searches both indexes for available files. From this list of files, it determines the latest version available. Next it looks for a Python version and architecture match, and then opts for a wheel over a source distribution. If a new package or version is released, piwheels will automatically pick it up and add it to the build queue.

A flowchart of how piwheels works

How piwheels works

For users unfamiliar with virtual environments, I should mention that doing this isn’t a requirement — just an easy way of testing installations in a sandbox. Most pip usage will require sudo pip3 install {package}, which installs at the system level.

If you come across any issues with any packages from piwheels, please let us know in a GitHub issue.

Taking piwheels further

We currently provide over 670 000 wheels for more than 96 000 packages, all compiled natively on Raspberry Pi hardware. Moreover, we’ll keep building new packages as they are released.

Note that, at present, we have built wheels for Python 3.4 and 3.5 — we’re planning to add support for Python 3.6 and 2.7. The fact that piwheels is currently missing Python 2 wheels does not affect users: until we rebuild for Python 2, PyPI will be used as normal; it’ll just take longer than installing a Python 3 package for which we have a wheel. But remember, Python 2 end-of-life is less than three years away!

Many thanks to Dave Jones for his contributions to the project, and to Mythic Beasts for providing the excellent hosted Pi service.

Screenshot of the mythic beasts Raspberry Pi 3 server service website

Related/unrelated, check out my poster from the PyCon UK poster session:

A poster about Python and Raspberry Pi

Click to download the PDF!

The post piwheels: making “pip install” fast appeared first on Raspberry Pi.

Osama Bin Laden Compound Was a Piracy Hotbed, CIA Reveals

Post Syndicated from Ernesto original https://torrentfreak.com/osama-bin-laden-compound-was-a-piracy-hotbed-cia-reveals-171103/

The times when pirates were stereotyped as young men in a college dorm are long past us.

Nowadays you can find copyright infringers throughout many cultures and all layers of society.

In the past we’ve discovered ‘pirates’ in the most unusual places, from the FBI, through major record labels and the U.S. Government to the Vatican.

This week we can add another location to the list: Osama Bin Laden’s former Abbottabad compound, where he was killed on 2 May 2011.

The CIA has regularly released documents and information found on the premises. This week it added a massive treasure trove of 470,000 files, providing insight into the interests of one of the most notorious characters in recent history.

“Today’s release of recovered al-Qa‘ida letters, videos, audio files and other materials provides the opportunity for the American people to gain further insights into the plans and workings of this terrorist organization,” CIA Director Pompeo commented.

What caught our eye, however, is the material that the CIA chose not to release. This includes a host of pirated files, some more relevant than others.

For example, the computers contained pirated copies of the movies Antz, Batman Gotham Knight, Cars, Chicken Little, Ice Age: Dawn of the Dinosaurs, Home on the Range and The Three Musketeers. Since these are child-oriented titles, it’s likely they served as entertainment for the kids living in the compound.

There was also other entertainment stored on the hard drives, including the games Final Fantasy VII and Grand Theft Auto: Chinatown Wars, a Game Boy Advance emulator, porn, and anime.

Gizmodo has an overview of some of the weirdest movies, for those who are interested.

Not all content is irrelevant, though. The archive also contains files including the documentary “Where in the World is Osama bin Laden,” “CNN Presents: World’s Most Wanted,” “In the Footsteps of Bin Laden,” and “National Geographic: World’s Worst Venom.”

Or what about “National Geographic: Kung Fu Killers,” which reveals the ten deadliest Kung Fu weapons of all time, including miniature swords disguised as tobacco pipes.

There is, of course, no evidence that Osama Bin Laden watched any of these titles. Just as there’s no proof that he played any games. There were a lot of people in the compound and, while it makes for a good headline, the files are not directly tied to him.

That said, the claim that piracy supports terrorism suddenly gets a whole new meaning…

Credit: Original compound image Sajjad Ali Qureshi

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

HiveMQ 3.3.1 released

Post Syndicated from The HiveMQ Team original https://www.hivemq.com/blog/hivemq-3-3-1-released/

The HiveMQ team is pleased to announce the availability of HiveMQ 3.3.1. This is a maintenance release for the 3.3 series and brings the following improvements:

  • Increased performance for offline message queues on disk
  • Improved user experience when inspecting client details for TLS clients in the Web UI
  • Improved user experience when creating a trace recording in the Web UI
  • Improved logging for HiveMQ file persistence on Windows
  • Fixed an issue which could lead to the OnPubackReceivedCallback being executed when the inflight queue is full
  • Fixed an issue when a message is created via the plugin system and added to an offline message queue
  • Fixed an issue where the convenience methods for an SslCertificate might return null for certain extensions

You can download the new HiveMQ version here.

We recommend upgrading if you are a HiveMQ 3.3.x user.

Have a great day,
The HiveMQ Team

ExtraTorrent Uploader Groups Launch Their Own Torrent Site

Post Syndicated from Ernesto original https://torrentfreak.com/extratorrent-uploader-groups-launch-their-own-torrent-site-171031/

Earlier this year the torrent community entered a state of shock when another major torrent site suddenly closed its doors.

Having served torrents to the masses for over a decade, ExtraTorrent decided to throw in the towel, without providing any detail or an apparent motive.

“It’s time we say goodbye,” was all that ExtraTorrent operator SaM was willing to say in a brief response.

Although ExtraTorrent is no more, the site’s name lives on thanks to several uploader groups that were tied to the site. The TV and movie torrent uploaders ETTV and ETHD continued their work on sites such as The Pirate Bay and 1337x.

While this worked fine, not all former “followers” knew where to go. ETTV and ETHD torrents were still downloaded millions of times a week, but this activity was scattered around sites that didn’t really feel like their old home.

To fill this hole, ETTV, ETHD and DTOne have now joined forces and launched a torrent site of their own, simply named ETTV.tv.

“We have launched a new site to cater to our fans,” the ETTV.tv teams inform TorrentFreak.

“Since the shutdown of ExtraTorrent we were homeless and our fans and followers didn’t know where to find us. We decided, along with DTOne (formerly DDR/ICTV/DUS), to come together and make a site that can bring our fans back.”


While the groups in question had ExtraTorrent as their home, this is not a reincarnation of the defunct torrent site. ETTV.tv is operated by different people and only offers a curated selection of movie and TV uploads.

The site may expand to other categories in the future, but for now, there are no concrete plans to do so. It’s also a closed ecosystem, meaning that only a select group of trusted uploaders is allowed to add new content.

“We want to have a site for best quality torrents thus we are not taking every uploader that comes knocking our doors,” the ETTV.tv team explains. There are over 50,000 torrents on the site already but the uploaders are working on bringing their entire archive back online.

Since the site is new, there are several bugs and other issues, which the site’s operators are keeping a close eye on. They told TorrentFreak that ETTV.tv users are encouraged to report problems and offer suggestions.

The groups currently promote ETTV.tv in their upload notes on other sites and, even though the site has only been around for a few days, many close followers have found their way to it already.

During the months and years to come the groups hope to keep these people on board, add some more, and steer clear of any legal problems.

“We hope we can live up to the trust we had in the past and continue to serve our fans and followers,” ETTV.tv concludes.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Pirate-Friendly Coinhive’s DNS Hacked, User Hashes Stolen

Post Syndicated from Andy original https://torrentfreak.com/pirate-friendly-coinhives-dns-hacked-user-hashes-stolen-171025/

Just over a month ago, a Javascript cryptocurrency miner was silently added to The Pirate Bay. Noticed by users who observed their CPU usage going through the roof, it later transpired the site was trialing a miner operated by Coinhive.

Many users were disappointed that The Pirate Bay had added the Javascript-based Monero coin miner without their permission. However, it didn’t take long for people to see the potential benefits, with a raft of other sites adding the miner in the hope of generating additional revenue.

Now, however, Coinhive has an unexpected and potentially serious problem to deal with. The company has just revealed that on Monday night its DNS records maintained at Cloudflare were accessed by a third-party, allowing an unnamed attacker to redirect user mining traffic to a server they controlled.

“The DNS records for coinhive.com have been manipulated to redirect requests for the coinhive.min.js to a third party server. This third party server hosted a modified version of the JavaScript file with a hardcoded site key. This essentially let the attacker ‘steal’ hashes from our users,” Coinhive said in a statement.

The company hasn’t revealed how long the unauthorized redirect stayed in place, but it appears that all coins mined on sites hosting Coinhive’s script were ‘stolen’ during the period, instead of being credited to their accounts.

Coinhive stresses that no user account information was leaked and that its website and database servers were uncompromised. But while that’s good news, the hackers’ route into the company’s DNS provider came down to a basic security error.

Back in 2014, crowdfunding platform Kickstarter – which Coinhive used – fell victim to a security breach. After being advised of the fact by law enforcement officials, Kickstarter shut down unauthorized access, began strengthening its systems, while advising customers to do the same.

While Coinhive did respond to the warning to ensure that its data was safe, something slipped through the net. One piece of information – its Cloudflare account password – remained unchanged after the Kickstarter attack. It now seems the most likely culprit for this week’s DNS breach.

“The root cause for this incident was an insecure password for our Cloudflare account that was probably leaked with the Kickstarter data breach back in 2014,” Coinhive says.

“We have learned hard lessons about security and used 2FA and unique passwords with all services since, but we neglected to update our years old Cloudflare account.”

While not mentioning Coinhive explicitly, Kickstarter warned earlier this month that the 2014 incident may not be completely over. In an update posted on the site Oct 6, Kickstarter noted that some of its customers had recently been hearing more information about the breach from notification service Have I been pwned?.

In the meantime, Coinhive has issued an apology and indicated it will find ways to reimburse sites which have lost revenue as a result of the DNS hack.

“We’re deeply sorry about this severe oversight,” the company said. “Our current plan is to credit all sites with an additional 12 hours of their daily average hashrate. Please give us a few hours to roll this out.”

Based on earlier calculations carried out by TF, The Pirate Bay (if it was mining during the breach) could be potentially owed around $200 for the lost hashes, give or take. After turning off mining in September, the site reactivated it again in October, with no opt-out. The situation appears fluid.

While the hack is obviously a disappointment, Coinhive appears to have advised its users quickly and transparently, which under the circumstances is exactly what’s required. The fact that it’s offering compensation to users will also be welcomed.

The breach is the latest controversy to hit the company. Earlier this month, Cloudflare began banning sites which implemented Coinhive mining without informing their users. The CDN company said it considered non-advised mining as malware.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Backing Up Linux to Backblaze B2 with Duplicity and Restic

Post Syndicated from Roderick Bauer original https://www.backblaze.com/blog/backing-linux-backblaze-b2-duplicity-restic/

Linux users have a variety of options for handling data backup. The choices range from free and open-source programs to paid commercial tools, and include applications that are purely command-line based (CLI) and others that have a graphical interface (GUI), or both.

If you take a look at our Backblaze B2 Cloud Storage Integrations page, you will see a number of offerings that enable you to back up your Linux desktops and servers to Backblaze B2. These include CloudBerry, Duplicity, Duplicacy, 45 Drives, GoodSync, HashBackup, QNAP, Restic, and Rclone, plus other choices for NAS and hybrid uses.

In this post, we’ll discuss two popular command line and open-source programs: one older, Duplicity, and a new player, Restic.

Old School vs. New School

We’re highlighting Duplicity and Restic today because they exemplify two different philosophical approaches to data backup: “Old School” (Duplicity) vs “New School” (Restic).

Old School (Duplicity)

In the old school model, data is written sequentially to the storage medium. Once a section of data is recorded, new data is written starting where that section of data ends. It’s not possible to go back and change the data that’s already been written.

This old-school model has long been associated with the use of magnetic tape, a prime example of which is the LTO (Linear Tape-Open) standard. In this “write once” model, files are always appended to the end of the tape. If a file is modified and overwritten or removed from the volume, the associated tape blocks used are not freed up: they are simply marked as unavailable, and the used volume capacity is not recovered. Data is deleted and capacity recovered only if the whole tape is reformatted. As a Linux/Unix user, you undoubtedly are familiar with the TAR archive format, which is an acronym for Tape ARchive. TAR has been around since 1979 and was originally developed to write data to sequential I/O devices with no file system of their own.

It is from the use of tape that we get the full backup/incremental backup approach to backups. A backup sequence begins with a full backup of data. Each incremental backup then contains only what has changed since the previous backup (full or incremental), until the next full backup is made and the process starts over, filling more and more tape or whatever medium is being used.

This is the model used by Duplicity: full and incremental backups. Duplicity backs up files by producing encrypted, digitally signed, versioned, TAR-format volumes and uploading them to a remote location, including Backblaze B2 Cloud Storage. Released under the terms of the GNU General Public License (GPL), Duplicity is free software.

With Duplicity, the first archive is a complete (full) backup, and subsequent (incremental) backups only add differences from the latest full or incremental backup. Chains consisting of a full backup and a series of incremental backups can be recovered to the point in time that any of the incremental steps were taken. If any of the incremental backups are missing, then reconstructing a complete and current backup is much more difficult and sometimes impossible.
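
To make the chain dependency concrete, here is a toy sketch in R (illustration only; Duplicity’s real volumes are encrypted TAR files, and the names and values here are made up). Restoring to any point requires the full backup plus every incremental up to that point:

chain <- list(
  full = c(a = 1, b = 2),   # complete snapshot
  inc1 = c(b = 3),          # only what changed since 'full'
  inc2 = c(c = 4)           # only what changed since 'inc1'
)
restore <- function(chain, upto) {
  state <- chain$full
  if (upto >= 2) {
    for (step in names(chain)[2:upto]) {
      changed <- chain[[step]]
      state[names(changed)] <- changed   # replay each diff in order
    }
  }
  state
}
restore(chain, upto = 3)   # a=1, b=3, c=4

Remove 'inc1' from the chain and the restored value of b is silently wrong, which is exactly why a missing incremental can make recovery difficult or impossible.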

Duplicity is available under many Unix-like operating systems (such as Linux, BSD, and Mac OS X) and ships with many popular Linux distributions including Ubuntu, Debian, and Fedora. It also can be used with Windows under Cygwin.

We recently published a KB article on How to configure Backblaze B2 with Duplicity on Linux that demonstrates how to set up Duplicity with B2 and back up and restore a directory from Linux.

New School (Restic)

With the arrival of non-sequential storage media, such as disk drives, and new ideas such as deduplication, comes the new-school approach, which is used by Restic. Data can be written and changed anywhere on the storage medium. This efficiency comes largely through the use of deduplication. Deduplication is a process that eliminates redundant copies of data and reduces storage overhead. Data deduplication techniques ensure that only one unique instance of each piece of data is retained on storage media, greatly increasing storage efficiency and flexibility.
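
As a toy sketch of the idea (in R, with made-up chunk data; this is not Restic’s actual chunking or repository format), deduplication boils down to hashing each chunk and storing any given chunk only once:

library(digest)  # assumed to be installed; provides sha256 hashing
dedup_store <- new.env()
store_chunk <- function(chunk) {
  key <- digest(chunk, algo = "sha256")
  if (!exists(key, envir = dedup_store, inherits = FALSE)) {
    assign(key, chunk, envir = dedup_store)  # keep the first copy only
  }
  key  # a backup only needs to record the key, not the bytes again
}
keys <- vapply(c("AAAA", "BBBB", "AAAA"), store_chunk, character(1))
length(ls(dedup_store))  # 2 unique chunks stored for 3 inputs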

Restic is a newer multi-platform command-line backup program that is designed to be fast, efficient, and secure. Restic supports a variety of backends for storing backups, including a local server, SFTP server, HTTP REST server, and a number of cloud storage providers, including Backblaze B2.

Files are uploaded to a B2 bucket as deduplicated, encrypted chunks. Each time a backup runs, only changed data is backed up. On each backup run, a snapshot is created enabling restores to a specific date or time.

Restic assumes that the storage location for its repository is shared, so it always encrypts the backed up data. This is in addition to any encryption and security provided by the storage provider.

Restic is free and open-source software, licensed under the BSD 2-Clause License and actively developed on GitHub.

There’s a lot more you can do with Restic, including adding tags, mounting a repository locally, and scripting. To learn more, you can review the documentation at https://restic.readthedocs.io.

Coincidentally with this blog post, we published a KB article, How to configure Backblaze B2 with Restic on Linux, in which we show how to set up Restic for use with B2 and how to back up and restore a home directory from Linux to B2.

Which is Right for You?

While Duplicity is a popular, widely-available, and useful program, many users of cloud storage solutions such as B2 are moving to new-school solutions like Restic that take better advantage of the non-sequential access capabilities and speed of modern storage media used by cloud storage providers.

Tell us how you’re backing up Linux

Please let us know in the comments what you’re using for Linux backups, and if you have experience using Duplicity, Restic, or other backup software with Backblaze B2.

The post Backing Up Linux to Backblaze B2 with Duplicity and Restic appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Security updates for Wednesday

Post Syndicated from ris original https://lwn.net/Articles/736766/rss

Security updates have been issued by Arch Linux (kernel, linux-hardened, and linux-zen), CentOS (wpa_supplicant), Debian (xorg-server), Fedora (selinux-policy), Gentoo (libarchive, nagios-core, ruby, and xen), openSUSE (wpa_supplicant), Oracle (wpa_supplicant), Red Hat (Red Hat Single Sign-On, rh-nodejs6-nodejs, rh-sso7-keycloak, and wpa_supplicant), Scientific Linux (wpa_supplicant), SUSE (git, wpa_supplicant, and xen), and Ubuntu (xorg-server, xorg-server-hwe-16.04, xorg-server-lts-xenial).

What’s new in HiveMQ 3.3

Post Syndicated from The HiveMQ Team original https://www.hivemq.com/whats-new-in-hivemq-3-3

We are pleased to announce the release of HiveMQ 3.3. This version of HiveMQ is the most advanced and user-friendly version of HiveMQ ever. A broker is the heart of every MQTT deployment and it’s key to monitor and understand how healthy your system and your connected clients are. Version 3.3 of HiveMQ focuses on observability, usability and advanced administration features and introduces a brand new Web UI. This version is a drop-in replacement for HiveMQ 3.2 and of course supports rolling upgrades with zero downtime.

HiveMQ 3.3 brings many features that your users, administrators and plugin developers are going to love. These are the highlights:

Web UI

The new HiveMQ version has a built-in Web UI for advanced analysis and administrative tasks. A powerful dashboard shows important data about the health of the broker cluster and an overview of the whole MQTT deployment.
With the new Web UI, administrators are able to drill down to specific client information and can perform administrative actions like disconnecting a client. Advanced analytics functionality allows identifying clients with irregular behavior. It’s easy to identify message-dropping clients as HiveMQ shows detailed statistics of such misbehaving MQTT participants.
Of course all Web UI features work at scale with more than a million connected MQTT clients. Learn more about the Web UI in the documentation.

Time To Live

HiveMQ introduces Time to Live (TTL) on various levels of the MQTT lifecycle. Automatic cleanup of expired messages is supported, as is the wiping of abandoned persistent MQTT sessions. In particular, version 3.3 implements the following TTL features:

  • MQTT client session expiration
  • Retained Message expiration
  • MQTT PUBLISH message expiration

Configuring a TTL for MQTT client sessions and retained messages allows freeing system resources without manual administrative intervention as soon as the data is not needed anymore.
Besides the global configuration, individual MQTT PUBLISH messages can have their own TTLs based on application-specific characteristics. It’s a breeze to change the TTL of particular messages with the HiveMQ plugin system. As soon as a message TTL expires, the broker won’t send out the message anymore, even if the message was previously queued or in-flight. This can save precious bandwidth for mobile connections as unnecessary traffic is avoided for expired messages.

Trace Recordings

Debugging specific MQTT clients or groups of MQTT clients can be challenging at scale. HiveMQ 3.3 introduces an innovative Trace Recording mechanism that allows creating detailed recordings of all client interactions with given filters.
It’s possible to filter based on client identifiers, MQTT message types and topics. Best of all, you can use regular expressions to select multiple MQTT clients at once, as well as topics with complex structures. Getting detailed information about the behavior of specific MQTT clients for debugging complex issues has never been easier.

Native SSL

The new native SSL integration of HiveMQ brings a performance boost of more than 40% for SSL handshakes (in terms of CPU usage) by utilizing an integration with BoringSSL. BoringSSL is Google’s fork of OpenSSL, which is also used in Google Chrome and Android. Besides the compute and memory optimizations (saving up to 60% of Java heap), HiveMQ now supports additional secure, state-of-the-art cipher suites that are not directly available in Java (such as ChaCha20-Poly1305).
Most HiveMQ deployments on Linux systems are expected to see decreased CPU load on TLS handshakes with the native SSL integration and huge memory improvements.

New Plugin System Features

The popular and powerful plugin system has received additional services and callbacks which are useful for many existing and future plugins.
Plugin developers can now use a ConnectionAttributeStore and a SessionAttributeStore for storing arbitrary data for the lifetime of a single MQTT connection of a client or for the whole session of a client. The new ClientGroupService allows grouping different MQTT client identifiers by the same key, so it’s easy to address multiple MQTT clients (with the same group) at once.

A new callback was introduced that notifies a plugin when a HiveMQ instance is ready, meaning the instance is part of the cluster and all listeners have started successfully. Developers can now also react, via a dedicated callback, when an MQTT client session is ready and usable in the cluster.

Some use cases require modifying an MQTT PUBLISH packet before it’s sent out to a client. This is now possible with a new callback that was introduced for modifying a PUBLISH before sending it out to an individual client.
The offline queue size for persistent clients, as well as the queue discard strategy, is now also configurable for individual clients.

Additional Features

HiveMQ 3.3 has many additional features designed for power users and professional MQTT deployments. The new version also has the following highlights:

  • OCSP Stapling
  • Event Log for MQTT client connects, disconnects and unusual events (e.g. messages discarded due to slow consumption on the client side)
  • Throttling of concurrent TLS handshakes
  • Connect Packet overload protection
  • Configuration of Socket send and receive buffer sizes
  • Global System Information like the HiveMQ Home folder can now be set via Environment Variables without changing the run script
  • The internal HTTP server of HiveMQ is now exposed to the holistic monitoring subsystem
  • Many additional useful metrics were exposed to HiveMQ’s monitoring subsystem


In order to upgrade to HiveMQ 3.3 from HiveMQ 3.2 or older versions, take a look at our Upgrade Guide.
Don’t forget to learn more about all the new features with our HiveMQ User Guide.

Download HiveMQ 3.3 now

Predict Billboard Top 10 Hits Using RStudio, H2O and Amazon Athena

Post Syndicated from Gopal Wunnava original https://aws.amazon.com/blogs/big-data/predict-billboard-top-10-hits-using-rstudio-h2o-and-amazon-athena/

Success in the popular music industry is typically measured in terms of the number of Top 10 hits artists have to their credit. The music industry is a highly competitive multi-billion dollar business, and record labels incur various costs in exchange for a percentage of the profits from sales and concert tickets.

Predicting the success of an artist’s release in the popular music industry can be difficult. One release may be extremely popular, resulting in widespread play on TV, radio and social media, while another single may turn out quite unpopular, and therefore unprofitable. Record labels need to be selective in their decision making, and predictive analytics can help them with decision making around the type of songs and artists they need to promote.

In this walkthrough, you leverage H2O.ai, Amazon Athena, and RStudio to make predictions on whether a song might make it to the Top 10 Billboard charts. You explore the GLM, GBM, and deep learning modeling techniques using H2O’s rapid, distributed and easy-to-use open source parallel processing engine. RStudio is a popular IDE, licensed either commercially or under AGPLv3, for working with R. This is ideal if you don’t want to connect to a server via SSH and use code editors such as vi to do analytics. RStudio is available in a desktop version, or a server version that allows you to access R via a web browser. RStudio’s Notebooks feature is used to demonstrate the execution of code and output. In addition, this post showcases how you can leverage Athena for query and interactive analysis during the modeling phase. A working knowledge of statistics and machine learning would be helpful to interpret the analysis being performed in this post.


Your goal is to predict whether a song will make it to the Top 10 Billboard charts. For this purpose, you will be using multiple modeling techniques―namely GLM, GBM and deep learning―and choose the model that is the best fit.

This solution involves the following steps:

  • Install and configure RStudio with Athena
  • Log in to RStudio
  • Install R packages
  • Connect to Athena
  • Create a dataset
  • Create models

Install and configure RStudio with Athena

Use the following AWS CloudFormation stack to install, configure, and connect RStudio on an Amazon EC2 instance with Athena.

Launching this stack creates all required resources and prerequisites:

  • Amazon EC2 instance with Amazon Linux (minimum size of t2.large is recommended)
  • Provisioning of the EC2 instance in an existing VPC and public subnet
  • Installation of Java 8
  • Assignment of an IAM role to the EC2 instance with the required permissions for accessing Athena and Amazon S3
  • Security group allowing access to the RStudio and SSH ports from the internet (I recommend restricting access to these ports)
  • S3 staging bucket required for Athena (referenced within RStudio as ATHENABUCKET)
  • RStudio username and password
  • Setup logs in Amazon CloudWatch Logs (if needed for additional troubleshooting)
  • Amazon EC2 Systems Manager agent, which makes it easy to manage and patch the instance

All AWS resources are created in the US-East-1 Region. To avoid cross-region data transfer fees, launch the CloudFormation stack in the same region. To check the availability of Athena in other regions, see Region Table.

Log in to RStudio

The instance security group has been automatically configured to allow incoming connections on the RStudio port 8787 from any source internet address. You can edit the security group to restrict source IP access. If you have trouble connecting, ensure that port 8787 isn’t blocked by subnet network ACLS or by your outgoing proxy/firewall.

  1. In the CloudFormation stack, choose Outputs, Value, and then open the RStudio URL. You might need to wait for a few minutes until the instance has been launched.
  2. Log in to RStudio with the username and password you provided during setup.

Install R packages

Next, install the required R packages from the RStudio console. You can download the R notebook file containing just the code.

#install pacman – a handy package manager for managing installs
if("pacman" %in% rownames(installed.packages()) == FALSE)
{install.packages("pacman")}
#load the packages used in this walkthrough (this package list is an assumption; adjust as needed)
pacman::p_load(RJDBC, DBI, dplyr, h2o, awsjavasdk)

#initialize the H2O engine, using all available cores
h2o.init(nthreads = -1)
##  Connection successful!
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 hours 42 minutes 
##     H2O cluster version: 
##     H2O cluster version age:    4 months and 4 days !!! 
##     H2O cluster name:           H2O_started_from_R_rstudio_hjx881 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.30 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 3.3.3 (2017-03-06)
## Warning in h2o.clusterInfo(): 
## Your H2O cluster version is too old (4 months and 4 days)!
## Please download and install the latest version from http://h2o.ai/download/
#install aws sdk if not present (pre-requisite for using Athena with an IAM role)
if (!aws_sdk_present()) {
  install_aws_sdk()
}
load_sdk()


Connect to Athena

Next, establish a connection to Athena from RStudio, using an IAM role associated with your EC2 instance. Use ATHENABUCKET to specify the S3 staging directory.

URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.1.jar'
fil <- basename(URL)
#download the file into current working directory
if (!file.exists(fil)) download.file(URL, fil)
#verify that the file has been downloaded successfully
## [1] "AthenaJDBC41-1.0.1.jar"
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")

con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
                                   s3_staging_dir=Sys.getenv("ATHENABUCKET"),  #staging bucket created by the CloudFormation stack
                                   schema_name="sampledb")                     #schema name assumed; match your Athena database

Verify the connection. The results returned depend on your specific Athena setup.

con
## <JDBCConnection>
dbListTables(con)
##  [1] "gdelt"               "wikistats"           "elb_logs_raw_native"
##  [4] "twitter"             "twitter2"            "usermovieratings"   
##  [7] "eventcodes"          "events"              "billboard"          
## [10] "billboardtop10"      "elb_logs"            "gdelthist"          
## [13] "gdeltmaster"         "twitter"             "twitter3"

Create a dataset

For this analysis, you use a sample dataset combining information from Billboard and Wikipedia with Echo Nest data in the Million Songs Dataset. Upload this dataset into your own S3 bucket. The table below provides a description of the fields used in this dataset.

Field: Description
year: Year that song was released
songtitle: Title of the song
artistname: Name of the song artist
songid: Unique identifier for the song
artistid: Unique identifier for the song artist
timesignature: Variable estimating the time signature of the song
timesignature_confidence: Confidence in the estimate for the timesignature
loudness: Continuous variable indicating the average amplitude of the audio in decibels
tempo: Variable indicating the estimated beats per minute of the song
tempo_confidence: Confidence in the estimate for tempo
key: Variable with twelve levels indicating the estimated key of the song (C, C#, …, B)
key_confidence: Confidence in the estimate for key
energy: Variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
pitch: Continuous variable that indicates the pitch of the song
timbre_0_min thru timbre_11_min: Variables that indicate the minimum values over all segments for each of the twelve values in the timbre vector
timbre_0_max thru timbre_11_max: Variables that indicate the maximum values over all segments for each of the twelve values in the timbre vector
top10: Indicator for whether or not the song made it to the Top 10 of the Billboard charts (1 if it was in the top 10, and 0 if not)

Create an Athena table based on the dataset

In the Athena console, select the default database, sampledb, or create a new database.

Run the following create table statement.

create external table if not exists billboard (
year int,
songtitle string,
artistname string,
songID string,
artistID string,
timesignature int,
timesignature_confidence double,
loudness double,
tempo double,
tempo_confidence double,
key int,
key_confidence double,
energy double,
pitch double,
timbre_0_min double,
timbre_0_max double,
timbre_1_min double,
timbre_1_max double,
timbre_2_min double,
timbre_2_max double,
timbre_3_min double,
timbre_3_max double,
timbre_4_min double,
timbre_4_max double,
timbre_5_min double,
timbre_5_max double,
timbre_6_min double,
timbre_6_max double,
timbre_7_min double,
timbre_7_max double,
timbre_8_min double,
timbre_8_max double,
timbre_9_min double,
timbre_9_max double,
timbre_10_min double,
timbre_10_max double,
timbre_11_min double,
timbre_11_max double,
Top10 int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://aws-bigdata-blog/artifacts/predict-billboard/data'

Inspect the table definition for the ‘billboard’ table that you have created. If you chose a database other than sampledb, replace that value with your choice.

dbGetQuery(con, "show create table sampledb.billboard")
##                                      createtab_stmt
## 1       CREATE EXTERNAL TABLE `sampledb.billboard`(
## 2                                       `year` int,
## 3                               `songtitle` string,
## 4                              `artistname` string,
## 5                                  `songid` string,
## 6                                `artistid` string,
## 7                              `timesignature` int,
## 8                `timesignature_confidence` double,
## 9                                `loudness` double,
## 10                                  `tempo` double,
## 11                       `tempo_confidence` double,
## 12                                       `key` int,
## 13                         `key_confidence` double,
## 14                                 `energy` double,
## 15                                  `pitch` double,
## 16                           `timbre_0_min` double,
## 17                           `timbre_0_max` double,
## 18                           `timbre_1_min` double,
## 19                           `timbre_1_max` double,
## 20                           `timbre_2_min` double,
## 21                           `timbre_2_max` double,
## 22                           `timbre_3_min` double,
## 23                           `timbre_3_max` double,
## 24                           `timbre_4_min` double,
## 25                           `timbre_4_max` double,
## 26                           `timbre_5_min` double,
## 27                           `timbre_5_max` double,
## 28                           `timbre_6_min` double,
## 29                           `timbre_6_max` double,
## 30                           `timbre_7_min` double,
## 31                           `timbre_7_max` double,
## 32                           `timbre_8_min` double,
## 33                           `timbre_8_max` double,
## 34                           `timbre_9_min` double,
## 35                           `timbre_9_max` double,
## 36                          `timbre_10_min` double,
## 37                          `timbre_10_max` double,
## 38                          `timbre_11_min` double,
## 39                          `timbre_11_max` double,
## 40                                     `top10` int)
## 41                             ROW FORMAT DELIMITED 
## 42                         FIELDS TERMINATED BY ',' 
## 43                            STORED AS INPUTFORMAT 
## 44       'org.apache.hadoop.mapred.TextInputFormat' 
## 45                                     OUTPUTFORMAT 
## 46  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
## 47                                        LOCATION
## 48    's3://aws-bigdata-blog/artifacts/predict-billboard/data'
## 49                                  TBLPROPERTIES (
## 50            'transient_lastDdlTime'='1505484133')

Run a sample query

Next, run a sample query to obtain a list of all songs from Janet Jackson that made it to the Billboard Top 10 charts.

dbGetQuery(con, " SELECT songtitle,artistname,top10   FROM sampledb.billboard WHERE lower(artistname) =     'janet jackson' AND top10 = 1")
##                       songtitle    artistname top10
## 1                       Runaway Janet Jackson     1
## 2               Because Of Love Janet Jackson     1
## 3                         Again Janet Jackson     1
## 4                            If Janet Jackson     1
## 5  Love Will Never Do (Without You) Janet Jackson 1
## 6                     Black Cat Janet Jackson     1
## 7               Come Back To Me Janet Jackson     1
## 8                       Alright Janet Jackson     1
## 9                      Escapade Janet Jackson     1
## 10                Rhythm Nation Janet Jackson     1

Determine how many songs in this dataset are specifically from the year 2010.

dbGetQuery(con, " SELECT count(*)   FROM sampledb.billboard WHERE year = 2010")
##   _col0
## 1   373

The sample dataset provides certain song properties of interest that can be analyzed to gauge the impact to the song’s overall popularity. Look at one such property, timesignature, and determine the value that is the most frequent among songs in the database. Timesignature is a measure of the number of beats and the type of note involved.

Running the query directly may result in an error, as shown in the commented lines below. This error is a result of trying to retrieve a large result set over a JDBC connection, which can cause out-of-memory issues at the client level. To address this, reduce the fetch size and run again.

#t<-dbGetQuery(con, " SELECT timesignature FROM sampledb.billboard")
#Note:  Running the preceding query results in the following error: 
#Error in .jcall(rp, "I", "fetch", stride, block): java.sql.SQLException: The requested #fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try #again. Refer to the Athena documentation for valid fetchSize values.
# Use the dbSendQuery function, reduce the fetch size, and run again
r <- dbSendQuery(con, " SELECT timesignature     FROM sampledb.billboard")
dftimesignature<- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
table(dftimesignature)
## dftimesignature
##    0    1    3    4    5    7 
##   10  143  503 6787  112   19
nrow(dftimesignature)
## [1] 7574

From the results, observe that 6787 songs have a timesignature of 4.

Next, determine the song with the highest tempo.

dbGetQuery(con, " SELECT songtitle,artistname,tempo   FROM sampledb.billboard WHERE tempo = (SELECT max(tempo) FROM sampledb.billboard) ")
##                   songtitle      artistname   tempo
## 1 Wanna Be Startin' Somethin' Michael Jackson 244.307

Create the training dataset

Your model needs to be trained such that it can learn and make accurate predictions. Split the data into training and test datasets, and create the training dataset first.  This dataset contains all observations from the year 2009 and earlier. You may face the same JDBC connection issue pointed out earlier, so this query uses a fetch size.

#BillboardTrain <- dbGetQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
#Running the preceding query results in the following error:-
#Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", : Unable to retrieve #JDBC result set for SELECT * FROM sampledb.billboard WHERE year <= 2009 (Internal error)
#Follow the same approach as before to address this issue.

r <- dbSendQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
BillboardTrain <- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
head(BillboardTrain, 2)
##   year           songtitle artistname timesignature
## 1 2009 The Awkward Goodbye    Athlete             3
## 2 2009        Rubik's Cube    Athlete             3
##   timesignature_confidence loudness   tempo tempo_confidence
## 1                    0.732   -6.320  89.614   0.652
## 2                    0.906   -9.541 117.742   0.542
nrow(BillboardTrain)
## [1] 7201

Create the test dataset

BillboardTest <- dbGetQuery(con, "SELECT * FROM sampledb.billboard where year = 2010")
head(BillboardTest, 2)
##   year              songtitle        artistname key
## 1 2010 This Is the House That Doubt Built A Day to Remember  11
## 2 2010        Sticks & Bricks A Day to Remember  10
##   key_confidence    energy pitch timbre_0_min
## 1          0.453 0.9666556 0.024        0.002
## 2          0.469 0.9847095 0.025        0.000
nrow(BillboardTest)
## [1] 373

Convert the training and test datasets into H2O dataframes

train.h2o <- as.h2o(BillboardTrain)
  |                                                                 |   0%
  |=================================================================| 100%
test.h2o <- as.h2o(BillboardTest)
  |                                                                 |   0%
  |=================================================================| 100%

Inspect the column names in your H2O dataframes.

##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"

Create models

You need to designate the independent and dependent variables prior to applying your modeling algorithms. Because you’re trying to predict the ‘top10’ field, this would be your dependent variable and everything else would be independent.

Create your first model using GLM. Because GLM works best with numeric data, you create your model by dropping non-numeric variables. You only use the variables in the dataset that describe the numerical attributes of the song in the logistic regression model. You won’t use these variables:  “year”, “songtitle”, “artistname”, “songid”, or “artistid”.

y.dep <- 39
x.indep <- c(6:38)
##  [1]  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [24] 29 30 31 32 33 34 35 36 37 38

Create Model 1: All numeric variables

Create Model 1 with the training dataset, using GLM as the modeling algorithm and H2O’s built-in h2o.glm function.

modelh1 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
  |                                                                 |   0%
  |=====                                                            |   8%
  |=================================================================| 100%

Measure the performance of Model 1, using H2O’s built-in performance function.

perfh1 <- h2o.performance(modelh1, test.h2o)  #the name 'perfh1' is assumed here
perfh1
## H2OBinomialMetrics: glm
## MSE:  0.09924684
## RMSE:  0.3150347
## LogLoss:  0.3220267
## Mean Per-Class Error:  0.2380168
## AUC:  0.8431394
## Gini:  0.6862787
## R^2:  0.254663
## Null Deviance:  326.0801
## Residual Deviance:  240.2319
## AIC:  308.2319
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error     Rate
## 0      255  59 0.187898  =59/314
## 1       17  42 0.288136   =17/59
## Totals 272 101 0.203753  =76/373
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.192772 0.525000 100
## 2                       max f2  0.124912 0.650510 155
## 3                 max f0point5  0.416258 0.612903  23
## 4                 max accuracy  0.416258 0.879357  23
## 5                max precision  0.813396 1.000000   0
## 6                   max recall  0.037579 1.000000 282
## 7              max specificity  0.813396 1.000000   0
## 8             max absolute_mcc  0.416258 0.455251  23
## 9   max min_per_class_accuracy  0.161402 0.738854 125
## 10 max mean_per_class_accuracy  0.124912 0.765006 155
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or ` 
h2o.auc(perfh1)
## [1] 0.8431394

The AUC metric provides insight into how well the classifier is able to separate the two classes. In this case, the value of 0.8431394 indicates that the classification is good. (A value of 0.5 indicates a worthless test, while a value of 1.0 indicates a perfect test.)
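
One way to build intuition for AUC (a sketch with made-up scores, not output from this model): it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one.

set.seed(42)
pos <- runif(50, 0.3, 0.9)    # hypothetical scores for Top 10 hits
neg <- runif(300, 0.0, 0.6)   # hypothetical scores for non-hits
mean(outer(pos, neg, ">"))    # fraction of pos/neg pairs ranked correctly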

Next, inspect the coefficients of the variables in the dataset.

dfmodelh1 <- as.data.frame(h2o.varimp(modelh1))
##                       names coefficients sign
## 1              timbre_0_max  1.290938663  NEG
## 2                  loudness  1.262941934  POS
## 3                     pitch  0.616995941  NEG
## 4              timbre_1_min  0.422323735  POS
## 5              timbre_6_min  0.349016024  NEG
## 6                    energy  0.348092062  NEG
## 7             timbre_11_min  0.307331997  NEG
## 8              timbre_3_max  0.302225619  NEG
## 9             timbre_11_max  0.243632060  POS
## 10             timbre_4_min  0.224233951  POS
## 11             timbre_4_max  0.204134342  POS
## 12             timbre_5_min  0.199149324  NEG
## 13             timbre_0_min  0.195147119  POS
## 14 timesignature_confidence  0.179973904  POS
## 15         tempo_confidence  0.144242598  POS
## 16            timbre_10_max  0.137644568  POS
## 17             timbre_7_min  0.126995955  NEG
## 18            timbre_10_min  0.123851179  POS
## 19             timbre_7_max  0.100031481  NEG
## 20             timbre_2_min  0.096127636  NEG
## 21           key_confidence  0.083115820  POS
## 22             timbre_6_max  0.073712419  POS
## 23            timesignature  0.067241917  POS
## 24             timbre_8_min  0.061301881  POS
## 25             timbre_8_max  0.060041698  POS
## 26                      key  0.056158445  POS
## 27             timbre_3_min  0.050825116  POS
## 28             timbre_9_max  0.033733561  POS
## 29             timbre_2_max  0.030939072  POS
## 30             timbre_9_min  0.020708113  POS
## 31             timbre_1_max  0.014228818  NEG
## 32                    tempo  0.008199861  POS
## 33             timbre_5_max  0.004837870  POS
## 34                                    NA <NA>

Typically, songs with heavier instrumentation tend to be louder (have higher values in the variable “loudness”) and more energetic (have higher values in the variable “energy”). This knowledge is helpful for interpreting the modeling results.

You can make the following observations from the results:

  • The coefficient estimates for the confidence values associated with the time signature, key, and tempo variables are positive. This suggests that higher confidence leads to a higher predicted probability of a Top 10 hit.
  • The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs with heavier instrumentation.
  • The coefficient estimate for energy is negative, meaning that mainstream listeners prefer songs that are less energetic, which are those songs with light instrumentation.

These coefficients lead to contradictory conclusions for Model 1. This could be due to multicollinearity issues. Inspect the correlation between the variables “loudness” and “energy” in the training set.

cor(BillboardTrain$loudness, BillboardTrain$energy)
## [1] 0.7399067

This number indicates that these two variables are highly correlated, and Model 1 does indeed suffer from multicollinearity. Typically, a correlation between -1.0 and -0.5 or between 0.5 and 1.0 indicates strong correlation, while a value between -0.1 and 0.1 indicates weak correlation. To avoid this correlation issue, omit one of these two variables and re-create the models.
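
Loudness and energy may not be the only correlated pair. A quick scan of all numeric predictor pairs (a sketch using base R’s cor() on the BillboardTrain data frame from earlier; the 0.5 cutoff is arbitrary) can flag other candidates:

num_vars <- BillboardTrain[, 6:38]                # numeric song attributes
cor_mat  <- cor(num_vars, use = "complete.obs")
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- NA    # keep each pair once
high <- which(abs(cor_mat) > 0.5, arr.ind = TRUE)
data.frame(var1 = rownames(cor_mat)[high[, 1]],
           var2 = colnames(cor_mat)[high[, 2]],
           corr = round(cor_mat[high], 3))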

You build two variations of the original model:

  • Model 2, in which you keep “energy” and omit “loudness”
  • Model 3, in which you keep “loudness” and omit “energy”

You compare these two models and choose the model with a better fit for this use case.

Create Model 2: Keep energy and omit loudness

##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:7,9:38)
##  [1]  6  7  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh2 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
  |                                                                 |   0%
  |=======                                                          |  10%
  |=================================================================| 100%

Measure the performance of Model 2.

perfh2 <- h2o.performance(modelh2, test.h2o)
perfh2
## H2OBinomialMetrics: glm
## MSE:  0.09922606
## RMSE:  0.3150017
## LogLoss:  0.3228213
## Mean Per-Class Error:  0.2490554
## AUC:  0.8431933
## Gini:  0.6863867
## R^2:  0.2548191
## Null Deviance:  326.0801
## Residual Deviance:  240.8247
## AIC:  306.8247
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      280 34 0.108280  =34/314
## 1       23 36 0.389831   =23/59
## Totals 303 70 0.152815  =57/373
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.254391 0.558140  69
## 2                       max f2  0.113031 0.647208 157
## 3                 max f0point5  0.413999 0.596026  22
## 4                 max accuracy  0.446250 0.876676  18
## 5                max precision  0.811739 1.000000   0
## 6                   max recall  0.037682 1.000000 283
## 7              max specificity  0.811739 1.000000   0
## 8             max absolute_mcc  0.254391 0.469060  69
## 9   max min_per_class_accuracy  0.141051 0.716561 131
## 10 max mean_per_class_accuracy  0.113031 0.761821 157
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh2 <- as.data.frame(h2o.varimp(modelh2))
##                       names coefficients sign
## 1                     pitch  0.700331511  NEG
## 2              timbre_1_min  0.510270513  POS
## 3              timbre_0_max  0.402059546  NEG
## 4              timbre_6_min  0.333316236  NEG
## 5             timbre_11_min  0.331647383  NEG
## 6              timbre_3_max  0.252425901  NEG
## 7             timbre_11_max  0.227500308  POS
## 8              timbre_4_max  0.210663865  POS
## 9              timbre_0_min  0.208516163  POS
## 10             timbre_5_min  0.202748055  NEG
## 11             timbre_4_min  0.197246582  POS
## 12            timbre_10_max  0.172729619  POS
## 13         tempo_confidence  0.167523934  POS
## 14 timesignature_confidence  0.167398830  POS
## 15             timbre_7_min  0.142450727  NEG
## 16             timbre_8_max  0.093377516  POS
## 17            timbre_10_min  0.090333426  POS
## 18            timesignature  0.085851625  POS
## 19             timbre_7_max  0.083948442  NEG
## 20           key_confidence  0.079657073  POS
## 21             timbre_6_max  0.076426046  POS
## 22             timbre_2_min  0.071957831  NEG
## 23             timbre_9_max  0.071393189  POS
## 24             timbre_8_min  0.070225578  POS
## 25                      key  0.061394702  POS
## 26             timbre_3_min  0.048384697  POS
## 27             timbre_1_max  0.044721121  NEG
## 28                   energy  0.039698433  POS
## 29             timbre_5_max  0.039469064  POS
## 30             timbre_2_max  0.018461133  POS
## 31                    tempo  0.013279926  POS
## 32             timbre_9_min  0.005282143  NEG
## 33                                    NA <NA>

h2o.auc(perfh2)
## [1] 0.8431933

You can make the following observations:

  • The AUC metric is 0.8431933.
  • Inspecting the coefficient of the variable energy, Model 2 suggests that songs with high energy levels tend to be more popular. This is in line with expectations.
  • As H2O orders variables by significance, the variable energy is not significant in this model.

You can conclude that Model 2 is not ideal for this use case, as energy is not significant.

Create Model 3: Keep loudness but omit energy

##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:12,14:38)
##  [1]  6  7  8  9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh3 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
  |                                                                 |   0%
  |========                                                         |  12%
  |=================================================================| 100%
perfh3 <- h2o.performance(modelh3, test.h2o)
perfh3
## H2OBinomialMetrics: glm
## MSE:  0.0978859
## RMSE:  0.3128672
## LogLoss:  0.3178367
## Mean Per-Class Error:  0.264925
## AUC:  0.8492389
## Gini:  0.6984778
## R^2:  0.2648836
## Null Deviance:  326.0801
## Residual Deviance:  237.1062
## AIC:  303.1062
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      286 28 0.089172  =28/314
## 1       26 33 0.440678   =26/59
## Totals 312 61 0.144772  =54/373
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.273799 0.550000  60
## 2                       max f2  0.125503 0.663265 155
## 3                 max f0point5  0.435479 0.628931  24
## 4                 max accuracy  0.435479 0.882038  24
## 5                max precision  0.821606 1.000000   0
## 6                   max recall  0.038328 1.000000 280
## 7              max specificity  0.821606 1.000000   0
## 8             max absolute_mcc  0.435479 0.471426  24
## 9   max min_per_class_accuracy  0.173693 0.745763 120
## 10 max mean_per_class_accuracy  0.125503 0.775073 155
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh3 <- as.data.frame(h2o.varimp(modelh3))
##                       names coefficients sign
## 1              timbre_0_max 1.216621e+00  NEG
## 2                  loudness 9.780973e-01  POS
## 3                     pitch 7.249788e-01  NEG
## 4              timbre_1_min 3.891197e-01  POS
## 5              timbre_6_min 3.689193e-01  NEG
## 6             timbre_11_min 3.086673e-01  NEG
## 7              timbre_3_max 3.025593e-01  NEG
## 8             timbre_11_max 2.459081e-01  POS
## 9              timbre_4_min 2.379749e-01  POS
## 10             timbre_4_max 2.157627e-01  POS
## 11             timbre_0_min 1.859531e-01  POS
## 12             timbre_5_min 1.846128e-01  NEG
## 13 timesignature_confidence 1.729658e-01  POS
## 14             timbre_7_min 1.431871e-01  NEG
## 15            timbre_10_max 1.366703e-01  POS
## 16            timbre_10_min 1.215954e-01  POS
## 17         tempo_confidence 1.183698e-01  POS
## 18             timbre_2_min 1.019149e-01  NEG
## 19           key_confidence 9.109701e-02  POS
## 20             timbre_7_max 8.987908e-02  NEG
## 21             timbre_6_max 6.935132e-02  POS
## 22             timbre_8_max 6.878241e-02  POS
## 23            timesignature 6.120105e-02  POS
## 24                      key 5.814805e-02  POS
## 25             timbre_8_min 5.759228e-02  POS
## 26             timbre_1_max 2.930285e-02  NEG
## 27             timbre_9_max 2.843755e-02  POS
## 28             timbre_3_min 2.380245e-02  POS
## 29             timbre_2_max 1.917035e-02  POS
## 30             timbre_5_max 1.715813e-02  POS
## 31                    tempo 1.364418e-02  NEG
## 32             timbre_9_min 8.463143e-05  NEG
## 33                                    NA <NA>
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501855569251422. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.2033898
## [1] 0.8492389

You can make the following observations:

  • The AUC metric is 0.8492389.
  • From the confusion matrix, the model correctly predicts that 33 songs will be Top 10 hits (true positives). However, it has 28 false positives (songs that the model predicted would be Top 10 hits, but that ended up not being Top 10 hits).
  • Loudness has a positive coefficient estimate, meaning that this model predicts that songs with heavier instrumentation tend to be more popular. This is the same conclusion from Model 2.
  • Loudness is significant in this model.

Overall, Model 3 predicts a higher number of top 10 hits with an accuracy rate that is acceptable. To choose the best fit for production runs, record labels should consider the following factors:

  • Desired model accuracy at a given threshold
  • Number of correct predictions for top10 hits
  • Tolerable number of false positives or false negatives

Next, make predictions using Model 3 on the test dataset.

predict.regh <- h2o.predict(modelh3, test.h2o)
  |                                                                 |   0%
  |=================================================================| 100%
##   predict        p0          p1
## 1       0 0.9654739 0.034526052
## 2       0 0.9654748 0.034525236
## 3       0 0.9635547 0.036445318
## 4       0 0.9343579 0.065642149
## 5       0 0.9978334 0.002166601
## 6       0 0.9779949 0.022005078
## [373 rows x 3 columns]
##   predict
## 1       0
## 2       0
## 3       0
## 4       0
## 5       0
## 6       0
## [373 rows x 1 column]
# Convert the predictions to a local data frame (implied by the use of dpr below)
dpr <- as.data.frame(predict.regh)
# Rename the predicted column
colnames(dpr)[colnames(dpr) == 'predict'] <- 'predict_top10'
# Count how many songs the model labels as Top 10 hits
table(dpr$predict_top10)
##   0   1 
## 312  61

The first set of output results specifies the probabilities associated with each predicted observation. For example, observation 1 has a 96.5% probability of not being a Top 10 hit and a 3.5% probability of being one (predict=1 indicates a Top 10 hit; predict=0 indicates not a Top 10 hit). The second set of results lists the actual predictions made. From the third set of results, this model predicts that 61 songs will be Top 10 hits.
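As the earlier threshold warning suggests, you can also apply your own cutoff to the p1 probability column instead of relying on the F1-optimal threshold. A minimal sketch, reusing the dpr data frame from above with an arbitrary 0.5 cutoff:

# Classify a song as a Top 10 hit when its predicted probability exceeds 0.5
dpr$predict_custom <- ifelse(dpr$p1 >= 0.5, 1, 0)
table(dpr$predict_custom)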

Compute the baseline accuracy by assuming that the baseline model always predicts the most frequent outcome, which is that a song is not a Top 10 hit.
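A minimal sketch of that computation, reusing the test.h2o frame from above (the top10 column name is an assumption):

# Tabulate the actual outcomes; the majority class becomes the baseline prediction
table(as.data.frame(test.h2o)$top10)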

##   0   1 
## 314  59

Now observe that the baseline model would get 314 observations correct, and 59 wrong, for an accuracy of 314/(314+59) = 0.8418231.

It seems that Model 3, with an accuracy of 0.8552, provides you with a small improvement over the baseline model. But is this model useful for record labels?

View the two models from an investment perspective:

  • A production company is interested in investing in songs that are more likely to make it to the Top 10. The company’s objective is to minimize the risk of financial losses attributed to investing in songs that end up unpopular.
  • How many songs does Model 3 correctly predict as a Top 10 hit in 2010? Looking at the confusion matrix, you see that it correctly predicts 33 Top 10 hits at an optimal threshold, which is more than half of the 59 songs that actually reached the Top 10.
  • It will be more useful to the record label if you can provide the production company with a list of songs that are highly likely to end up in the Top 10.
  • The baseline model is not useful, as it simply does not label any song as a hit.

Considering the three models built so far, you can conclude that Model 3 proves to be the best investment choice for the record label.

GBM model

H2O provides you with the ability to explore other learning models, such as GBM and deep learning. Explore building a model using the GBM technique, using the built-in h2o.gbm function.

Before you do this, you need to convert the target variable to a factor for multinomial classification techniques.
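A one-line sketch of that conversion, reusing the y.dep column selector from the model calls (if your target is referenced by name instead, substitute it accordingly):

# Convert the target column to a factor so H2O treats this as classification
train.h2o[, y.dep] <- as.factor(train.h2o[, y.dep])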

gbm.modelh <- h2o.gbm(y = y.dep, x = x.indep,
                      training_frame = train.h2o,
                      ntrees = 500, max_depth = 4,
                      learn_rate = 0.01, seed = 1122,
                      distribution = "multinomial")
  |                                                                 |   0%
  |===                                                              |   5%
  |=====                                                            |   7%
  |======                                                           |   9%
  |=======                                                          |  10%
  |======================                                           |  33%
  |=====================================                            |  56%
  |====================================================             |  79%
  |================================================================ |  98%
  |=================================================================| 100%
## H2OBinomialMetrics: gbm
## MSE:  0.09860778
## RMSE:  0.3140188
## LogLoss:  0.3206876
## Mean Per-Class Error:  0.2120263
## AUC:  0.8630573
## Gini:  0.7261146
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      266 48 0.152866  =48/314
## 1       16 43 0.271186   =16/59
## Totals 282 91 0.171582  =64/373
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.189757 0.573333  90
## 2                     max f2  0.130895 0.693717 145
## 3               max f0point5  0.327346 0.598802  26
## 4               max accuracy  0.442757 0.876676  14
## 5              max precision  0.802184 1.000000   0
## 6                 max recall  0.049990 1.000000 284
## 7            max specificity  0.802184 1.000000   0
## 8           max absolute_mcc  0.169135 0.496486 104
## 9 max min_per_class_accuracy  0.169135 0.796610 104
## 10 max mean_per_class_accuracy  0.169135 0.805948 104
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501205344484314. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.1355932
## [1] 0.8630573

This model correctly predicts 43 Top 10 hits, which is 10 more than the number predicted by Model 3. Moreover, its AUC metric is higher than the one obtained from Model 3.

As seen above, H2O’s API provides the ability to obtain key statistical measures required to analyze the models easily, using several built-in functions. The record label can experiment with different parameters to arrive at the model that predicts the maximum number of Top 10 hits at the desired level of accuracy and threshold.
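One way to run such experiments systematically is H2O’s built-in grid search, h2o.grid, which trains one model per combination of the supplied hyperparameter values. This is a sketch only; the grid values below are illustrative, not recommendations:

# Train a GBM for every combination of the listed hyperparameter values
gbm.grid <- h2o.grid("gbm",
                     x = x.indep, y = y.dep,
                     training_frame = train.h2o,
                     hyper_params = list(ntrees = c(250, 500),
                                         max_depth = c(3, 4, 5),
                                         learn_rate = c(0.01, 0.05)),
                     seed = 1122)
# Rank the resulting models by AUC, best first
h2o.getGrid(gbm.grid@grid_id, sort_by = "auc", decreasing = TRUE)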

H2O also allows you to experiment with deep learning models. Deep learning models have the ability to learn features implicitly, but can be more expensive computationally.

Now, create a deep learning model with the h2o.deeplearning function, using the same training and test datasets created before. The time taken to run this model depends on the type of EC2 instance chosen for this purpose.  For models that require more computation, consider using accelerated computing instances such as the P2 instance type.

  # The elapsed-time output below suggests the call was wrapped in system.time()
  system.time(
    dlearning.modelh <- h2o.deeplearning(y = y.dep,
                                         x = x.indep,
                                         training_frame = train.h2o,
                                         epochs = 250,
                                         hidden = c(250, 250),
                                         activation = "Rectifier",
                                         seed = 1122)
  )
  |                                                                 |   0%
  |===                                                              |   4%
  |=====                                                            |   8%
  |========                                                         |  12%
  |==========                                                       |  16%
  |=============                                                    |  20%
  |================                                                 |  24%
  |==================                                               |  28%
  |=====================                                            |  32%
  |=======================                                          |  36%
  |==========================                                       |  40%
  |=============================                                    |  44%
  |===============================                                  |  48%
  |==================================                               |  52%
  |====================================                             |  56%
  |=======================================                          |  60%
  |==========================================                       |  64%
  |============================================                     |  68%
  |===============================================                  |  72%
  |=================================================                |  76%
  |====================================================             |  80%
  |=======================================================          |  84%
  |=========================================================        |  88%
  |============================================================     |  92%
  |==============================================================   |  96%
  |=================================================================| 100%
##    user  system elapsed 
##   1.216   0.020 166.508
## H2OBinomialMetrics: deeplearning
## MSE:  0.1678359
## RMSE:  0.4096778
## LogLoss:  1.86509
## Mean Per-Class Error:  0.3433013
## AUC:  0.7568822
## Gini:  0.5137644
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      290 24 0.076433  =24/314
## 1       36 23 0.610169   =36/59
## Totals 326 47 0.160858  =60/373
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.826267 0.433962  46
## 2                     max f2  0.000000 0.588235 239
## 3               max f0point5  0.999929 0.511811  16
## 4               max accuracy  0.999999 0.865952  10
## 5              max precision  1.000000 1.000000   0
## 6                 max recall  0.000000 1.000000 326
## 7            max specificity  1.000000 1.000000   0
## 8           max absolute_mcc  0.999929 0.363219  16
## 9 max min_per_class_accuracy  0.000004 0.662420 145
## 10 max mean_per_class_accuracy  0.000000 0.685334 224
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.496293348880151. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.3898305
## [1] 0.7568822

The AUC metric for this model is 0.7568822, which is lower than what you got from the earlier models. I recommend further experimentation with different hyperparameters, such as the learning rate, the number of epochs, or the number of hidden layers.
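For example, one variation might look like the following sketch. Every value here is illustrative rather than a tuned recommendation, and the model name dlearning.modelh2 is hypothetical:

# A deeper network with dropout and a fixed (non-adaptive) learning rate
dlearning.modelh2 <- h2o.deeplearning(y = y.dep, x = x.indep,
                                      training_frame = train.h2o,
                                      epochs = 100,
                                      hidden = c(100, 100, 100),
                                      activation = "RectifierWithDropout",
                                      adaptive_rate = FALSE, rate = 0.005,
                                      seed = 1122)
# Compare its AUC against the 0.7568822 baseline above
h2o.auc(h2o.performance(dlearning.modelh2))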

H2O’s built-in functions provide many key statistical measures that can help measure model performance. Here are some of these key terms.

  • Sensitivity: Measures the proportion of positives that are correctly identified. It is also called the true positive rate, or recall.
  • Specificity: Measures the proportion of negatives that are correctly identified. It is also called the true negative rate.
  • Threshold: The cutoff point that balances sensitivity and specificity. While the model may not make its strongest predictions at this point, it is not biased towards positives or negatives.
  • Precision: The fraction of retrieved instances that are relevant to the information needed; for example, how many of the positively classified songs are truly positive.
  • AUC: Provides insight into how well the classifier is able to separate the two classes. It is particularly useful when the sample distribution is highly skewed and a model is tempted to overfit to a single class. A common grading scale: 0.9–1.0 = excellent (A); 0.8–0.9 = good (B); 0.7–0.8 = fair (C); 0.6–0.7 = poor (D); 0.5–0.6 = fail (F).
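To ground these definitions, here is a quick check against the GBM confusion matrix shown earlier (TP = 43, FN = 16, TN = 266, FP = 48):

tp <- 43; fn <- 16; tn <- 266; fp <- 48
tp / (tp + fn)   # sensitivity/recall: 43/59 = 0.7288 (i.e., 1 - 0.271186)
tn / (tn + fp)   # specificity: 266/314 = 0.8471 (i.e., 1 - 0.152866)
tp / (tp + fp)   # precision: 43/91 = 0.4725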

Here’s a summary of the metrics generated from H2O’s built-in functions for the three models that produced useful results.

Metric             Model 3     GBM Model   Deep Learning Model
Recall (t = 0.5)   0.2033898   0.1355932   0.3898305
AUC                0.8492389   0.8630573   0.7568822

Note: ‘t’ denotes threshold.

Your options at this point can be narrowed down to Model 3 and the GBM model, based on the AUC and accuracy metrics observed earlier. If the slightly lower accuracy of the GBM model is deemed acceptable, the record label can choose to go to production with the GBM model: it correctly predicts a higher number of Top 10 hits, and its AUC metric is also higher than that of Model 3.

Record labels can experiment with different learning techniques and parameters before arriving at a model that proves to be the best fit for their business. Because deep learning models can be computationally expensive, record labels can choose more powerful EC2 instances on AWS to run their experiments faster.

Summary

In this post, I showed how the popular music industry can use analytics to predict the type of songs that make the Top 10 Billboard charts. By running H2O’s scalable machine learning platform on AWS, data scientists can easily experiment with multiple modeling techniques and interactively query the data using Amazon Athena, without having to manage the underlying infrastructure. This helps record labels make critical decisions on the type of artists and songs to promote in a timely fashion, thereby increasing sales and revenue.

If you have questions or suggestions, please comment below.

Additional Reading

Learn how to build and explore a simple GEOINT application using SparkR.

About the Authors

Gopal Wunnava is a Partner Solution Architect with the AWS GSI Team. He works with partners and customers on big data engagements, and is passionate about building analytical solutions that drive business capabilities and decision making. In his spare time, he loves all things sports and movies related and is fond of old classics like Asterix and Obelix comics and Hitchcock movies.



Bob Strahan, a Senior Consultant with AWS Professional Services, contributed to this post.



Popcorn Time Creator Readies BitTorrent & Blockchain-Powered Video Platform

Post Syndicated from Andy original https://torrentfreak.com/popcorn-time-creator-readies-bittorrent-blockchain-powered-youtube-competitor-171012/

Without a doubt, YouTube is one of the most important websites available on the Internet today.

Its massive archive of videos brings pleasure to millions on a daily basis but its centralized nature means that owner Google always exercises control.

Over the years, people have looked to decentralize the YouTube concept and the latest project hoping to shake up the market has a particularly interesting player onboard.

Until 2015, only insiders knew that Argentinian designer Federico Abad was actually ‘Sebastian’, the shadowy figure behind notorious content sharing platform Popcorn Time.

Now he’s part of the team behind Flixxo, a BitTorrent and blockchain-powered startup hoping to wrestle a share of the video market from YouTube. Here’s how the team, which features blockchain startup RSK Labs, hopes things will play out.

The Flixxo network will have no centralized storage of data, eliminating the need for expensive hosting along with associated costs. Instead, transfers will take place between peers using BitTorrent, meaning video content will be stored on the machines of Flixxo users. In practice, the content will be downloaded and uploaded in much the same way as users do on The Pirate Bay or indeed Abad’s baby, Popcorn Time.

However, there’s a twist to the system that envisions content creators, content consumers, and network participants (seeders) making revenue from their efforts.

At the heart of the Flixxo system are digital tokens (think virtual currency), called Flixx. These Flixx ‘coins’, which will go on sale in 12 days, can be used to buy access to content. Creators can also opt to pay consumers when those people help to distribute their content to others.

“Free from structural costs, producers can share the earnings from their content with the network that supports them,” the team explains.

“This way you get paid for helping us improve Flixxo, and you earn credits (in the form of digital tokens called Flixx) for watching higher quality content. Having no intermediaries means that the price you pay for watching the content that you actually want to watch is lower and fairer.”

The Flixxo team

In addition to earning tokens from helping to distribute content, people in the Flixxo ecosystem can also earn currency by watching sponsored content, i.e. advertisements. While adverts are often considered a nuisance in a traditional system, Flixx tokens have real value, with a promise that users will be able to trade their Flixx not only for videos, but also for tangible and semi-tangible goods.

“Use your Flixx to reward the producers you follow, encouraging them to create more awesome content. Or keep your Flixx in your wallet and use them to buy a movie ticket, a pair of shoes from an online retailer, a chest of coins in your favourite game or even convert them to old-fashioned cash or up-and-coming digital assets, like Bitcoin,” the team explains.

The Flixxo team have big plans. After its founding in early 2016, the second quarter of 2017 saw the completion of a functional alpha release. In a little under two weeks, the project will begin its token generation event, with new offices in Los Angeles planned for the first half of 2018 alongside a premiere of the Flixxo platform.

“A total of 1,000,000,000 (one billion) Flixx tokens will be issued. A maximum of 300,000,000 (three hundred million) tokens will be sold. Some of these tokens (not more than 33% or 100,000,000 Flixx) may be sold with anticipation of the token allocation event to strategic investors,” Flixxo states.

Like all content platforms, Flixxo will live or die by the quality of the content it provides and whether, at least in the first instance, it can persuade people to part with their hard-earned cash. Only time will tell whether its content will be worth a premium over readily accessible YouTube content but with much-reduced costs, it may tempt creators seeking a bigger piece of the pie.

“Flixxo will also educate its community, teaching its users that in this new internet era value can be held and transferred online without intermediaries, a value that can be earned back by participating in a community, by contributing, being rewarded for every single social interaction,” the team explains.

Of course, the elephant in the room is what will happen when people begin sharing copyrighted content via Flixxo. Certainly, the fact that Popcorn Time’s founder is a key player and rival streaming platform Stremio is listed as a partner means that things could get a bit spicy later on.

Nevertheless, the team suggests that piracy and spam content distribution will be limited by mechanisms already built into the system.

“[A]uthors have to time-block tokens in a smart contract (set as a warranty) in order to upload content. This contract will also handle and block their earnings for a certain period of time, so that in the case of a dispute the unfair-uploader may lose those tokens,” they explain.

That being said, Flixxo also says that “there is no way” for third parties to censor content “which means that anyone has the chance of making any piece of media available on the network.” However, Flixxo says it will develop tools for filtering what it describes as “inappropriate content.”

At this point, things start to become a little unclear. On the one hand Flixxo says it could become a “revolutionary tool for uncensorable and untraceable media” yet on the other it says that it’s necessary to ensure that adult content, for example, isn’t seen by kids.

“We know there is a thin line between filtering or curating content and censorship, and it is a fact that we have an open network for everyone to upload any content. However, Flixxo as a platform will apply certain filtering based on clear rules – there should be a behavior-code for uploaders in order to offer the right content to the right user,” Flixxo explains.

To this end, Flixxo says it will deploy a centralized curation function, carried out by 101 delegates elected by the community, which will become progressively decentralized over time.

“This curation will have a cost, paid in Flixx, and will be collected from the warranty blocked by the content uploaders,” they add.

There can be little doubt that if Flixxo begins ‘curating’ unsuitable content, copyright holders will call on it to do the same for their content too. And, if the platform really takes off, 101 curators probably won’t scratch the surface. There’s also the not inconsiderable issue of what might happen to curators’ judgment when they’re incentivized to block content.

Finally, for those sick of “not available in your region” messages, there’s good and bad news. Flixxo insists there will be no geo-blocking of content on its part but individual creators will still have that feature available to them, should they choose.

The Flixx whitepaper can be downloaded here (pdf)

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Pirate Bay is Mining Cryptocurrency Again, No Opt Out

Post Syndicated from Ernesto original https://torrentfreak.com/pirate-bay-is-mining-cryptocurrency-again-no-opt-out-171011/

Last month The Pirate Bay caused some uproar by adding a Javascript-based cryptocurrency miner to its website.

The miner utilizes CPU power from visitors to generate Monero coins for the site, providing an extra source of revenue.

The Pirate Bay only tested the option briefly, but that was enough to inspire many others to follow suit. Now, a few weeks later, The Pirate Bay has turned the miners on again.

The miner is not directly embedded in the site’s core code but runs through an ad script. Many ad blockers and anti-malware tools block these requests, but people who don’t use any will see a clear spike in CPU usage when they access the site.

The Pirate Bay team previously said that they were testing the miner to see if it can replace ads. While there is some real revenue potential, for now, it’s running in addition to the regular banners. It’s unclear whether the current mining period is another test or if it will run permanently from now on.

The miner does appear to be throttled to a certain degree, so most users might not even notice that it’s running.

Pirate Bay load requests

Running a cryptocurrency miner such as the Coinhive script TPB is currently using is not without risk. Aside from user complaints, there is an issue that may make it harder for the site to operate in the future.

Last week we reported that CDN provider Cloudflare had suspended the account of torrent proxy site ProxyBunker, flagging its coin miner as malware. This means that The Pirate Bay now risks losing the Cloudflare service, which they rely on for DDoS protection, among other things.

Cloudflare’s suspension of ProxyBunker occurred even though the site provided users with an option to disable the miner. This functionality was implemented by Coinhive after the script was misused by some sites, which ran it without alerting their users.

The Pirate Bay currently has no opt-out option, nor has it informed users about the latest mining efforts. This could lead to another problem since Coinhive said it would crack down on customers who failed to keep users in the loop.

“We will verify this opt-in on our servers and will implement it in a way that it can not be circumvented. We will pledge to keep the opt-in intact at all times, without exceptions,” the Coinhive team previously noted.

The Pirate Bay team has not commented on the issue thus far. In theory, it’s possible that a rogue advertiser is responsible for the latest mining efforts. If that’s the case it will be disabled soon enough.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Bringing Clean and Safe Drinking Water to Developing Countries

Post Syndicated from Roderick Bauer original https://www.backblaze.com/blog/keeping-charity-water-data-safe/


If you’d like to read more about charity: water‘s use of Backblaze for Business, visit backblaze.com/charitywater/

charity: water + Backblaze for Business

Considering that charity: water sends workers with laptop computers to rural communities in 24 countries around the world, it’s not surprising that computer backup is needed on every computer they have. It’s so essential that Matt Ward, System Administrator for charity: water, says it’s a standard part of employee on-boarding.

charity: water, based in New York City, is a non-profit organization that is working to bring clean water to the nearly one in ten people around the world who live without it — a situation that affects not only health, but education and income.

“We have people constantly traveling all over the world, so a cloud-based service makes sense whether the user is in New York or Malawi. Most of our projects and beneficiaries are in Sub-Saharan Africa and Southern/Southeast Asia,” explains Matt. “Water scarcity and poor water quality are a problem here, and in so many countries around the world.”

charity: water in Rwanda

To achieve their mission, charity: water works through implementing organizations on the ground within the targeted communities. The people in these communities must spend hours every day walking to collect water for their families. It’s a losing proposition, as the time they spend walking takes away from education, earning money, and generally limits the opportunities for improving their lives.

charity: water began using Backblaze for Business before Matt came on a year ago. They started with a few licenses, but quickly decided to deploy Backblaze to every computer in the organization.

“We’ve lost computers plenty of times,” he says, “but, because of Backblaze, there’s never been a case where we lost the computer’s data.”

charity: water has about 80 staff computer users, and adds ten to twenty interns each season. Each staff member or intern has at least one computer. “Our IT department is two people, me and my director,” explains Matt, “and we have to support everyone, so being super simple to deploy is valuable to us.”

“When a new person joins us, we just send them an invitation to join the Group on Backblaze, and they’re all set. Their data is automatically backed up whenever they’re connected to the internet, and I can see their current status on the management console. [Backblaze] really nailed the user interface. You can show anyone the interface, even on their first day, and they get it because it’s simple and easy to understand.”


One of the frequent uses for Backblaze for Business is when Matt off-boards users, such as all the interns at the end of the season. He starts a restore through the Backblaze admin console even before he has the actual computer. “I know I have a reliable archive in the restore from Backblaze, and it’s easier than doing it directly from the laptop.”

Matt is an enthusiastic user of the features designed for business users, especially Backblaze’s Groups feature, which has enabled charity: water to centralize billing and computer management for their worldwide team. Businesses can create groups to cluster job functions, employee locations, or any other criteria.

charity: water delivering clean water to children

“It saves me time to be able to see the status of any user’s backups, such as the last time the data was backed up,” explains Matt. Before Backblaze, charity: water was writing documentation for workers, hoping they would follow backup protocols. Now, Matt knows what’s going on in real time — a valuable feature when the laptops are dispersed around the world.

“Backblaze for Business is an essential element in any organization’s IT continuity plan,” says Matt. “You need to be sure that there is a backup solution for your data should anything go wrong.”

To learn more about how charity: water uses Backblaze for Business, visit backblaze.com/charitywater/.


Matt Ward, System Administrator for charity: water

The post Bringing Clean and Safe Drinking Water to Developing Countries appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Cloudflare Bans Sites For Using Cryptocurrency Miners

Post Syndicated from Andy original https://torrentfreak.com/cloudflare-bans-sites-for-using-cryptocurrency-miners-171004/

After years of accepting donations via Bitcoin, last month various ‘pirate’ sites began to generate digital currency revenues in a brand new way.

It all began with The Pirate Bay, which quietly added a Javascript cryptocurrency miner to its main site, something that first manifested itself as a large spike in CPU utilization on the machines of visitors.

The stealth addition to the platform, which its operators later described as a test, was extremely controversial. While many thought of the miner as a cool and innovative way to generate revenue in a secure fashion, a vocal majority expressed a preference for permission being requested first, in case they didn’t want to participate in the program.

Over the past couple of weeks, several other sites have added similar miners, some which ask permission to run and others that do not. While the former probably aren’t considered problematic, the latter are now being viewed as a serious problem by an unexpected player in the ecosystem.

TorrentFreak has learned that popular CDN service Cloudflare, which is often criticized for not being harsh enough on ‘pirate’ sites, is actively suspending the accounts of sites that deploy cryptocurrency miners on their platforms.

“Cloudflare kicked us from their service for using a Coinhive miner,” the operator of ProxyBunker.online informed TF this morning.

ProxyBunker is a site that links to several other domains offering unofficial proxy services for the likes of The Pirate Bay, RARBG, KickassTorrents, Torrentz2, and dozens of other sites. It first tested a miner for four days starting September 23. Official implementation began October 1 but ended abruptly last evening.

“Late last night, all our domains got deleted off Cloudflare without warning so I emailed Cloudflare to ask what was going on,” the operator explained.

Bye bye

As the email above shows, Cloudflare cited only a “possible” terms of service violation. Further clarification was needed to get to the root of the problem.

So, just a few minutes later, the site operator contacted Cloudflare, acknowledging the suspension but pointing out that the notification email was somewhat vague and didn’t give a reason for the violation. A follow-up email from Cloudflare certainly put some meat on the bones.

“Multiple domains in your account were injecting Coinhive mining code without notifying users and without any option to disabling [sic] the mining,” wrote Justin Paine, Head of Trust & Safety at Cloudflare.

“We consider this to be malware, and as such the account was suspended, and all domains removed from Cloudflare.”

Cloudflare: Unannounced miners are malware

ProxyBunker’s operator wrote back to Cloudflare explaining that the Coinhive miner had been running on his domains but that his main domain had a way of disabling mining, as per new code made available from Coinhive.

“We were running the miner on our proxybunker.online domain using Coinhive’s new Javacode Simple Miner UI that lets the user stop the miner at anytime and set the CPU speed it mines at,” he told TF.

Nevertheless, some element of the configuration appears to have fallen short of Cloudflare’s standards. So, shortly after Cloudflare’s explanation, the site operator asked if he could be reinstated if he completely removed the miner from his site. The response was a ‘yes’ but with a stern caveat attached.

“We will remove the account suspension, however do note you’ll need to re-sign up the domains as they were removed as a result of the account suspension. Please note — if we discover similar activity again the domains and account will be permanently blocked,” Cloudflare’s Justin warned.

ProxyBunker’s operator says that while he sees the value in cryptocurrency miners, he can understand why people might be opposed to them too. That being said, he would appreciate it if services like Cloudflare published clear guidelines on what is and is not acceptable.

“We do understand that most users will not like the miner using up a bit of their CPU but we do see the full potential as a new revenue stream,” he explains.

“I think third-party services need to post clear information that they’re not allowed on their services, if that’s the case.”

At time of publication, Cloudflare had not responded to TorrentFreak’s requests for comment.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.