Tag Archives: map

Weekly roundup: Successful juggling

Post Syndicated from Eevee original https://eev.ee/dev/2017/06/19/weekly-roundup-successful-juggling/

Despite flipping my sleep, as I seem to end up doing every month now, I’ve had a pretty solid week. We finally got our hands on a Switch, so I just played Zelda to stay up a ridiculously long time and restore my schedule pretty quickly.

  • potluck: I started building the potluck game in LÖVE, and it’s certainly come along much faster — I have map transitions, dialogue, and a couple moving platforms working. I still don’t quite know what this game is, but I’m starting to get some ideas.

    I also launched GAMES MADE QUICK??? 1½, a game jam for making a game while watching GDQ, instead of just plain watching GDQ. I intend to spend the week working on the potluck game, though I’m not sure whether I’ll finish it then.

  • fox flux: I started planning out a more interesting overworld and doodled a couple relevant tiles. Terrain is still hard. Also some more player frames.

  • art: I finally finished a glorious new banner, which now hangs proudly above my Twitter and Patreon. I did a bedtime slate doodle. I made and animated a low-poly Yoshi. I sketched Styx based on a photo.

I keep wishing I had time to dedicate to painting experiments, but I guess this is pretty good output.

  • veekun: Wow! I touched veekun on three separate occasions. I have basic item data actually physically dumping now, I fixed some stuff with Pokémon, and I got evolutions working. Progress! Getting there! So close!

  • blog: Per request, I wrote about digital painting software, though it was hampered slightly by the fact that most of it doesn’t run on my operating system.

I seem to be maintaining tangible momentum on multiple big projects, which is fantastic. And there’s still 40% of the month left! I’m feeling pretty good about where I’m standing; if I can get potluck and veekun done soon, that’ll be a medium and a VERY LARGE weight off my shoulders.

Digital painter rundown

Post Syndicated from Eevee original https://eev.ee/blog/2017/06/17/digital-painter-rundown/

Another patron post! IndustrialRobot asks:

You should totally write about drawing/image manipulation programs! (Inspired by https://eev.ee/blog/2015/05/31/text-editor-rundown/)

This is a little trickier than a text editor comparison — while most text editors are cross-platform, quite a few digital art programs are not. So I’m effectively unable to even try a decent chunk of the offerings. I’m also still a relatively new artist, and image editors are much harder to briefly compare than text editors…

Right, now that your expectations have been suitably lowered:

Krita

I do all of my digital art in Krita. It’s pretty alright.

Okay so Krita grew out of Calligra, which used to be KOffice, which was an office suite designed for KDE (a Linux desktop environment). I bring this up because KDE has a certain… reputation. With KDE, there are at least three completely different ways to do anything, each of those ways has ludicrous amounts of customization and settings, and somehow it still can’t do what you want.

Krita inherits this aesthetic by attempting to do literally everything. It has 17 different brush engines, more than 70 layer blending modes, seven color picker dockers, and an ungodly number of colorspaces. It’s clearly intended primarily for drawing, but it also supports animation and vector layers and a pretty decent spread of raster editing tools. I just right now discovered that it has Photoshop-like “layer styles” (e.g. drop shadow), after a year and a half of using it.

In fairness, Krita manages all of this stuff well enough, and (apparently!) it manages to stay out of your way if you’re not using it. In less fairness, they managed to break erasing with a Wacom tablet pen for three months?

I don’t want to rag on it too hard; it’s an impressive piece of work, and I enjoy using it! The emotion it evokes isn’t so much frustration as… mystified bewilderment.

I once filed a ticket suggesting the addition of a brush size palette — a panel showing a grid of fixed brush sizes that makes it easy to switch between known sizes with a tablet pen (and increases the chances that you’ll be able to get a brush back to the right size again). It’s a prominent feature of Paint Tool SAI and Clip Studio Paint, and while I’ve never used either of those myself, I’ve seen a good few artists swear by it.

The developer response was that I could emulate the behavior by creating brush presets. But that’s flat-out wrong: getting the same effect would require creating a ton of brush presets for every brush I have, plus giving them all distinct icons so the size is obvious at a glance. Even then, it would be much more tedious to use and fill my presets with junk.

And that sort of response is what’s so mysterious to me. I’ve never even been able to use this feature myself, but a year of amateur painting with Krita has convinced me that it would be pretty useful. But a developer didn’t see the use and suggested an incredibly tedious alternative that only half-solves the problem and creates new ones. Meanwhile, of the 28 existing dockable panels, a quarter of them are different ways to choose colors.

What is Krita trying to be, then? What does Krita think it is? Who precisely is the target audience? I have no idea.


Anyway, I enjoy drawing in Krita well enough. It ships with a respectable set of brushes, and there are plenty more floating around. It has canvas rotation, canvas mirroring, perspective guide tools, and other art goodies. It doesn’t colordrop on right click by default, which is arguably a grave sin (it shows a customizable radial menu instead), but that’s easy to rebind. It understands having a background color beneath a bottom transparent layer, which is very nice. You can also toggle any brush between painting and erasing with the press of a button, and that turns out to be very useful.

It doesn’t support infinite canvases, though it does offer a one-click button to extend the canvas in a given direction. I’ve never used it (and didn’t even know what it did until just now), but I would totally use an infinite canvas.

I haven’t used the animation support too much, but it’s pretty nice to have. Granted, the only other animation software I’ve used is Aseprite, so I don’t have many points of reference here. It’s a relatively new addition, too, so I assume it’ll improve over time.

The one annoyance I remember with animation was really an interaction with a larger annoyance, which is: working with selections kind of sucks. You can’t drag a selection around with the selection tool; you have to switch to the move tool. That would be fine if you could at least drag the selection ring around with the selection tool, but you can’t do that either; dragging just creates a new selection.

If you want to copy a selection, you have to explicitly copy it to the clipboard and paste it, which creates a new layer. Ctrl-drag with the move tool doesn’t work. So then you have to merge that layer down, which I think is where the problem with animation comes in: a new layer is non-animated by default, meaning it effectively appears in every frame, so simply merging it down will merge it onto every single frame of the layer below. And you won’t even notice until you switch frames or play back the animation. Not ideal.

This is another thing that makes me wonder about Krita’s sense of identity. It has a lot of fancy general-purpose raster editing features that even GIMP is still struggling to implement, like high color depth support and non-destructive filters, yet something as basic as working with selections is clumsy. (In fairness, GIMP is a bit clumsy here too, but it has a consistent notion of “floating selection” that’s easy enough to work with.)

I don’t know how well Krita would work as a general-purpose raster editor; I’ve never tried to use it that way. I can’t think of anything obvious that’s missing. The only real gotcha is that some things you might expect to be tools, like smudge or clone, are just types of brush in Krita.

GIMP

Ah, GIMP — open source’s answer to Photoshop.

It’s very obviously intended for raster editing, and I’m pretty familiar with it after half a lifetime of only using Linux. I even wrote a little Scheme script for it ages ago to automate some simple edits to a couple hundred files, back before I was aware of ImageMagick. I don’t know what to say about it, specifically; it’s fairly powerful and does a wide variety of things.

In fact I’d say it’s almost frustratingly intended for raster editing. I used GIMP in my first attempts at digital painting, before I’d heard of Krita. It was okay, but so much of it felt clunky and awkward. Painting is split between a pencil tool, a paintbrush tool, and an airbrush tool; I don’t really know why. The default brushes are largely uninteresting. Instead of brush presets, there are tool presets that can be saved for any tool; it’s a neat idea, but doesn’t feel like a real substitute for brush presets.

Much of the same functionality as Krita is there, but it’s all somehow more clunky. I’m sure it’s possible to fiddle with the interface to get something friendlier for painting, but I never really figured out how.

And then there’s the surprising stuff that’s missing. There’s no canvas rotation, for example. There’s only one type of brush, and it just stamps the same pattern along a path. I don’t think it’s possible to smear or blend or pick up color while painting. The only way to change the brush size is via the very sensitive slider on the tool options panel, which I remember being a little annoying with a tablet pen. Also, you have to specifically enable tablet support? It’s not difficult or anything, but I have no idea why the default is to ignore tablet pressure and treat it like a regular mouse cursor.

As I mentioned above, there’s also no support for high color depth or non-destructive editing, which is honestly a little embarrassing. Those are the major things Serious Professionals™ have been asking for for ages, and GIMP has been trying to provide them, but it’s taking a very long time. The first signs of GEGL, a new library intended to provide these features, appeared in GIMP 2.6… in 2008. The last major release was in 2012. GIMP has been working on this new plumbing for almost as long as Krita’s entire development history. (To be fair, Krita has also raised almost €90,000 from three Kickstarters to fund its development; I don’t know that GIMP is funded at all.)

I don’t know what’s up with GIMP nowadays. It’s still under active development, but the exact status and roadmap are a little unclear. I still use it for some general-purpose editing, but I don’t see any reason to use it to draw.

I do know that canvas rotation will be in the next release, and there was some experimentation with embedding MyPaint’s brush engine (though when I tried it it was basically unusable), so maybe GIMP is interested in wooing artists? I guess we’ll see.

MyPaint

Ah, MyPaint. I gave it a try once. Once.

It’s a shame, really. It sounds pretty great: specifically built for drawing, has very powerful brushes, supports an infinite canvas, supports canvas rotation, has a simple UI that gets out of your way. Perfect.

Or so it seems. But in MyPaint’s eagerness to shed unnecessary raster editing tools, it forgot a few of the more useful ones. Like selections.

MyPaint has no notion of a selection, nor of copy/paste. If you want to move a head to align better to a body, for example, the sanctioned approach is to duplicate the layer, erase the head from the old layer, erase everything but the head from the new layer, then move the new layer.

I can’t find anything that resembles HSL adjustment, either. I guess the workaround for that is to create H/S/L layers and floodfill them with different colors until you get what you want.

I can’t work seriously without these basic editing tools. I could see myself doodling in MyPaint, but Krita works just as well for doodling as for serious painting, so I’ve never gone back to it.

Drawpile

Drawpile is the modern equivalent to OpenCanvas, I suppose? It lets multiple people draw on the same canvas simultaneously. (I would not recommend it as a general-purpose raster editor.)

It’s a little clunky in places — I sometimes have bugs where keyboard focus gets stuck in the chat, or my tablet cursor becomes invisible — but the collaborative part works surprisingly well. It’s not a brush powerhouse or anything, and I don’t think it allows textured brushes, but it supports tablet pressure and canvas rotation and locked alpha and selections and whatnot.

I’ve used it a couple times, and it’s worked well enough that… well, other people made pretty decent drawings with it? I’m not sure I’ve managed yet. And I wouldn’t use it single-player. Still, it’s fun.

Aseprite

Aseprite is for pixel art so it doesn’t really belong here at all. But it’s very good at that and I like it a lot.

That’s all

I can’t name any other serious contender that exists for Linux.

I’m dimly aware of a thing called “Photo Shop” that’s more intended for photos but functions as a passable painter. More artists seem to swear by Paint Tool SAI and Clip Studio Paint. Also there’s Paint.NET, but I have no idea how well it’s actually suited for painting.

And that’s it! That’s all I’ve got. Krita for drawing, GIMP for editing, Drawpile for collaborative doodling.

AWS Named as a Leader in Gartner’s Infrastructure as a Service (IaaS) Magic Quadrant for 7th Consecutive Year

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-named-as-a-leader-in-gartners-infrastructure-as-a-service-iaas-magic-quadrant-for-7th-consecutive-year/

Every product planning session at AWS revolves around customers. We do our best to listen and to learn, and to use what we hear to build the roadmaps for future development. Approximately 90% of the items on the roadmap originate with customer requests and are designed to meet specific needs and requirements that they share with us.

I strongly believe that this customer-driven innovation has helped us to secure the top-right corner of the Leaders quadrant in Gartner’s Magic Quadrant for Cloud Infrastructure as a Service (IaaS) for the 7th consecutive year, earning highest placement for ability to execute and furthest for completeness of vision:

To learn more, read the full report. It contains a lot of detail and is a great summary of the features and factors that our customers examine when choosing a cloud provider.

Jeff;

BackMap, the haptic navigation system

Post Syndicated from Janina Ander original https://www.raspberrypi.org/blog/backmap-haptic/

At this year’s TechCrunch Disrupt NY hackathon, one team presented BackMap, a haptic feedback system which helps visually impaired people to navigate cities and venues. It is assisted by a Raspberry Pi and integrated into a backpack.

Good vibrations with BackMap

The team, including Shashank Sharma, wrote an iOS phone app in Swift, Apple’s open-source programming language. To convert between addresses and geolocations, they used the Esri APIs offered by PubNub. So far, so standard. However, they then configured their BackMap setup so that the user can input their destination via the app, and then follow the route without having to look at a screen or listen to directions. Instead, vibrating motors have been integrated into the straps of a backpack and hooked up to a Raspberry Pi. Whenever the user needs to turn left or right, the Pi makes the respective motor vibrate.
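
The team’s own source isn’t linked, but the motor-driving side of a build like this is straightforward to sketch. Below is a minimal, hypothetical Python example for the Pi; the GPIO pin numbers, pulse length, and hard-coded list of turn instructions are all assumptions, not details from the BackMap project.

# Minimal sketch of the haptic side of a build like this, assuming the two
# vibration motors are driven through transistors on GPIO 17 and 27.
# Pin numbers, timings, and the turn_instructions list are all hypothetical.
import time
import RPi.GPIO as GPIO

LEFT_MOTOR = 17
RIGHT_MOTOR = 27

GPIO.setmode(GPIO.BCM)
GPIO.setup([LEFT_MOTOR, RIGHT_MOTOR], GPIO.OUT, initial=GPIO.LOW)

def buzz(pin, seconds=0.5):
    """Pulse one motor long enough to be felt through a backpack strap."""
    GPIO.output(pin, GPIO.HIGH)
    time.sleep(seconds)
    GPIO.output(pin, GPIO.LOW)

try:
    # In the real project these instructions would arrive from the phone app;
    # here they are hard-coded purely for illustration.
    turn_instructions = ["left", "right", "left"]
    for turn in turn_instructions:
        buzz(LEFT_MOTOR if turn == "left" else RIGHT_MOTOR)
        time.sleep(2)  # wait before the next cue
finally:
    GPIO.cleanup()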

Disrupt NY 2017 Hackathon | Part 1

Disrupt NY 2017 Hackathon presentations filmed live on May 15th, 2017. Preceding the Disrupt Conference is Hackathon weekend on May 13-14, where developers and engineers descend from all over the world to take part in a 24-hour hacking endurance test.

BackMap can also be adapted for indoor navigation by receiving signals from beacons. This could be used to direct users to toilet facilities or exhibition booths at conferences. The team hopes to upgrade the BackMap device to use a wristband format in the future.

Accessible Pi

Here at Pi Towers, we are always glad to see Pi builds for people with disabilities: we’ve seen Sanskriti and Aman’s Braille teacher Mudra, the audio e-reader Valdema by Finnish non-profit Kolibre, and Myrijam and Paul’s award-winning, eye-movement-controlled wheelchair, to name but a few.

Our mission is to bring the power of coding and digital making to everyone, and we are lucky to be part of a diverse community of makers and educators who have often worked proactively to make events and resources accessible to as many people as possible. There is, for example, the autism- and Tourette’s syndrome-friendly South London Raspberry Jam, organised by Femi Owolade-Coombes and his mum Grace. The Raspberry VI website is a portal to all things Pi for visually impaired and blind people. Deaf digital makers may find Jim Roberts’ video tutorials, which are signed in ASL, useful. And anyone can contribute subtitles in any language to our YouTube channel.

If you create or use accessible tutorials, or run a Jam, Code Club, or CoderDojo that is designed to be friendly to people who are neuroatypical or have a disability, let us know how to find your resource or event in the comments!

The post BackMap, the haptic navigation system appeared first on Raspberry Pi.

Latency Distribution Graph in AWS X-Ray

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/latency-distribution-graph-in-aws-x-ray/

We’re continuing to iterate on the AWS X-Ray service based on customer feedback and today we’re excited to release a set of tools to help you quickly dive deep on latencies in your applications. Visual Node and Edge latency distribution graphs are shown in a handy new “Service Details” side bar in your X-Ray Service Map.

The X-Ray service graph gives you a visual representation of services and their interactions over a period of time that you select. The nodes represent services and the edges between the nodes represent calls between the services. The nodes and edges each have a set of statistics associated with them. While the visualizations provided in the service map are useful for estimating the average latency in an application, they don’t help you dive deep on specific issues. Most of the time, issues occur at statistical outliers. To alleviate this, X-Ray computes histograms like the one above to help you solve those 99th percentile bugs.

To see a Response Distribution for a Node just click on it in the service graph. You can also click on the edges between the nodes to see the Response Distribution from the viewpoint of the calling service.

The team had a few interesting problems to solve while building out this feature, and I wanted to share a bit of that with you now! Given the large number of traces an app can produce, it’s not a great idea (for your browser) to plot every single trace client side. Instead, most plotting libraries, when dealing with many points, use approximations and bucketing to get a network- and performance-friendly histogram. If you’ve used monitoring software in the past, you’ve probably seen that as you zoom in on the data, you get higher fidelity. The interesting thing about the latencies coming in from X-Ray is that they vary by several orders of magnitude.

If the latencies were strictly distributed between 0s and 1s, you could easily just create 10 buckets of 100 milliseconds each. If your apps are anything like mine, there’s a lot of interesting stuff happening in the outliers, so it’s beneficial to have more fidelity at the 1st and 99th percentiles than at the 50th. The problem with fixed bucket sizes is that they don’t necessarily give you an accurate summary of the data. So X-Ray, for now, uses dynamic bucket sizing based on the t-digest algorithm by Ted Dunning and Otmar Ertl. One of the distinct advantages of this algorithm over other approximation algorithms is its accuracy and precision at the extremes (where most errors typically are).

An additional advantage of X-Ray over other monitoring software is the ability to measure two perspectives of latency simultaneously. Developers almost always have some view into the server side latency from their application logs but with X-Ray you can examine latency from the view of each of the clients, services, and microservices that you’re interacting with. You can even dive deeper by adding additional restrictions and queries on your selection. You can identify the specific users and clients that are having issues at that 99th percentile.

This info has already been available in API calls to GetServiceGraph as ResponseTimeHistogram but now we’re exposing it in the console as well to make it easier for customers to consume. For more information check out the documentation here.
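
For anyone who prefers the API to the console, below is a rough boto3 sketch of reading those histograms out of GetServiceGraph. It only prints the slowest bucket per edge, and the exact response fields are worth double-checking against the current X-Ray documentation; nothing here comes from the post itself.

# A rough sketch of pulling the same histogram data from the API with boto3.
# Bucket values/counts come back as a list of {"Value": latency, "Count": n}
# pairs; verify the exact response shape against the X-Ray docs.
from datetime import datetime, timedelta

import boto3

xray = boto3.client("xray")

end = datetime.utcnow()
start = end - timedelta(hours=1)

graph = xray.get_service_graph(StartTime=start, EndTime=end)

for service in graph["Services"]:
    for edge in service.get("Edges", []):
        histogram = edge.get("ResponseTimeHistogram", [])
        if histogram:
            slowest = max(histogram, key=lambda bucket: bucket["Value"])
            print(service["Name"], "-> edge", edge["ReferenceId"],
                  "slowest bucket:", slowest)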

Randall

Manage Instances at Scale without SSH Access Using EC2 Run Command

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/manage-instances-at-scale-without-ssh-access-using-ec2-run-command/

The guest post below, written by Ananth Vaidyanathan (Senior Product Manager for EC2 Systems Manager) and Rich Urmston (Senior Director of Cloud Architecture at Pegasystems) shows you how to use EC2 Run Command to manage a large collection of EC2 instances without having to resort to SSH.

Jeff;


Enterprises often have several managed environments and thousands of Amazon EC2 instances. It’s important to manage systems securely, without the headaches of Secure Shell (SSH). Run Command, part of Amazon EC2 Systems Manager, allows you to run remote commands on instances (or groups of instances using tags) in a controlled and auditable manner. It’s been a nice added productivity boost for Pega Cloud operations, which rely daily on Run Command services.

You can control Run Command access through standard IAM roles and policies, define documents to take input parameters, and control the S3 bucket used to return command output. You can also share your documents with other AWS accounts, or with the public. All in all, Run Command provides a nice set of remote management features.

Better than SSH
Here’s why Run Command is a better option than SSH and why Pegasystems has adopted it as their primary remote management tool:

Run Command Takes Less Time – Securely connecting to an instance requires a few steps, e.g. finding jumpboxes to connect to or IP addresses to whitelist. With Run Command, cloud ops engineers can invoke commands directly from their laptops and never have to find keys or even instance IDs. Instead, system security relies on AWS authentication, IAM roles, and policies.

Run Command Operations are Fully Audited – With SSH, there is no real control over what users can do, nor is there an audit trail. With Run Command, every invoked operation is audited in CloudTrail, including information on the invoking user, the instances on which the command was run, the parameters, and the operation status. You have full control and the ability to restrict what functions engineers can perform on a system.

Run Command has no SSH keys to Manage – Run Command leverages standard AWS credentials, API keys, and IAM policies. Through integration with a corporate auth system, engineers can interact with systems based on their corporate credentials and identity.

Run Command can Manage Multiple Systems at the Same Time – Simple tasks such as looking at the status of a Linux service or retrieving a log file across a fleet of managed instances are cumbersome using SSH. Run Command allows you to specify a list of instances by IDs or tags, and invokes your command, in parallel, across the specified fleet. This provides great leverage when troubleshooting or managing more than the smallest Pega clusters.

Run Command Makes Automating Complex Tasks Easier – Standardizing operational tasks requires detailed procedure documents or scripts describing the exact commands. Managing or deploying these scripts across the fleet is cumbersome. Run Command documents provide an easy way to encapsulate complex functions, and handle document management and access controls. When combined with AWS Lambda, documents provide a powerful automation platform to handle any complex task.

Example – Restarting a Docker Container
Here is an example of a simple document used to restart a Docker container. It takes one parameter: the name of the Docker container to restart. It uses the AWS-RunShellScript method to invoke the command. The output is collected automatically by the service and returned to the caller. For an example of the latest document schema, see Creating Systems Manager Documents.

{
  "schemaVersion":"1.2",
  "description":"Restart the specified docker container.",
  "parameters":{
    "param":{
      "type":"String",
      "description":"(Required) name of the container to restart.",
      "maxChars":1024
    }
  },
  "runtimeConfig":{
    "aws:runShellScript":{
      "properties":[
        {
          "id":"0.aws:runShellScript",
          "runCommand":[
            "docker restart {{param}}"
          ]
        }
      ]
    }
  }
}
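
Outside of a wrapper like cuttysh (described below), a document like this can be invoked directly through the Systems Manager API. The following boto3 sketch is illustrative only; the document name, instance ID, container name, and output bucket are placeholders rather than values from this post.

# Hypothetical boto3 sketch of invoking a custom document like the one above.
# The document name, instance ID, container name, and bucket are placeholders.
import time

import boto3

ssm = boto3.client("ssm")

response = ssm.send_command(
    InstanceIds=["i-0123456789abcdef0"],          # placeholder instance
    DocumentName="RestartDockerContainer",        # assumed name for the document above
    Parameters={"param": ["pega-web"]},           # the container to restart
    OutputS3BucketName="my-run-command-output",   # optional: capture output in S3
)

command_id = response["Command"]["CommandId"]
time.sleep(5)  # in real code, poll until the invocation finishes

result = ssm.get_command_invocation(
    CommandId=command_id,
    InstanceId="i-0123456789abcdef0",
)
print(result["Status"], result.get("StandardOutputContent", ""))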

Putting Run Command into practice at Pegasystems
The Pegasystems provisioning system sits on AWS CloudFormation, which is used to deploy and update Pega Cloud resources. Layered on top of it is the Pega Provisioning Engine, a serverless, Lambda-based service that manages a library of CloudFormation templates and Ansible playbooks.

A Configuration Management Database (CMDB) tracks all the configuration details and history of every deployment and update, and lays out its data using a hierarchical directory naming convention. The following diagram shows how the various systems are integrated:

For cloud system management, Pega operations uses a command line version called cuttysh and a graphical version based on the Pega 7 platform, called the Pega Operations Portal. Both tools allow you to browse the CMDB of deployed environments, view configuration settings, and interact with deployed EC2 instances through Run Command.

CLI Walkthrough
Here is a CLI walkthrough for looking into a customer deployment and interacting with instances using Run Command.

Launching the cuttysh tool brings you to the root of the CMDB and a list of the provisioned customers:

% cuttysh
d CUSTA
d CUSTB
d CUSTC
d CUSTD

You interact with the CMDB using standard Linux shell commands, such as cd, ls, cat, and grep. Items prefixed with s are services that have viewable properties. Items prefixed with d are navigable subdirectories in the CMDB hierarchy.

In this example, change directories into customer CUSTB’s portion of the CMDB hierarchy, and then further into a provisioned Pega environment called env1, under the Dev network. The tool displays the artifacts that are provisioned for that environment. These entries map to provisioned CloudFormation templates.

> cd CUSTB
/ROOT/CUSTB/us-east-1 > cd DEV/env1

The ls -l command shows the version of the provisioned resources. These version numbers map back to source control–managed artifacts for the CloudFormation, Ansible, and other components that compose a version of the Pega Cloud.

/ROOT/CUSTB/us-east-1/DEV/env1 > ls -l
s 1.2.5 RDSDatabase 
s 1.2.5 PegaAppTier 
s 7.2.1 Pega7 

Now, use Run Command to interact with the deployed environments. To do this, use the attach command and specify the service with which to interact. In the following example, you attach to the Pega Web Tier. Using the information in the CMDB and instance tags, the CLI finds the corresponding EC2 instances and displays some basic information about them. This deployment has three instances.

/ROOT/CUSTB/us-east-1/DEV/env1 > attach PegaWebTier
 # ID         State  Public Ip    Private Ip  Launch Time
 0 i-0cf0e84 running 52.63.216.42 10.96.15.70 2017-01-16 
 1 i-0043c1d running 53.47.191.22 10.96.15.43 2017-01-16 
 2 i-09b879e running 55.93.118.27 10.96.15.19 2017-01-16 

From here, you can use the run command to invoke Run Command documents. In the following example, you run the docker-ps document against instance 0 (the first one on the list). EC2 executes the command and returns the output to the CLI, which in turn shows it.

/ROOT/CUSTB/us-east-1/DEV/env1 > run 0 docker-ps
. . 
CONTAINER ID IMAGE             CREATED      STATUS        NAMES
2f187cc38c1  pega-7.2         10 weeks ago  Up 8 weeks    pega-web

Using the same command and some of the other documents that have been defined, you can restart a Docker container or even pull back the contents of a file to your local system. When you get a file, Run Command also leaves a copy in an S3 bucket in case you want to pass the link along to a colleague.

/ROOT/CUSTB/us-east-1/DEV/env1 > run 0 docker-restart pega-web
..
pega-web

/ROOT/CUSTB/us-east-1/DEV/env1 > run 0 get-file /var/log/cfn-init-cmd.log
. . . . . 
get-file

Data has been copied locally to: /tmp/get-file/i-0563c9e/data
Data is also available in S3 at: s3://my-bucket/CUSTB/cuttysh/get-file/data

Now, leverage the Run Command ability to do more than one thing at a time. In the following example, you attach to a deployment with three running instances and want to see the uptime for each instance. Using the par (parallel) option for run, the CLI tells Run Command to execute the uptime document on all instances in parallel.

/ROOT/CUSTB/us-east-1/DEV/env1 > run par uptime
 …
Output for: i-006bdc991385c33
 20:39:12 up 15 days, 3:54, 0 users, load average: 0.42, 0.32, 0.30

Output for: i-09390dbff062618
 20:39:12 up 15 days, 3:54, 0 users, load average: 0.08, 0.19, 0.22

Output for: i-08367d0114c94f1
 20:39:12 up 15 days, 3:54, 0 users, load average: 0.36, 0.40, 0.40

Commands are complete.
/ROOT/PEGACLOUD/CUSTB/us-east-1/PROD/prod1 > 

Summary
Run Command improves productivity by giving you faster access to systems and the ability to run operations across a group of instances. Pega Cloud operations has integrated Run Command with other operational tools to provide a clean and secure method for managing systems. This greatly improves operational efficiency, and gives greater control over who can do what in managed deployments. The Pega continual improvement process regularly assesses why operators need access, and turns those operations into new Run Command documents to be added to the library. In fact, their long-term goal is to stop deploying cloud systems with SSH enabled.

If you have any questions or suggestions, please leave a comment for us!

— Ananth and Rich

credmap – The Credential Mapper

Post Syndicated from Darknet original http://feedproxy.google.com/~r/darknethackers/~3/0qwBmGRtTRY/

Credmap is an open source credential mapper tool that was created to bring awareness to the dangers of credential reuse. It is capable of testing supplied user credentials on several known websites to test if the password has been reused on any of these. It is not uncommon for people who are not experts in […]


Read the full post at darknet.org.uk

Teaching tech

Post Syndicated from Eevee original https://eev.ee/blog/2017/06/10/teaching-tech/

A sponsored post from Manishearth:

I would kinda like to hear about any thoughts you have on technical teaching or technical writing. Pedagogy is something I care about. But I don’t know how much you do, so feel free to ignore this suggestion 🙂

Good news: I care enough that I’m trying to write a sorta-kinda-teaching book!

Ironically, one of the biggest problems I’ve had with writing the introduction to that book is that I keep accidentally rambling on for pages about problems and difficulties with teaching technical subjects. So maybe this is a good chance to get it out of my system.

Phaser

I recently tried out a new thing. It was Phaser, but this isn’t a dig on them in particular, just a convenient example fresh in my mind. If anything, they’re better than most.

As you can see from Phaser’s website, it appears to have tons of documentation. Two of the six headings are “LEARN” and “EXAMPLES”, which seems very promising. And indeed, Phaser offers:

  • Several getting-started walkthroughs
  • Possibly hundreds of examples
  • A news feed that regularly links to third-party tutorials
  • Thorough API docs

Perfect. Beautiful. Surely, a dream.

Well, almost.

The examples are all microscopic, usually focused around a single tiny feature — many of them could be explained just as well with one line of code. There are a few example games, but they’re short aimless demos. None of them are complete games, and there’s no showcase either. Games sometimes pop up in the news feed, but most of them don’t include source code, so they’re not useful for learning from.

Likewise, the API docs are just API docs, leading to the sorts of problems you might imagine. For example, in a few places there’s a mention of a preUpdate stage that (naturally) happens before update. You might rightfully wonder what kinds of things happen in preUpdate — and more importantly, what should you put there, and why?

Let’s check the API docs for Phaser.Group.preUpdate:

The core preUpdate – as called by World.

Okay, that didn’t help too much, but let’s check what Phaser.World has to say:

The core preUpdate – as called by World.

Ah. Hm. It turns out World is a subclass of Group and inherits this method — and thus its unaltered docstring — from Group.

I did eventually find some brief docs attached to Phaser.Stage (but only by grepping the source code). It mentions what the framework uses preUpdate for, but not why, and not when I might want to use it too.


The trouble here is that there’s no narrative documentation — nothing explaining how the library is put together and how I’m supposed to use it. I get handed some brief primers and a massive reference, but nothing in between. It’s like buying an O’Reilly book and finding out it only has one chapter followed by a 500-page glossary.

API docs are great if you know specifically what you’re looking for, but they don’t explain the best way to approach higher-level problems, and they don’t offer much guidance on how to mesh nicely with the design of a framework or big library. Phaser does a decent chunk of stuff for you, off in the background somewhere, so it gives the strong impression that it expects you to build around it in a particular way… but it never tells you what that way is.

Tutorials

Ah, but this is what tutorials are for, right?

I confess I recoil whenever I hear the word “tutorial”. It conjures an image of a uniquely useless sort of post, which goes something like this:

  1. Look at this cool thing I made! I’ll teach you how to do it too.

  2. Press all of these buttons in this order. Here’s a screenshot, which looks nothing like what you have, because I’ve customized the hell out of everything.

  3. You did it!

The author is often less than forthcoming about why they made any of the decisions they did, where you might want to try something else, or what might go wrong (and how to fix it).

And this is to be expected! Writing out any of that stuff requires far more extensive knowledge than you need just to do the thing in the first place, and you need to do a good bit of introspection to sort out something coherent to say.

In other words, teaching is hard. It’s a skill, and it takes practice, and most people blogging are not experts at it. Including me!


With Phaser, I noticed that several of the third-party tutorials I tried to look at were 404s — sometimes less than a year after they were linked on the site. Pretty major downside to relying on the community for teaching resources.

But I also notice that… um…

Okay, look. I really am not trying to rag on this author. I’m not. They tried to share their knowledge with the world, and that’s a good thing, something worthy of praise. I’m glad they did it! I hope it helps someone.

But for the sake of example, here is the most recent entry in Phaser’s list of community tutorials. I have to link it, because it’s such a perfect example. Consider:

  • The post itself is a bulleted list of explanation followed by a single contiguous 250 lines of source code. (Not that there’s anything wrong with bulleted lists, mind you.) That code contains zero comments and zero blank lines.

  • This is only part two in what I think is a series aimed at beginners, yet the title and much of the prose focus on object pooling, a performance hack that’s easy to add later and that’s almost certainly unnecessary for a game this simple. There is no explanation of why this is done; the prose only says you’ll understand why it’s critical once you add a lot more game objects.

  • It turns out I only have two things to say here so I don’t know why I made this a bulleted list.

In short, it’s not really a guided explanation; it’s “look what I did”.

And that’s fine, and it can still be interesting. I’m not sure English is even this person’s first language, so I’m hardly going to criticize them for not writing a novel about platforming.

The trouble is that I doubt a beginner would walk away from this feeling very enlightened. They might be closer to having the game they wanted, so there’s still value in it, but it feels closer to having someone else do it for them. And an awful lot of tutorials I’ve seen — particularly of the “post on some blog” form (which I’m aware is the genre of thing I’m writing right now) — look similar.

This isn’t some huge social problem; it’s just people writing on their blog and contributing to the corpus of written knowledge. It does become a bit stickier when a large project relies on these community tutorials as its main set of teaching aids.


Again, I’m not ragging on Phaser here. I had a slightly frustrating experience with it, coming in knowing what I wanted but unable to find a description of the semantics anywhere, but I do sympathize. Teaching is hard, writing documentation is hard, and programmers would usually rather program than do either of those things. For free projects that run on volunteer work, and in an industry where anything other than programming is a little undervalued, getting good docs written can be tricky.

(Then again, Phaser sells books and plugins, so maybe they could hire a documentation writer. Or maybe the whole point is for you to buy the books?)

Some pretty good docs

Python has pretty good documentation. It introduces the language with a tutorial, then documents everything else in both a library and language reference.

This sounds an awful lot like Phaser’s setup, but there’s some considerable depth in the Python docs. The tutorial is highly narrative and walks through quite a few corners of the language, stopping to mention common pitfalls and possible use cases. I clicked an arbitrary heading and found a pleasant, informative read that somehow avoids being bewilderingly dense.

The API docs also take on a narrative tone — even something as humble as the collections module offers numerous examples, use cases, patterns, recipes, and hints of interesting ways you might extend the existing types.

I’m being a little vague and hand-wavey here, but it’s hard to give specific examples without just quoting two pages of Python documentation. Hopefully you can see right away what I mean if you just take a look at them. They’re good docs, Bront.

I’ve likewise always enjoyed the SQLAlchemy documentation, which follows much the same structure as the main Python documentation. SQLAlchemy is a database abstraction layer plus ORM, so it can do a lot of subtly intertwined stuff, and the complexity of the docs reflects this. Figuring out how to do very advanced things correctly, in particular, can be challenging. But for the most part it does a very thorough job of introducing you to a large library with a particular philosophy and how to best work alongside it.

I softly contrast this with, say, the Perl documentation.

It’s gotten better since I first learned Perl, but Perl’s docs are still a bit of a strange beast. They exist as a flat collection of manpage-like documents with terse names like perlootut. The documentation is certainly thorough, but much of it has a strange… allocation of detail.

For example, perllol — the explanation of how to make a list of lists, which somehow merits its own separate documentation — offers no fewer than nine similar variations of the same code for reading a file into a nested list of the words on each line. Where Python offers examples for a variety of different problems, Perl shows you a lot of subtly different ways to do the same basic thing.

A similar problem is that Perl’s docs sometimes offer far too much context; consider the references tutorial, which starts by explaining that references are a powerful “new” feature in Perl 5 (first released in 1994). It then explains why you might want to nest data structures… from a Perl 4 perspective, thus explaining why Perl 5 is so much better.

Some stuff I’ve tried

I don’t claim to be a great teacher. I like to talk about stuff I find interesting, and I try to do it in ways that are accessible to people who aren’t lugging around the mountain of context I already have. This being just some blog, it’s hard to tell how well that works, but I do my best.

I also know that I learn best when I can understand what’s going on, rather than just seeing surface-level cause and effect. Of course, with complex subjects, it’s hard to develop an understanding before you’ve seen the cause and effect a few times, so there’s a balancing act between showing examples and trying to provide an explanation. Too many concrete examples feel like rote memorization; too much abstract theory feels disconnected from anything tangible.

The attempt I’m most pleased with is probably my post on Perlin noise. It covers a fairly specific subject, which made it much easier. It builds up one step at a time from scratch, with visualizations at every point. It offers some interpretations of what’s going on. It clearly explains some possible extensions to the idea, but distinguishes those from the core concept.

It is a little math-heavy, I grant you, but that was hard to avoid with a fundamentally mathematical topic. I had to be economical with the background information, so I let the math be a little dense in places.

But the best part about it by far is that I learned a lot about Perlin noise in the process of writing it. In several places I realized I couldn’t explain what was going on in a satisfying way, so I had to dig deeper into it before I could write about it. Perhaps there’s a good guideline hidden in there: don’t try to teach as much as you know?

I’m also fairly happy with my series on making Doom maps, though they meander into tangents a little more often. It’s hard to talk about something like Doom without meandering, since it’s a convoluted ecosystem that’s grown organically over the course of 24 years and has at least three ways of doing anything.


And finally there’s the book I’m trying to write, which is sort of about game development.

One of my biggest grievances with game development teaching in particular is how often it leaves out important touches. Very few guides will tell you how to make a title screen or menu, how to handle death, how to get a Mario-style variable jump height. They’ll show you how to build a clearly unfinished demo game, then leave you to your own devices.

I realized that the only reliable way to show how to build a game is to build a real game, then write about it. So the book is laid out as a narrative of how I wrote my first few games, complete with stumbling blocks and dead ends and tiny bits of polish.

I have no idea how well this will work, or whether recapping my own mistakes will be interesting or distracting for a beginner, but it ought to be an interesting experiment.

Secure API Access with Amazon Cognito Federated Identities, Amazon Cognito User Pools, and Amazon API Gateway

Post Syndicated from Ed Lima original https://aws.amazon.com/blogs/compute/secure-api-access-with-amazon-cognito-federated-identities-amazon-cognito-user-pools-and-amazon-api-gateway/

Ed Lima, Solutions Architect

 

Our identities are what define us as human beings. Philosophical discussions aside, it also applies to our day-to-day lives. For instance, I need my work badge to get access to my office building or my passport to travel overseas. My identity in this case is attached to my work badge or passport. As part of the system that checks my access, these documents or objects help define whether I have access to get into the office building or travel internationally.

This exact same concept can also be applied to cloud applications and APIs. To provide secure access to your application users, you define who can access the application resources and what kind of access can be granted. Access is based on identity controls that can confirm authentication (AuthN) and authorization (AuthZ), which are different concepts. According to Wikipedia:

 

The process of authorization is distinct from that of authentication. Whereas authentication is the process of verifying that “you are who you say you are,” authorization is the process of verifying that “you are permitted to do what you are trying to do.” This does not mean authorization presupposes authentication; an anonymous agent could be authorized to a limited action set.

Amazon Cognito lets you build, secure, and scale a solution that handles user management and authentication, and that syncs data across platforms and devices. In this post, I discuss the different ways that you can use Amazon Cognito to authenticate API calls to Amazon API Gateway and secure access to your own API resources.

 

Amazon Cognito Concepts

 

It’s important to understand that Amazon Cognito provides three different services:

  • Amazon Cognito Federated Identities (identity pools)
  • Amazon Cognito User Pools
  • Amazon Cognito Sync

Today, I discuss the use of the first two. One service doesn’t need the other to work; however, they can be configured to work together.
 

Amazon Cognito Federated Identities

 
To use Amazon Cognito Federated Identities in your application, create an identity pool. An identity pool is a store of user data specific to your account. It can be configured to require an identity provider (IdP) for user authentication, after you enter details such as app IDs or keys related to that specific provider.

After the user is validated, the provider sends an identity token to Amazon Cognito Federated Identities. In turn, Amazon Cognito Federated Identities contacts the AWS Security Token Service (AWS STS) to retrieve temporary AWS credentials based on a configured, authenticated IAM role linked to the identity pool. The role has appropriate IAM policies attached to it and uses these policies to provide access to other AWS services.
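
As a rough sketch of that token-for-credentials exchange, here is a hedged boto3 example. The identity pool ID and the Google ID token are placeholders, and the post itself doesn’t prescribe a particular SDK; the same flow applies from the mobile and JavaScript SDKs.

# A minimal sketch of the authenticated flow using boto3.
# The identity pool ID and the Google ID token are placeholders.
import boto3

cognito = boto3.client("cognito-identity", region_name="us-east-1")

logins = {
    # Provider name -> token returned by the IdP after the user signs in.
    "accounts.google.com": "<google-id-token>",
}

identity = cognito.get_id(
    IdentityPoolId="us-east-1:00000000-0000-0000-0000-000000000000",
    Logins=logins,
)

creds = cognito.get_credentials_for_identity(
    IdentityId=identity["IdentityId"],
    Logins=logins,
)["Credentials"]

# These temporary credentials map to the identity pool's authenticated IAM role,
# so any client built with them is limited to what that role's policies allow.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretKey"],
    aws_session_token=creds["SessionToken"],
)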

Amazon Cognito Federated Identities currently supports the IdPs listed in the following graphic.

 



Continue reading Secure API Access with Amazon Cognito Federated Identities, Amazon Cognito User Pools, and Amazon API Gateway

[$] Language summit lightning talks

Post Syndicated from jake original https://lwn.net/Articles/723823/rss

Over the course of the day, the 2017 Python Language Summit hosted a handful of lightning talks, several of which were worked into the dynamic schedule when an opportunity presented itself. They ranged from the traditional “less than five minutes” format to some that strayed well outside of that time frame—some generated a fair amount of discussion as well. Topics were all over the map: board elections, beta releases, Python as a security vulnerability, Jython, and more.

Storm Glass: simulate the weather at your desk

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/storm-glass/

Inspired by the tempescope, The Modern Inventor’s Storm Glass is a weather-simulating lamp that can recreate the weather of any location in the world, all thanks to the help of a Raspberry Pi Zero W.

The Modern Inventor's Storm Glass

Image c/o The Modern Inventor

The lamp uses the Weather Underground API, which allows the Raspberry Pi to access current and predicted weather conditions across the globe. Some may argue “Why do I need a recreation of the weather if I can look out my window?”, but I think the idea of observing tomorrow’s weather today, or keeping an eye on conditions in another location, say your favourite holiday destination, is pretty sweet.

Building a Storm Glass

The Modern Inventor, whose name I haven’t found out yet so I’ll call him TMI, designed and 3D printed the base and cap for the lamp. The glass bottle that sits between the two is one of those fancy mineral water bottles you’ve seen in the supermarket but never could justify buying before.

The base holds the Pi, as well as a speaker, a microphone, and various other components such as a Speaker Bonnet and NeoPixel Ring from Adafruit.

The Modern Inventor's Storm Glass

Image c/o The Modern Inventor

“The rain maker is a tiny 5V centrifuge pump I got online, which pumps water along some glass tubing and into the lid where the rain falls from”, TMI explains on his Instructables project page. “The cloud generator is a USB-powered ultrasonic diffuser/humidifier. I just pulled out the guts and got rid of the rest. Make sure to keep the electronics which create the ultrasonic signal that drives the diffuser.”

The Modern Inventor's Storm Glass

Image c/o The Modern Inventor

With the tech in place, TMI (yes, I do appreciate the irony of using TMI as a designator for someone about whom I lack information) used hot glue like his life depended on it, bringing the whole build together into one slick-looking lamp.

Coding the storm

TMI set up the Storm Glass to pull data about weather conditions in a designated location via the Weather Underground API and recreate these within the lamp. He also installed Alexa Voice Service in it, giving the lamp a secondary function as a home automation device.

The Modern Inventor's Storm Glass

Image c/o The Modern Inventor

Code for the Storm Glass, alongside a far more detailed explanation of the build process, can be found on TMI’s project page. He says the total cost of this make comes to less than $80.
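
For a flavour of what the weather-polling loop might look like, here is a minimal, hypothetical Python sketch. The Weather Underground endpoint shown has since been retired, and set_lamp_effect() stands in for whatever actually drives the NeoPixel ring, pump, and diffuser; none of this is TMI’s real code.

# A hedged sketch of the weather-polling half of a build like this, using the
# (since retired) Weather Underground conditions endpoint. The URL format,
# response fields, and the set_lamp_effect() helper are assumptions made
# purely for illustration.
import time

import requests

API_KEY = "<wunderground-api-key>"   # placeholder
LOCATION = "UK/Cambridge"            # placeholder

def fetch_conditions():
    url = "http://api.wunderground.com/api/{}/conditions/q/{}.json".format(
        API_KEY, LOCATION)
    return requests.get(url, timeout=10).json()["current_observation"]

def set_lamp_effect(weather):
    """Hypothetical helper: decide what the NeoPixels, pump, and mister do."""
    text = weather.get("weather", "").lower()
    if "rain" in text or "drizzle" in text:
        print("run pump: rain")
    elif "thunder" in text:
        print("flash NeoPixels: lightning")
    elif "fog" in text or "mist" in text or "cloud" in text:
        print("run diffuser: clouds")
    else:
        print("warm white glow: clear")

while True:
    set_lamp_effect(fetch_conditions())
    time.sleep(15 * 60)  # re-check every 15 minutes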

Create your own weather device

If you’d like to start using weather APIs to track conditions at home or abroad, we have a whole host of free Raspberry Pi resources for you to try your hand at: begin by learning how to fetch weather data using the RESTful API or using Scratch and OpenWeatherMap to create visual representations of weather across the globe. You could even create a ‘Dress for the weather’ indicator so you’re never caught without a coat, an umbrella, or sunscreen again!

However you use the weather in your digital making projects, we’d love to see what you’ve been up to in the comments below.

The post Storm Glass: simulate the weather at your desk appeared first on Raspberry Pi.

Weekly roundup: Flux capacity

Post Syndicated from Eevee original https://eev.ee/dev/2017/06/06/weekly-roundup-flux-capacity/

  • blog: I wrote about introspection, or more specifically, what it’s been like to gradually shift from “pure tech” to more creative work.

  • fox flux: I am trapped in animation hell, but making progress. I made a pretty cool new parallax background. Also took a short break from drawing to do some physics, and now I have critters that can stop at ledges.

  • idchoppers: I wrote a quick thing that counts texture usage in maps, by both count and area.

    Honestly, I’ve been avoiding idchoppers because getting the API right is going to be a good bit of tedious work, so if it’s going to go anywhere then I’ll need to devote some serious time to it. Not sure whether I will or not. Little cute tools like a texture count make for useful example cases for an API, though, so I guess this helps.

  • flora: I did some secret work for a not-yet-released thing. It is very good.

  • art: I drew a June avatar which I am super happy with!

Mostly I’ve been doing a lot of pixel art, and I should maybe take a break from it for a little while, because I have a lot of stuff to do this month. Stay tuned.

Torrents Help Researchers Worldwide to Study Babies’ Brains

Post Syndicated from Ernesto original https://torrentfreak.com/torrents-help-researchers-worldwide-to-study-babies-brains-170603/

One of the core pillars of academic research is sharing.

By letting other researchers know what you do, ideas are criticized, improved upon and extended. In today’s digital age, sharing is easier than ever before, especially with help from torrents.

One of the leading scientific projects that has adopted BitTorrent is the developing Human Connectome Project, or dHCP for short. The goal of the project is to map the brain wiring of developing babies in the wombs of their mothers.

To do so, a consortium of researchers with expertise ranging from computer science, to MRI physics and clinical medicine, has teamed up across three British institutions: Imperial College London, King’s College London and the University of Oxford.

The collected data is extremely valuable for the neuroscience community and the project has received mainstream press coverage and financial backing from the European Union Research Council. Not only to build the dataset, but also to share it with researchers around the globe. This is where BitTorrent comes in.

Sharing more than 150 GB of data with researchers all over the world can be quite a challenge. Regular HTTP downloads are not really up to the task, and many other transfer options have a high failure rate.

Baby brain scan (Credit: Developing Human Connectome Project)

This is why Jonathan Passerat-Palmbach, a Research Associate in the Department of Computing at Imperial College London, came up with the idea to embrace BitTorrent instead.

“For me, it was a no-brainer from day one that we couldn’t rely on plain old HTTP to make this dataset available. Our first pilot release is 150GB, and I expect the next ones to reach a couple of TB. Torrents seemed like the de facto solution to share this data with the world’s scientific community.” Passerat-Palmbach says.

The researchers opted to go for the Academic Torrents tracker, which specializes in sharing research data. A torrent with the first batch of images was made available there a few weeks ago.

“This initial release contains 3,629 files accounting for 167.20GB of data. While this figure might not appear extremely large at the moment, it will significantly grow as the project aims to make the data of 1,000 subjects available by the time it has completed.”

Torrent of the first dataset

The download numbers are nowhere in the region of an average Hollywood blockbuster, of course. Thus far the tracker has registered just 28 downloads. That said, as a superior and open file-transfer protocol, BitTorrent does aid in critical research that helps researchers to discover more about the development of conditions such as ADHD and autism.

Interestingly, the biggest challenges of implementing the torrent solution were not of a technical nature. Most time and effort went into assuring other team members that this was the right solution.

“I had to push for more than a year for the adoption of torrents within the consortium. While my colleagues could understand the potential of the approach and its technical inputs, they remained skeptical as to the feasibility to implement such a solution within an academic context and its reception by the world community.

“However, when the first dataset was put together, amounting to 150GB, it became obvious all the HTTP and FTP fallback plans would not fit our needs,” Passerat-Palmbach adds.

Baby brain scans (Credit: Developing Human Connectome Project)

When the consortium finally agreed that BitTorrent was an acceptable way to share the data, local IT staff at the university had to give their seal of approval. Imperial College London doesn’t allow torrent traffic to flow freely across the network, so an exception had to be made.

“Torrents are blocked across the wireless and VPN networks at Imperial. Getting an explicit firewall exception created for our seeding machine was not a walk in the park. It was the first time they were faced with such a situation and we were clearly told that it was not to become the rule.”

Then, finally, the data could be shared around the world.

While BitTorrent is probably the most efficient way to share large files, there were other proprietary solutions that could do the same. However, Passerat-Palmbach preferred not to force other researchers to install “proprietary black boxes” on their machines.

Torrents are free and open, which is more in line with the Open Access approach more academics take today.

Looking back, it certainly wasn’t a walk in the park to share the data via BitTorrent. Passerat-Palmbach was frequently confronted with the piracy stigma torrents carry among many of his peers, even among younger generations.

“Considering how hard it was to convince my colleagues within the project to actually share this dataset using torrents (‘isn’t it illegal?’ and other kinds of misconceptions…), I think there’s still a lot of work to do to demystify the use of torrents with the public.

“I was even surprised to see that these misconceptions spread out not only to more senior scientists but also to junior researchers who I was expecting to be more tech-aware,” Passerat-Palmbach adds.

That said, the hard work is done now, and in the months and years ahead the neuroscience community will have access to petabytes of important data, with help from BitTorrent. That is definitely worth the effort.

Finally, we thought it was fitting to end with Passerat-Palmbach’s “pledge to seed,” which he shared with his peers. Keep on sharing!


On the importance of seeding

Dear fellow scientist,

Thank you very much for the interest you are showing in the dHCP dataset!

Once you start downloading the dataset, you’ll notice that your torrent client mentions a sharing / seeding ratio. It means that as soon as you start downloading the dataset, you become part of our community of sharers and contribute to making the dataset available to other researchers all around the world!

There’s no reason to be scared! It’s perfectly legal as long as you’re allowed to have a copy of the dataset (that’s the bit you need to forward to your lab’s IT staff if they’re blocking your ports).

You’re actually providing a tremendous contribution to dHCP by spreading the data, so thank you again for that!

With your help, we can make sure this data remains available and can be downloaded relatively fast in the future. Over time, the dataset will grow and your contribution will become more and more important, so that each and every one of you can still obtain the data in the smoothest possible way.

We cannot do it without you. By seeding, you’re actually saying “cheers!” to your peers whom you downloaded your data from. So leave your client open and stay tuned!

All this is made possible thanks to the amazing folks at academictorrents and their infrastructure, so kudos academictorrents!

You can learn more about their project here and get some help to get started with torrent downloading here.

Jonathan Passerat-Palmbach

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Seven Tips for Using S3DistCp on Amazon EMR to Move Data Efficiently Between HDFS and Amazon S3

Post Syndicated from Illya Yalovyy original https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/

Have you ever needed to move a large amount of data between Amazon S3 and Hadoop Distributed File System (HDFS) but found that the data set was too large for a simple copy operation? EMR can help you with this. In addition to processing and analyzing petabytes of data, EMR can move large amounts of data.

In the Hadoop ecosystem, DistCp is often used to move data. DistCp provides a distributed copy capability built on top of a MapReduce framework. S3DistCp is an extension to DistCp that is optimized to work with S3 and that adds several useful features. In addition to moving data between HDFS and S3, S3DistCp is also a Swiss Army knife of file manipulations. In this post we’ll cover the following tips for using S3DistCp, starting with basic use cases and then moving to more advanced scenarios:

1. Copy or move files without transformation
2. Copy and change file compression on the fly
3. Copy files incrementally
4. Copy multiple folders in one job
5. Aggregate files based on a pattern
6. Upload files larger than 1 TB in size
7. Submit an S3DistCp step to an EMR cluster

1. Copy or move files without transformation

We’ve observed that customers often use S3DistCp to copy data from one storage location to another, whether S3 or HDFS. The syntax for this operation is simple and straightforward:

$ s3-dist-cp --src /data/incoming/hourly_table --dest s3://my-tables/incoming/hourly_table

The source location may contain extra files that we don’t necessarily want to copy. Here, we can use filters based on regular expressions to do things such as copying files with the .log extension only.

In this example, each hourly subfolder under /data/incoming/hourly_table contains the following files:

$ hadoop fs -ls /data/incoming/hourly_table/2017-02-01/03
Found 8 items
-rw-r--r--   1 hadoop hadoop     197850 2017-02-19 03:41 /data/incoming/hourly_table/2017-02-01/03/2017-02-01.03.25845.log
-rw-r--r--   1 hadoop hadoop     484006 2017-02-19 03:41 /data/incoming/hourly_table/2017-02-01/03/2017-02-01.03.32953.log
-rw-r--r--   1 hadoop hadoop     868522 2017-02-19 03:41 /data/incoming/hourly_table/2017-02-01/03/2017-02-01.03.62649.log
-rw-r--r--   1 hadoop hadoop     408072 2017-02-19 03:41 /data/incoming/hourly_table/2017-02-01/03/2017-02-01.03.64637.log
-rw-r--r--   1 hadoop hadoop    1031949 2017-02-19 03:41 /data/incoming/hourly_table/2017-02-01/03/2017-02-01.03.70767.log
-rw-r--r--   1 hadoop hadoop     368240 2017-02-19 03:41 /data/incoming/hourly_table/2017-02-01/03/2017-02-01.03.89910.log
-rw-r--r--   1 hadoop hadoop     437348 2017-02-19 03:41 /data/incoming/hourly_table/2017-02-01/03/2017-02-01.03.96053.log
-rw-r--r--   1 hadoop hadoop        800 2017-02-19 03:41 /data/incoming/hourly_table/2017-02-01/03/processing.meta

To copy only the required files, let’s use the --srcPattern option:

$ s3-dist-cp --src /data/incoming/hourly_table --dest s3://my-tables/incoming/hourly_table_filtered --srcPattern .*\.log

After the upload has finished successfully, let’s check the folder contents in the destination location to confirm only the files ending in .log were copied:

$ hadoop fs -ls s3://my-tables/incoming/hourly_table_filtered/2017-02-01/03
-rw-rw-rw-   1     197850 2017-02-19 22:56 s3://my-tables/incoming/hourly_table_filtered/2017-02-01/03/2017-02-01.03.25845.log
-rw-rw-rw-   1     484006 2017-02-19 22:56 s3://my-tables/incoming/hourly_table_filtered/2017-02-01/03/2017-02-01.03.32953.log
-rw-rw-rw-   1     868522 2017-02-19 22:56 s3://my-tables/incoming/hourly_table_filtered/2017-02-01/03/2017-02-01.03.62649.log
-rw-rw-rw-   1     408072 2017-02-19 22:56 s3://my-tables/incoming/hourly_table_filtered/2017-02-01/03/2017-02-01.03.64637.log
-rw-rw-rw-   1    1031949 2017-02-19 22:56 s3://my-tables/incoming/hourly_table_filtered/2017-02-01/03/2017-02-01.03.70767.log
-rw-rw-rw-   1     368240 2017-02-19 22:56 s3://my-tables/incoming/hourly_table_filtered/2017-02-01/03/2017-02-01.03.89910.log
-rw-rw-rw-   1     437348 2017-02-19 22:56 s3://my-tables/incoming/hourly_table_filtered/2017-02-01/03/2017-02-01.03.96053.log

Sometimes, data needs to be moved instead of copied. In this case, we can use the --deleteOnSuccess option. This option is similar to aws s3 mv, which you might have used previously with the AWS CLI. The files are first copied and then deleted from the source:

$ s3-dist-cp --src s3://my-tables/incoming/hourly_table --dest s3://my-tables/incoming/hourly_table_archive --deleteOnSuccess

After the preceding operation, the source location has only empty folders, and the target location contains all files.

$ hadoop fs -ls -R s3://my-tables/incoming/hourly_table/2017-02-01/
drwxrwxrwx   -          0 1970-01-01 00:00 s3://my-tables/incoming/hourly_table/2017-02-01/00
drwxrwxrwx   -          0 1970-01-01 00:00 s3://my-tables/incoming/hourly_table/2017-02-01/01
...
drwxrwxrwx   -          0 1970-01-01 00:00 s3://my-tables/incoming/hourly_table/2017-02-01/21
drwxrwxrwx   -          0 1970-01-01 00:00 s3://my-tables/incoming/hourly_table/2017-02-01/22


$ hadoop fs -ls s3://my-tables/incoming/hourly_table_archive/2017-02-01/01
-rw-rw-rw-   1     676756 2017-02-19 23:27 s3://my-tables/incoming/hourly_table_archive/2017-02-01/01/2017-02-01.01.27047.log
-rw-rw-rw-   1     780197 2017-02-19 23:27 s3://my-tables/incoming/hourly_table_archive/2017-02-01/01/2017-02-01.01.59789.log
-rw-rw-rw-   1    1041789 2017-02-19 23:27 s3://my-tables/incoming/hourly_table_archive/2017-02-01/01/2017-02-01.01.82293.log
-rw-rw-rw-   1        400 2017-02-19 23:27 s3://my-tables/incoming/hourly_table_archive/2017-02-01/01/processing.meta

The important thing to remember here is that, with the --deleteOnSuccess flag, S3DistCp deletes only files; it doesn’t delete parent folders, even when they are empty.
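If the leftover empty folders get in the way, they can be cleaned up in a separate step. A minimal sketch using a plain Hadoop filesystem command (the path reuses the example source location above; adjust it to your own layout, and double-check it before running a recursive delete):

$ hadoop fs -rm -r s3://my-tables/incoming/hourly_table/2017-02-01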

2. Copy and change file compression on the fly

Raw files often land in S3 or HDFS in an uncompressed text format. This format is suboptimal both for the cost of storage and for running analytics on that data. S3DistCp can help you efficiently store data and compress files on the fly with the --outputCodec option:

$ s3-dist-cp --src s3://my-tables/incoming/hourly_table_filtered --dest s3://my-tables/incoming/hourly_table_gz --outputCodec=gz

The current version of S3DistCp supports the codecs gzip, gz, lzo, lzop, and snappy, as well as the keywords none and keep (the default). These keywords have the following meanings:

  • “none” – Save the files uncompressed. If the files are compressed, S3DistCp decompresses them (an example appears at the end of this section).
  • “keep” – Don’t change the compression of the files; copy them as-is.

Let’s check the files in the target folder, which have now been compressed with the gz codec:

$ hadoop fs -ls s3://my-tables/incoming/hourly_table_gz/2017-02-01/01/
Found 3 items
-rw-rw-rw-   1     78756 2017-02-20 00:07 s3://my-tables/incoming/hourly_table_gz/2017-02-01/01/2017-02-01.01.27047.log.gz
-rw-rw-rw-   1     80197 2017-02-20 00:07 s3://my-tables/incoming/hourly_table_gz/2017-02-01/01/2017-02-01.01.59789.log.gz
-rw-rw-rw-   1    121178 2017-02-20 00:07 s3://my-tables/incoming/hourly_table_gz/2017-02-01/01/2017-02-01.01.82293.log.gz
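Going the other way also works. As a rough sketch, the none keyword can be used to decompress the gzipped files again (the hourly_table_plain destination is a made-up name for this example):

$ s3-dist-cp --src s3://my-tables/incoming/hourly_table_gz --dest s3://my-tables/incoming/hourly_table_plain --outputCodec=none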

3. Copy files incrementally

In real life, an upstream process drops files at some cadence; for instance, new files might be created every hour or every minute. The downstream process can be configured to pick them up on a different schedule.

Let’s say data lands on S3 and we want to process it on HDFS daily. Copying all files every time doesn’t scale very well. Fortunately, S3DistCp has a built-in solution for that.

For this solution, we use a manifest file. That file allows S3DistCp to keep track of copied files. Following is an example of the command:

$ s3-dist-cp --src s3://my-tables/incoming/hourly_table --dest s3://my-tables/processing/hourly_table --srcPattern .*\.log --outputManifest=manifest-2017-02-25.gz --previousManifest=s3://my-tables/processing/hourly_table/manifest-2017-02-24.gz

The command takes two manifest files as parameters, outputManifest and previousManifest. The first one contains a list of all copied files (old and new), and the second contains a list of files copied previously. This way, we can recreate the full history of operations and see what files were copied during each run:

$ hadoop fs -text s3://my-tables/processing/hourly_table/manifest-2017-02-24.gz > previous.lst
$ hadoop fs -text s3://my-tables/processing/hourly_table/manifest-2017-02-25.gz > current.lst
$ diff previous.lst current.lst
2548a2549,2550
> {"path":"s3://my-tables/processing/hourly_table/2017-02-25/00/2017-02-15.00.50958.log","baseName":"2017-02-25/00/2017-02-15.00.50958.log","srcDir":"s3://my-tables/processing/hourly_table","size":610308}
> {"path":"s3://my-tables/processing/hourly_table/2017-02-25/00/2017-02-25.00.93423.log","baseName":"2017-02-25/00/2017-02-25.00.93423.log","srcDir":"s3://my-tables/processing/hourly_table","size":178928}

S3DistCp first creates the manifest in the local file system, using the provided path (for example, /tmp/mymanifest.gz). When the copy operation finishes, it moves that manifest to the destination location.
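On the very first run there is no previous manifest to pass in, so one option is to simply omit --previousManifest and only write a new one. A minimal sketch, assuming the same example paths as above (the manifest date is arbitrary):

$ s3-dist-cp --src s3://my-tables/incoming/hourly_table --dest s3://my-tables/processing/hourly_table --srcPattern .*\.log --outputManifest=manifest-2017-02-23.gz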

4. Copy multiple folders in one job

Imagine that we need to copy several folders. Usually, we run as many copy jobs as there are folders that need to be copied. With S3DistCp, the copy can be done in one go. All we need to do is prepare a file with a list of prefixes and pass it to the tool as a parameter:

$ s3-dist-cp --src s3://my-tables/incoming/hourly_table_filtered --dest s3://my-tables/processing/sample_table --srcPrefixesFile file://${PWD}/folders.lst

In this case, the folders.lst file contains the following prefixes:

$ cat folders.lst
s3://my-tables/incoming/hourly_table_filtered/2017-02-10/11
s3://my-tables/incoming/hourly_table_filtered/2017-02-19/02
s3://my-tables/incoming/hourly_table_filtered/2017-02-23

As a result, the target location has only the requested subfolders:

$ hadoop fs -ls -R s3://my-tables/processing/sample_table
drwxrwxrwx   -          0 1970-01-01 00:00 s3://my-tables/processing/sample_table/2017-02-10
drwxrwxrwx   -          0 1970-01-01 00:00 s3://my-tables/processing/sample_table/2017-02-10/11
-rw-rw-rw-   1     139200 2017-02-24 05:59 s3://my-tables/processing/sample_table/2017-02-10/11/2017-02-10.11.12980.log
...
drwxrwxrwx   -          0 1970-01-01 00:00 s3://my-tables/processing/sample_table/2017-02-19
drwxrwxrwx   -          0 1970-01-01 00:00 s3://my-tables/processing/sample_table/2017-02-19/02
-rw-rw-rw-   1     702058 2017-02-24 05:59 s3://my-tables/processing/sample_table/2017-02-19/02/2017-02-19.02.19497.log
-rw-rw-rw-   1     265404 2017-02-24 05:59 s3://my-tables/processing/sample_table/2017-02-19/02/2017-02-19.02.26671.log
...
drwxrwxrwx   -          0 1970-01-01 00:00 s3://my-tables/processing/sample_table/2017-02-23
drwxrwxrwx   -          0 1970-01-01 00:00 s3://my-tables/processing/sample_table/2017-02-23/00
-rw-rw-rw-   1     310425 2017-02-24 05:59 s3://my-tables/processing/sample_table/2017-02-23/00/2017-02-23.00.10061.log
-rw-rw-rw-   1    1030397 2017-02-24 05:59 s3://my-tables/processing/sample_table/2017-02-23/00/2017-02-23.00.22664.log
...

5. Aggregate files based on a pattern

Hadoop is optimized for reading a small number of large files rather than many small files, whether from S3 or HDFS. You can use S3DistCp to aggregate small files into fewer large files of a size that you choose, which can optimize your analysis for both performance and cost.

In the following example, we combine small files into bigger files. We do so by using a regular expression with the --groupBy option.

$ s3-dist-cp --src /data/incoming/hourly_table --dest s3://my-tables/processing/daily_table --targetSize=10 --groupBy='.*/hourly_table/.*/(\d\d)/.*\.log'

Let’s take a look into the target folders and compare them to the corresponding source folders:

$ hadoop fs -ls /data/incoming/hourly_table/2017-02-22/05/
Found 8 items
-rw-r--r--   1 hadoop hadoop     289949 2017-02-19 06:07 /data/incoming/hourly_table/2017-02-22/05/2017-02-22.05.11125.log
-rw-r--r--   1 hadoop hadoop     407290 2017-02-19 06:07 /data/incoming/hourly_table/2017-02-22/05/2017-02-22.05.19596.log
-rw-r--r--   1 hadoop hadoop     253434 2017-02-19 06:07 /data/incoming/hourly_table/2017-02-22/05/2017-02-22.05.30135.log
-rw-r--r--   1 hadoop hadoop     590655 2017-02-19 06:07 /data/incoming/hourly_table/2017-02-22/05/2017-02-22.05.36531.log
-rw-r--r--   1 hadoop hadoop     762076 2017-02-19 06:07 /data/incoming/hourly_table/2017-02-22/05/2017-02-22.05.47822.log
-rw-r--r--   1 hadoop hadoop     489783 2017-02-19 06:07 /data/incoming/hourly_table/2017-02-22/05/2017-02-22.05.80518.log
-rw-r--r--   1 hadoop hadoop     205976 2017-02-19 06:07 /data/incoming/hourly_table/2017-02-22/05/2017-02-22.05.99127.log
-rw-r--r--   1 hadoop hadoop        800 2017-02-19 06:07 /data/incoming/hourly_table/2017-02-22/05/processing.meta

 

$ hadoop fs -ls s3://my-tables/processing/daily_table/2017-02-22/05/
Found 2 items
-rw-rw-rw-   1   10541944 2017-02-28 05:16 s3://my-tables/processing/daily_table/2017-02-22/05/054
-rw-rw-rw-   1   10511516 2017-02-28 05:16 s3://my-tables/processing/daily_table/2017-02-22/05/055

As you can see, seven data files were combined into two with a size close to the requested 10 MB. The *.meta file was filtered out because the --groupBy pattern works in a similar way to --srcPattern. We recommend keeping files larger than the default block size, which is 128 MB on EMR.

The name of the final file is composed of groups in the regular expression used in --groupBy plus some number to make the name unique. The pattern must have at least one group defined.

Let’s consider one more example. This time, we want the file name to be formed from three parts: year, month, and file extension (.log in this case). Here is an updated command:

$ s3-dist-cp --src /data/incoming/hourly_table --dest s3://my-tables/processing/daily_table_2017 --targetSize=10 --groupBy='.*/hourly_table/.*(2017-).*/(\d\d)/.*\.(log)'

Now we have final files named in a different way:

$ hadoop fs -ls s3://my-tables/processing/daily_table_2017/2017-02-22/05/
Found 2 items
-rw-rw-rw-   1   10541944 2017-02-28 05:16 s3://my-tables/processing/daily_table_2017/2017-02-22/05/2017-05log4
-rw-rw-rw-   1   10511516 2017-02-28 05:16 s3://my-tables/processing/daily_table_2017/2017-02-22/05/2017-05log5

As you can see, the names of the final files are a concatenation of the three groups from the regular expression: (2017-), (\d\d), and (log).

You might find that occasionally you get an error that looks like the following:

$ s3-dist-cp --src /data/incoming/hourly_table --dest s3://my-tables/processing/daily_table_2017 --targetSize=10 --groupBy='.*/hourly_table/.*(2018-).*/(\d\d)/.*\.(log)'
...
17/04/27 15:37:45 INFO S3DistCp.S3DistCp: Created 0 files to copy 0 files
... 
Exception in thread "main" java.lang.RuntimeException: Error running job
	at com.amazon.elasticmapreduce.S3DistCp.S3DistCp.run(S3DistCp.java:927)
	at com.amazon.elasticmapreduce.S3DistCp.S3DistCp.run(S3DistCp.java:705)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
	at com.amazon.elasticmapreduce.S3DistCp.Main.main(Main.java:22)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
…

In this case, the key information is contained in Created 0 files to copy 0 files. S3DistCp didn’t find any files to copy because the regular expression in the --groupBy option doesn’t match any files in the source location.

The reason for this issue can vary. For example, it might be a mistake in the specified pattern; in the preceding example, we don’t have any files for the year 2018. Another common cause is incorrect escaping of the pattern when submitting the S3DistCp command as a step, which is addressed later in this post.
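One way to catch this early is to sanity-check the pattern against the actual source listing before launching the job. A rough sketch (grep -E doesn’t support \d, so the character class is spelled out; your shell’s quoting rules may also differ from the job arguments):

$ hadoop fs -ls -R /data/incoming/hourly_table | grep -E '.*/hourly_table/.*(2017-).*/([0-9]{2})/.*\.(log)'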

6. Upload files larger than 1 TB in size

The default upload chunk size when doing an S3 multipart upload is 128 MB. When files are larger than 1 TB, the total number of parts can reach over 10,000. Such a large number of parts can make the job run for a very long time or even fail.

In this case, you can improve job performance by increasing the size of each part. In S3DistCp, you can do this by using the --multipartUploadChunkSize option.
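A quick back-of-the-envelope calculation shows why. Treating a 2 TB file as roughly 2,000,000 MB (a simplification for illustration):

$ echo $(( 2000000 / 128 ))    # parts at the default 128 MB chunk size, well over 10,000
15625
$ echo $(( 2000000 / 1000 ))   # parts at a 1000 MB chunk size
2000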

Let’s test how it works on several files about 200 GB in size. With the default part size, it takes about 84 minutes to copy them to S3 from HDFS.

We can increase the default part size to 1000 MB:

$ time s3-dist-cp --src /data/gb200 --dest s3://my-tables/data/S3DistCp/gb200_2 --multipartUploadChunkSize=1000
...
real    41m1.616s

The maximum part size is 5 GB. Keep in mind that larger parts have a higher chance of failing during upload and don’t necessarily speed up the process. Let’s run the same job with the maximum part size:

$ time s3-dist-cp --src /data/gb200 --dest s3://my-tables/data/S3DistCp/gb200_2 --multipartUploadChunkSize=5000
...
real    40m17.331s

7. Submit an S3DistCp step to an EMR cluster

You can run the S3DistCp tool in several ways. First, you can SSH to the master node and execute the command in a terminal window as we did in the preceding examples. This approach might be convenient for many use cases, but sometimes you might want to create a cluster that has some data on HDFS. You can do this by submitting a step directly in the AWS Management Console when creating a cluster.

In the console’s Add step dialog box, we can fill in the fields in the following way:

  • Step type: Custom JAR
  • Name: S3DistCp Step
  • JAR location: command-runner.jar
  • Arguments: s3-dist-cp --src s3://my-tables/incoming/hourly_table --dest /data/input/hourly_table --targetSize 10 --groupBy .*/hourly_table/.*(2017-).*/(\d\d)/.*\.(log)
  • Action on failure: Continue

Notice that we didn’t add quotation marks around our pattern. We needed quotation marks when we were using bash in the terminal window, but not here. The console takes care of escaping and transferring our arguments to the command on the cluster.

Another common use case is to run S3DistCp recurrently or in response to some event. We can always submit a new step to an existing cluster. The syntax here is slightly different than in the previous examples: we separate arguments with commas, and in the case of a complex pattern, we wrap the whole step definition in single quotation marks:

aws emr add-steps --cluster-id j-ABC123456789Z --steps 'Name=LoadData,Jar=command-runner.jar,ActionOnFailure=CONTINUE,Type=CUSTOM_JAR,Args=s3-dist-cp,--src,s3://my-tables/incoming/hourly_table,--dest,/data/input/hourly_table,--targetSize,10,--groupBy,.*/hourly_table/.*(2017-).*/(\d\d)/.*\.(log)'
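Once the step is submitted, its progress can be tracked from the same shell. A minimal sketch with the AWS CLI (the cluster ID is the placeholder from the example above):

$ aws emr list-steps --cluster-id j-ABC123456789Z --step-states PENDING RUNNING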

Summary

This post showed you the basics of how S3DistCp works and highlighted some of its most useful features. It covered how you can use S3DistCp to optimize for raw files of different sizes and also selectively copy different files between locations. We also looked at several options for using the tool from SSH, the AWS Management Console, and the AWS CLI.

If you have questions or suggestions, leave a message in the comments.


Next Steps

Take your new knowledge to the next level! Click on the post below and learn the top 10 tips to improve query performance in Amazon Athena.

Top 10 Performance Tuning Tips for Amazon Athena


About the Author

Illya Yalovyy is a Senior Software Development Engineer with Amazon Web Services. He works on cutting-edge features of EMR and is heavily involved in open source projects such as Apache Hive, Apache ZooKeeper, and Apache Sqoop. His spare time is completely dedicated to his children and family.