Tag Archives: schemas

AWS Summit New York – Summary of Announcements

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-summit-new-york-summary-of-announcements/

Whew – what a week! Tara, Randall, Ana, and I have been working around the clock to create blog posts for the announcements that we made at the AWS Summit in New York. Here’s a summary to help you to get started:

Amazon Macie – This new service helps you to discover, classify, and secure content at scale. Powered by machine learning and making use of Natural Language Processing (NLP), Macie looks for patterns and alerts you to suspicious behavior, and can help you with governance, compliance, and auditing. You can read Tara’s post to see how to put Macie to work; you select the buckets of interest, customize the classification settings, and review the results in the Macie Dashboard.

AWS Glue – Randall’s post (with deluxe animated GIFs) introduces you to this new extract, transform, and load (ETL) service. Glue is serverless and fully managed. As you can see from the post, Glue crawls your data, infers schemas, and generates ETL scripts in Python. You define jobs that move data from place to place, with a wide selection of transforms, each expressed as code and stored in human-readable form. Glue uses Development Endpoints and notebooks to provide you with a testing environment for the scripts you build. We also announced that Amazon Athena now integrates with AWS Glue, as do Apache Spark and Hive on Amazon EMR.

AWS Migration Hub – This new service will help you to migrate your application portfolio to AWS. My post outlines the major steps and shows you how the Migration Hub accelerates, tracks, and simplifies your migration effort. You can begin with a discovery step, or you can jump right in and migrate directly. Migration Hub integrates with tools from our migration partners and builds upon the Server Migration Service and the Database Migration Service.

CloudHSM Update – We made a major upgrade to AWS CloudHSM, making the benefits of hardware-based key management available to a wider audience. The service is offered on a pay-as-you-go basis, and is fully managed. It is open and standards compliant, with support for multiple APIs, programming languages, and cryptography extensions. CloudHSM is an integral part of AWS and can be accessed from the AWS Management Console, AWS Command Line Interface (CLI), and through API calls. Read my post to learn more and to see how to set up a CloudHSM cluster.

Managed Rules to Secure S3 Buckets – We added two new rules to AWS Config that will help you to secure your S3 buckets. The s3-bucket-public-write-prohibited rule identifies buckets that have public write access and the s3-bucket-public-read-prohibited rule identifies buckets that have global read access. As I noted in my post, you can run these rules in response to configuration changes or on a schedule. The rules make use of some leading-edge constraint solving techniques, as part of a larger effort to use automated formal reasoning about AWS.
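
If you want to set the same checks up programmatically rather than in the console, here is a minimal sketch using the AWS SDK for Java; the rule name below is a placeholder, and S3_BUCKET_PUBLIC_READ_PROHIBITED is the identifier of the AWS-managed rule.

    // Sketch: enable the s3-bucket-public-read-prohibited managed rule with the AWS SDK for Java.
    // The rule name is a hypothetical placeholder; the source identifier selects the AWS-managed rule.
    AmazonConfig configClient = AmazonConfigClientBuilder.standard()
            .withRegion(Regions.US_EAST_1)
            .build();

    ConfigRule rule = new ConfigRule()
            .withConfigRuleName("s3-public-read-check")   // hypothetical name
            .withSource(new Source()
                    .withOwner("AWS")
                    .withSourceIdentifier("S3_BUCKET_PUBLIC_READ_PROHIBITED"));

    configClient.putConfigRule(new PutConfigRuleRequest().withConfigRule(rule));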

CloudTrail for All Customers – Tara’s post revealed that AWS CloudTrail is now available and enabled by default for all AWS customers. As a bonus, Tara reviewed the principal benefits of CloudTrail and showed you how to review your event history and to deep-dive on a single event. She also showed you how to create a second trail, for use with CloudWatch Events.

Encryption of Data at Rest for EFS – When you create a new file system, you now have the option to select a key that will be used to encrypt the contents of the files on the file system. The encryption is done using an industry-standard AES-256 algorithm. My post shows you how to select a key and to verify that it is being used.
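
For readers who prefer code to the console, here is a minimal sketch of the same idea using the AWS SDK for Java; the creation token and KMS key ARN are placeholders.

    // Sketch: create an EFS file system encrypted with a specific KMS key (AWS SDK for Java).
    // The creation token and key ARN are placeholders.
    AmazonElasticFileSystem efs = AmazonElasticFileSystemClientBuilder.standard()
            .withRegion(Regions.US_EAST_1)
            .build();

    CreateFileSystemResult fileSystem = efs.createFileSystem(new CreateFileSystemRequest()
            .withCreationToken("my-encrypted-fs")
            .withEncrypted(true)
            .withKmsKeyId("arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"));

    System.out.println("Created file system: " + fileSystem.getFileSystemId());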

Watch the Keynote
My colleagues Adrian Cockcroft and Matt Wood talked about these services and others on stage, and also invited some AWS customers to share their stories. Here’s the video:

Jeff;

 

Launch – AWS Glue Now Generally Available

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/launch-aws-glue-now-generally-available/

Today we’re excited to announce the general availability of AWS Glue. Glue is a fully managed, serverless, and cloud-optimized extract, transform and load (ETL) service. Glue is different from other ETL services and platforms in a few very important ways.

First, Glue is “serverless” – you don’t need to provision or manage any resources and you only pay for resources when Glue is actively running. Second, Glue provides crawlers that can automatically detect and infer schemas from many data sources, data types, and across various types of partitions. It stores these generated schemas in a centralized Data Catalog for editing, versioning, querying, and analysis. Third, Glue can automatically generate ETL scripts (in Python!) to translate your data from your source formats to your target formats. Finally, Glue allows you to create development endpoints that allow your developers to use their favorite toolchains to construct their ETL scripts. Ok, let’s dive deep with an example.

In my job as a Developer Evangelist I spend a lot of time traveling and I thought it would be cool to play with some flight data. The Bureau of Transportation Statistics is kind enough to share all of this data for anyone to use here. We can easily download this data and put it in an Amazon Simple Storage Service (S3) bucket. This data will be the basis of our work today.

Crawlers

First, we need to create a Crawler for our flights data from S3. We’ll select Crawlers in the Glue console and follow the on-screen prompts from there. I’ll specify s3://crawler-public-us-east-1/flight/2016/csv/ as my first data source (we can add more later if needed). Next, we’ll create a database called flights and give our tables a prefix of flights as well.

The Crawler will go over our dataset, detect partitions based on the folder structure – in this case, the months of the year – detect the schema, and build a table. We could add additional data sources and jobs to our crawler, or create separate crawlers that push data into the same database, but for now let’s look at the autogenerated schema.
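
The console is the quickest way to do this, but the same crawler can also be defined through the Glue API. Here’s a minimal sketch using the AWS SDK for Java; the crawler name and IAM role ARN are placeholders, while the S3 path, database, and table prefix match the console steps above.

    // Sketch: define and start the flights crawler through the Glue API (AWS SDK for Java).
    // The crawler name and IAM role are placeholders.
    AWSGlue glue = AWSGlueClientBuilder.standard()
            .withRegion(Regions.US_EAST_1)
            .build();

    glue.createCrawler(new CreateCrawlerRequest()
            .withName("flights-crawler")                                   // hypothetical name
            .withRole("arn:aws:iam::111122223333:role/GlueCrawlerRole")    // placeholder role
            .withDatabaseName("flights")
            .withTablePrefix("flights")
            .withTargets(new CrawlerTargets()
                    .withS3Targets(new S3Target()
                            .withPath("s3://crawler-public-us-east-1/flight/2016/csv/"))));

    glue.startCrawler(new StartCrawlerRequest().withName("flights-crawler"));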

I’m going to make a quick schema change to year, moving it from BIGINT to INT. Then I can compare the two versions of the schema if needed.

Now that we know how to correctly parse this data let’s go ahead and do some transforms.

ETL Jobs

Now we’ll navigate to the Jobs subconsole and click Add Job. We’ll follow the prompts from there, giving our job a name, selecting a data source, and choosing an S3 location for temporary files. Next we add our target by specifying “Create tables in your data target” and we’ll specify an S3 location in Parquet format as our target.

After clicking Next, we’re at a screen showing the various mappings proposed by Glue. Now we can make manual column adjustments as needed – in this case we’re just going to use the X button to remove a few columns that we don’t need.

This brings us to my favorite part. This is what I absolutely love about Glue.

Glue generated a PySpark script to transform our data based on the information we’ve given it so far. On the left hand side we can see a diagram documenting the flow of the ETL job. On the top right we see a series of buttons that we can use to add annotated data sources and targets, transforms, spigots, and other features. This is the interface I get if I click on transform.

If we add any of these transforms or additional data sources, Glue will update the diagram on the left giving us a useful visualization of the flow of our data. We can also just write our own code into the console and have it run. We can add triggers to this job that fire on completion of another job, a schedule, or on demand. That way if we add more flight data we can reload this same data back into S3 in the format we need.
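
Triggers can also be managed programmatically. As a rough sketch with the AWS SDK for Java (the job and trigger names are placeholders for whatever you chose in the console), an on-demand run and a nightly schedule might look like this:

    // Sketch: start the ETL job on demand and attach a nightly scheduled trigger (AWS SDK for Java).
    // The job name and trigger name are placeholders.
    AWSGlue glue = AWSGlueClientBuilder.standard()
            .withRegion(Regions.US_EAST_1)
            .build();

    // Run the job once, right now
    String runId = glue.startJobRun(new StartJobRunRequest()
            .withJobName("flights-to-parquet")).getJobRunId();
    System.out.println("Started job run: " + runId);

    // Re-run the job every night at midnight UTC
    glue.createTrigger(new CreateTriggerRequest()
            .withName("nightly-flights-etl")
            .withType("SCHEDULED")
            .withSchedule("cron(0 0 * * ? *)")
            .withActions(new Action().withJobName("flights-to-parquet")));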

I could spend all day writing about the power and versatility of the jobs console but Glue still has more features I want to cover. So, while I might love the script editing console, I know many people prefer their own development environments, tools, and IDEs. Let’s figure out how we can use those with Glue.

Development Endpoints and Notebooks

A Development Endpoint is an environment used to develop and test our Glue scripts. If we navigate to “Dev endpoints” in the Glue console, we can click “Add endpoint” in the top right to get started. Next we’ll select a VPC and a security group that references itself, and then we wait for it to provision.


Once it’s provisioned, we can create an Apache Zeppelin notebook server by going to Actions and clicking Create notebook server. We give our instance an IAM role and make sure it has permissions to talk to our data sources. Then we can either SSH into the server or connect to the notebook to interactively develop our script.

Pricing and Documentation

You can see detailed pricing information here. Glue crawlers, ETL jobs, and development endpoints are all billed in Data Processing Unit hours (DPU-Hours), metered by the minute. Each DPU-Hour costs $0.44 in us-east-1. A single DPU provides 4 vCPUs and 16 GB of memory.
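To make that concrete: a job that runs with 10 DPUs for 30 minutes consumes 5 DPU-Hours, or about $2.20 at the us-east-1 rate.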

We’ve only covered about half of the features that Glue has so I want to encourage everyone who made it this far into the post to go read the documentation and service FAQs. Glue also has a rich and powerful API that allows you to do anything the console can do and more.
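
As a small illustration, here’s a sketch using the AWS SDK for Java that lists the tables the crawler added to the flights database in the Data Catalog:

    // Sketch: list the tables that the crawler created in the "flights" database (AWS SDK for Java).
    AWSGlue glue = AWSGlueClientBuilder.standard()
            .withRegion(Regions.US_EAST_1)
            .build();

    GetTablesResult tables = glue.getTables(new GetTablesRequest().withDatabaseName("flights"));
    for (Table table : tables.getTableList()) {
        System.out.println(table.getName());
    }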

We’re also releasing two new projects today. The aws-glue-libs provide a set of utilities for connecting to and talking with Glue. The aws-glue-samples repo contains a set of example jobs.

I hope you find that using Glue reduces the time it takes to start doing things with your data. Look for another post from me on AWS Glue soon because I can’t stop playing with this new service.
Randall

Register for and Attend This Free April 27 Tech Talk—Deep Dive on Amazon Cloud Directory

Post Syndicated from Craig Liebendorfer original https://aws.amazon.com/blogs/security/register-for-and-attend-this-april-27-tech-talk-deep-dive-on-amazon-cloud-directory/

AWS webinars logo

As part of the AWS Monthly Online Tech Talks series, AWS will present Deep Dive on Amazon Cloud Directory on Thursday, April 27. This tech talk will start at noon and end at 1:00 P.M. Pacific Time.

AWS Cloud Directory Expert Quint Van Deman will show you how Amazon Cloud Directory enables you to build flexible cloud-native directories for organizing hierarchies of data along multiple dimensions. Using Cloud Directory, you can easily build organizational charts, device registries, course catalogs, and network configurations with multiple hierarchies. For example, you can build an organizational chart with one hierarchy based on reporting structure, a second hierarchy based on physical location, and a third based on cost center.

You also will learn:

  • About the benefits and features of Cloud Directory.
  • The advantages of using Cloud Directory over traditional directory solutions.
  • How to efficiently organize hierarchies of data across multiple dimensions.
  • How to create and extend Cloud Directory schemas.
  • How to search your directory using strongly consistent and eventually consistent search APIs.

This tech talk is free. Register today.

– Craig

How to Create an Organizational Chart with Separate Hierarchies by Using Amazon Cloud Directory

Post Syndicated from Srikanth Mandadi original https://aws.amazon.com/blogs/security/how-to-create-an-organizational-chart-with-separate-hierarchies-by-using-amazon-cloud-directory/

Amazon Cloud Directory enables you to create directories for a variety of use cases, such as organizational charts, course catalogs, and device registries. Cloud Directory offers you the flexibility to create directories with hierarchies that span multiple dimensions. For example, you can create an organizational chart that you can navigate through separate hierarchies for reporting structure, location, and cost center.

In this blog post, I show how to use Cloud Directory APIs to create an organizational chart with two separate hierarchies in a single directory. I also show how to navigate the hierarchies and retrieve data. I use the Java SDK for all the sample code in this post, but you can use other language SDKs or the AWS CLI.

Define a schema

The first step in using Cloud Directory is to define a schema, which describes the data that will be stored in the directory that you will create later in this post. In this example, I define the schema by providing a JSON document. The schema has two facets: Employee and Group. I constrain the attributes within these facets by using various rules provided by Cloud Directory. For example, I specify that the Name attribute is of type STRING and must have a minimum length of 3 characters and maximum length of 100 characters. Similarly, I specify that the Status attribute is of type STRING and the value of this attribute must have one of the following three values: ACTIVE, INACTIVE, or TERMINATED. Having Cloud Directory handle these constraints means that I do not need to handle the validation of these constraints in my code, and it also lets multiple applications share the data in my directory without violating these constraints.

I also specify that the objectType of Employee is a LEAF_NODE. Therefore, employee objects cannot have any children, but can have multiple parents. The objectType of Group is NODE, which means group objects can have children, but they can only have one parent object. In the next section, I show you how to create a directory with this schema by using some sample Java code. Save the following JSON document to a file and provide the path to the file in the code for creating the schema in the next section.

{
  "facets" : {
    "Employee" : {
      "facetAttributes" : {
        "Name" : {
          "attributeDefinition" : {
            "attributeType" : "STRING",
            "isImmutable" : false,
            "attributeRules" : {
              "NameLengthRule" : {
                "parameters" : {
                  "min" : "3",
                  "max" : "100"
                },
                "ruleType": "STRING_LENGTH"
              }
            }
          },
          "requiredBehavior" : "REQUIRED_ALWAYS"
        },
        "EmailAddress" : {
          "attributeDefinition" : {
            "attributeType" : "STRING",
            "isImmutable" : true,
            "attributeRules" : {
              "NameLengthRule" : {
                "parameters" : {
                  "min" : "3",
                  "max" : "100"
                },
                "ruleType": "STRING_LENGTH"
              }
            }
          },
          "requiredBehavior" : "REQUIRED_ALWAYS"
        },
        "Status" : {
          "attributeDefinition" : {
            "attributeType" : "STRING",
            "isImmutable" : true,
            "attributeRules" : {
              "rule1" : {
                "parameters" : {
                  "allowedValues" : "ACTIVE, INACTIVE, TERMINATED"
                },
                "ruleType": "STRING_FROM_SET"
              }
            }
          },
          "requiredBehavior" : "REQUIRED_ALWAYS"
        }
      },
      "objectType" : "LEAF_NODE"
    },
    "Group" : {
      "facetAttributes" : {
        "Name" : {
          "attributeDefinition" : {
            "attributeType" : "STRING",
            "isImmutable" : true
          },
          "requiredBehavior" : "REQUIRED_ALWAYS"
        }
      },
      "objectType" : "NODE"
    }
  }
}

Create and publish the schema

Similar to other AWS services, I have to create the client for Cloud Directory to call the service APIs. To create a client, I use the following Java code.

    AWSCredentialsProvider credentials = null;
    try {
        credentials = new ProfileCredentialsProvider("default");
    } catch (Exception e) {
        throw new AmazonClientException(
            "Cannot load the credentials from the credential profiles file. " +
            "Please make sure that your credentials file is at the correct " +
            "location, and is in valid format.",
            e);
    }
    AmazonCloudDirectory client = AmazonCloudDirectoryClientBuilder.standard()
            .withRegion(Regions.US_EAST_1)
            .withCredentials(credentials)
            .build(); 

Now, I am ready to create the schema that I defined in the JSON file earlier in the post. When I create the schema, it is in the Development state. A schema in Cloud Directory can be in the Development, Published, or Applied state. When the schema is in the Development state, I can make more changes to the schema. In this case, however, I don’t want to make additional changes. Therefore, I will just publish the schema, which makes it available for creating directories (you cannot modify a schema in the Published state). I discuss the Applied state for schemas in the next section. In the following code, change the jsonFilePath variable to the file location where you saved the JSON schema in the previous step.

    //Read the JSON schema content from the file. 
    String jsonFilePath = <Provide the location of the json schema file here>;
    String schemaDocument;
    try
    {
        schemaDocument = new String(Files.readAllBytes(Paths.get(jsonFilePath)));
    }
    catch(IOException e)
    {
        throw new RuntimeException(e);
    }
    
    //Create an empty schema with a schema name. The schema name needs to be unique
    //within an AWS account.    
    CreateSchemaRequest createSchemaRequest = new CreateSchemaRequest()
        .withName("EmployeeSchema");
    String developmentSchemaArn =  client.createSchema(createSchemaRequest).getSchemaArn();    
    
    //Load the previously defined JSON into the empty schema that was just created
    PutSchemaFromJsonRequest putSchemaRequest = new PutSchemaFromJsonRequest()
           .withDocument(schemaDocument)
           .withSchemaArn(developmentSchemaArn);
    PutSchemaFromJsonResult putSchemaResult =  client.putSchemaFromJson(putSchemaRequest);

    //No more changes needed for schema so publish the schema
    PublishSchemaRequest publishSchemaRequest = new PublishSchemaRequest()
        .withDevelopmentSchemaArn(developmentSchemaArn)
        .withVersion("1.0");
    String publishedSchemaArn =  client.publishSchema(publishSchemaRequest).getPublishedSchemaArn();

Create a directory by using the published schema

I am now ready to create a directory by using the schema I just published. When I create a directory, Cloud Directory copies the published schema to the newly created directory. The schema copied to this directory is in the Applied state, which means if I had a scenario in which a schema attached to a particular directory needed to be changed, I could make changes to the schema that is applied to that specific directory.

The following code creates the directory and receives the Applied schema ARN and directory ARN. This Applied schema ARN is useful if I need to make changes to the schema applied to this directory. The directory ARN will be used in all subsequent operations associated with the directory. Cloud Directory will use the directory ARN to identify the directory associated with incoming requests because a single customer can create multiple directories.

    //Create a directory using the published schema. Specify a directory name, which must be unique within an account.
    CreateDirectoryRequest createDirectoryRequest = new CreateDirectoryRequest()
        .withName("EmployeeDirectory")
        .withSchemaArn(publishedSchemaArn);
    CreateDirectoryResult createDirectoryResult =  client.createDirectory(createDirectoryRequest);
    String directoryArn = createDirectoryResult.getDirectoryArn();
    String appliedSchemaArn = createDirectoryResult.getAppliedSchemaArn();

How hierarchies are stored in a directory

The organizational chart I want to create has a simple hierarchy, as shown in the following diagram. Anna belongs to both the ITStaff and Managers groups. This example demonstrates a capability of Cloud Directory that enables me to build multiple hierarchies in a single directory. Each hierarchy can have its own structure, and leaf nodes can belong to more than one hierarchy because leaf nodes can have more than one parent.

Being able to create multiple hierarchies within a single directory gives me some flexibility in how I organize my employees. For example, I can create a hierarchy representing departments in my organization and add employees to their respective departments, as illustrated in the following diagram. I can create another hierarchy representing geographic locations and add employees to the geographic location where they work. The first step in creating this hierarchy is to create the ITStaff and Managers group objects, which is what I do in the next section.

Hierarchy diagram

Create group objects

I will now create the data representing my organizational chart in the directory that I created. The following code creates the ITStaff and Managers group objects, which are created under the root node of the directory.

    for (String groupName : Arrays.asList("ITStaff", "Managers")) {
        CreateObjectRequest request = new CreateObjectRequest()
            .withDirectoryArn(directoryArn)
            // The parent of the object we are creating. We are rooting the group nodes
            // under root object. The root object exists in all directories and the path
            // to the root node is always "/".
            .withParentReference(new ObjectReference().withSelector("/"))
            // The name attached to the link between the parent and the child objects.
            .withLinkName(groupName)
            .withSchemaFacets(new SchemaFacet()
                .withSchemaArn(appliedSchemaArn)
                .withFacetName("Group"))
            // We specify the attributes to attach to this object.
            .withObjectAttributeList(new AttributeKeyAndValue()
                .withKey(new AttributeKey()
                    // Name attribute for the group
                    .withSchemaArn(appliedSchemaArn)
                    .withFacetName("Group")
                    .withName("Name"))
                // We provide the attribute value. The type used here must match the type defined in the schema.
                .withValue(new TypedAttributeValue().withStringValue(groupName)));
        client.createObject(request);
    }

Create employee objects

The group objects are now in the directory. Next, I create employee objects for Anna and Bob under the ITStaff group. The following Java code creates the Anna object. Creating the Bob object is similar. When creating the Bob object, I provide different attribute values for Name, EmailAddress, and the like.

    CreateObjectRequest createAnna = new CreateObjectRequest()
                .withDirectoryArn(directoryArn)
                .withLinkName("Anna")
                .withParentReference(new ObjectReference().withSelector("/ITStaff"))
                .withSchemaFacets(new SchemaFacet()
                        .withSchemaArn(appliedSchemaArn)
                        .withFacetName("Employee"))
                .withObjectAttributeList(new AttributeKeyAndValue()
                        .withKey(new AttributeKey()
                                // Name attribute from employee facet
                                .withSchemaArn(appliedSchemaArn)
                                .withFacetName("Employee")
                                .withName("Name"))
                        .withValue(new TypedAttributeValue().withStringValue("Anna")),
                        new AttributeKeyAndValue()
                        .withKey(new AttributeKey()
                                // EmailAddress attribute from employee facet
                                .withSchemaArn(appliedSchemaArn)
                                .withFacetName("Employee")
                                .withName("EmailAddress"))
                        .withValue(new TypedAttributeValue().withStringValue("[email protected]")),
                        new AttributeKeyAndValue()
                        .withKey(new AttributeKey()
                                 // Status attribute from employee facet
                                .withSchemaArn(appliedSchemaArn)
                                .withFacetName("Employee")
                                .withName("Status"))
                        .withValue(new TypedAttributeValue().withStringValue("ACTIVE")));
     // CreateObject provides the object identifier of the object that was created. An object identifier
     // is a globally unique, immutable identifier assigned to every object.
     String annasObjectId = client.createObject(createAnna).getObjectIdentifier();

Both the Bob and Anna objects are created under ITStaff, but Anna is also a manager and needs to be added under the Managers group. The following code does just that.

   AttachObjectRequest makeAnnaAManager = new AttachObjectRequest()
           .withDirectoryArn(directoryArn)
           .withLinkName("Anna")
           // Provide the parent object that Anna needs to be attached to using the path to the Managers object
           .withParentReference(new ObjectReference().withSelector("/Managers"))
           // Here we use the object identifier syntax to specify Anna's node. We could have used the
           // following path instead: /ITStaff/Anna. Both are equivalent.
           .withChildReference(new ObjectReference().withSelector("$" + annasObjectId));
   client.attachObject(makeAnnaAManager);

Retrieving objects in the directory

Now that I have populated my directory, I want to find a specific object. I can do that either by using the path to the object or by using the object identifier. I use the getObjectInformation API to first get the Anna object by specifying its path, and then I print the object identifiers of all the parents of the Anna object. I should see two parent object identifiers because Anna has both ITStaff and Managers as parents. Here I am listing parents; however, I can also perform other operations on the object, such as listing its children or its attributes (see the sketch after the following code). Using listChildren and listObjectAttributes, I can retrieve all the information stored in my directory.

    // First get the object for Anna
    GetObjectInformationRequest annaObjectRequest = new GetObjectInformationRequest()
           .withObjectReference(new ObjectReference().withSelector("/Managers/Anna"))
           .withDirectoryArn(directoryArn);
    GetObjectInformationResult annaObjectResult =  client.getObjectInformation(annaObjectRequest);
    // List parent objects for Anna to give her groups
    ListObjectParentsRequest annaGroupsRequest = new ListObjectParentsRequest()
           .withDirectoryArn(directoryArn)
           .withObjectReference(new ObjectReference().withSelector("$" + annaObjectResult.getObjectIdentifier()));
    ListObjectParentsResult annaGroupsResult =  client.listObjectParents(annaGroupsRequest);
    for(Map.Entry<String, String> entry : annaGroupsResult.getParents().entrySet())
    {
       System.out.println("Parent Object Identifier:" + entry.getKey());
       System.out.println("Link Name:" + entry.getValue());
    } 
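
As a quick sketch of the attribute side mentioned above, the following code reads the attributes stored on the Anna object; the output reflects whatever values you supplied when creating the object.

    // Sketch: read the attributes stored on the Anna object using listObjectAttributes.
    ListObjectAttributesRequest annaAttributesRequest = new ListObjectAttributesRequest()
           .withDirectoryArn(directoryArn)
           .withObjectReference(new ObjectReference().withSelector("/ITStaff/Anna"));
    ListObjectAttributesResult annaAttributesResult = client.listObjectAttributes(annaAttributesRequest);
    for(AttributeKeyAndValue attribute : annaAttributesResult.getAttributes())
    {
       System.out.println(attribute.getKey().getName() + " : " + attribute.getValue().getStringValue());
    }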

Summary

In this post, I showed how to use Cloud Directory APIs to create an organizational chart with multiple hierarchies. Keep in mind that Cloud Directory offers additional functionality such as batch operations and indexing that I have not covered in this blog post. For more information, see the Amazon Cloud Directory API Reference.

If you have questions or suggestions about this blog post, start a new thread on the Directory Service forum.

– Srikanth