All posts by Songzhi Liu

Build a Multi-Tenant Amazon EMR Cluster with Kerberos, Microsoft Active Directory Integration and EMRFS Authorization

Post Syndicated from Songzhi Liu original https://aws.amazon.com/blogs/big-data/build-a-multi-tenant-amazon-emr-cluster-with-kerberos-microsoft-active-directory-integration-and-emrfs-authorization/

One of the challenges faced by our customers—especially those in highly regulated industries—is balancing the need for security with flexibility. In this post, we cover how to enable multi-tenancy and increase security by using EMRFS (EMR File System) authorization, the Amazon S3 storage-level authorization on Amazon EMR.

Amazon EMR is an easy, fast, and scalable analytics platform enabling large-scale data processing. EMRFS authorization provides Amazon S3 storage-level authorization by configuring EMRFS with multiple IAM roles. With this functionality enabled, different users and groups can share the same cluster and assume their own IAM roles respectively.

Simply put, on Amazon EMR, we can now have an Amazon EC2 role per user assumed at run time instead of one general EC2 role at the cluster level. When a user tries to access Amazon S3 resources, Amazon EMR evaluates the request against a predefined mapping list in the EMRFS configuration and picks up the right role for that user.

In this post, we will discuss what EMRFS authorization is (Amazon S3 storage-level access control) and show how to configure the role mappings with detailed examples. You will then have the desired permissions in a multi-tenant environment. We also demo Amazon S3 access from the HDFS command line, Apache Hive on Hue, and Apache Spark.

EMRFS authorization for Amazon S3

There are two prerequisites for using this feature:

  1. Users must be authenticated, because EMRFS needs to map the current user/group/prefix to a predefined user/group/prefix. There are several authentication options. In this post, we launch a Kerberos-enabled cluster that manages the Key Distribution Center (KDC) on the master node, and enable a one-way trust from the KDC to a Microsoft Active Directory domain.
  2. The application must support accessing Amazon S3 via EMRFS. Applications that have their own S3FileSystem APIs (for example, Presto) are not supported at this time.

EMRFS supports three types of mapping entries: user, group, and Amazon S3 prefix. Let’s use an example to show how this works.

Assume that you have the following three identities in your organization, defined in Active Directory: an admin user (admin1), a data engineering group (grp_data_engineering) containing users such as dataeng1, and a data science group (grp_data_science) containing users such as datascientist1.

To enable all these groups and users to share the EMR cluster, you need to define the following IAM roles: emrfs_auth_user_role_admin_user for the admin user, emrfs_auth_group_role_data_eng for the data engineering group, emrfs_auth_group_role_data_sci for the data science group, and emrfs_auth_prefix_role_default_s3_prefix for the default Amazon S3 prefix.

In this case, you create a separate Amazon EC2 role that doesn’t give any permission to Amazon S3. Let’s call it the base role (the EC2 role attached to the EMR cluster), which in this example is named EMR_EC2_RestrictedRole. Then, you define all the Amazon S3 permissions for each specific user or group in their own roles. The restricted role serves as the fallback when the user doesn’t match any user or group mapping and isn’t trying to access any of the Amazon S3 prefixes defined in the list.

Important: For all other roles, like emrfs_auth_group_role_data_eng, you need to add the base role (EMR_EC2_RestrictedRole) as a trusted entity so that the base role can assume them. See the following example:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::511586466501:role/EMR_EC2_RestrictedRole"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

The following is an example policy for the admin user role (emrfs_auth_user_role_admin_user):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": "*"
        }
    ]
}

We are assuming the admin user has access to all buckets in this example.

The following is an example policy for the data science group role (emrfs_auth_group_role_data_sci):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::emrfs-auth-data-science-bucket-demo/*",
                "arn:aws:s3:::emrfs-auth-data-science-bucket-demo"
            ],
            "Action": [
                "s3:*"
            ]
        }
    ]
}

This role grants all Amazon S3 permissions to the emrfs-auth-data-science-bucket-demo bucket and all the objects in it. Similarly, the policy for the role emrfs_auth_group_role_data_eng is shown below:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::emrfs-auth-data-engineering-bucket-demo/*",
                "arn:aws:s3:::emrfs-auth-data-engineering-bucket-demo"
            ],
            "Action": [
                "s3:*"
            ]
        }
    ]
}

Example role mappings configuration

To configure EMRFS authorization, you use an EMR security configuration. Here is the role mapping configuration that we use in this post.
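For example, the role mapping section of the security configuration might look like the following JSON sketch, using the roles and buckets from this example and the same account ID as in the trust policy shown earlier (adjust the account ID and identifiers for your environment):

{
  "AuthorizationConfiguration": {
    "EmrFsConfiguration": {
      "RoleMappings": [
        {
          "Role": "arn:aws:iam::511586466501:role/emrfs_auth_user_role_admin_user",
          "IdentifierType": "User",
          "Identifiers": [ "admin1" ]
        },
        {
          "Role": "arn:aws:iam::511586466501:role/emrfs_auth_group_role_data_sci",
          "IdentifierType": "Group",
          "Identifiers": [ "grp_data_science" ]
        },
        {
          "Role": "arn:aws:iam::511586466501:role/emrfs_auth_group_role_data_eng",
          "IdentifierType": "Group",
          "Identifiers": [ "grp_data_engineering" ]
        },
        {
          "Role": "arn:aws:iam::511586466501:role/emrfs_auth_prefix_role_default_s3_prefix",
          "IdentifierType": "Prefix",
          "Identifiers": [ "s3://emrfs-auth-default-bucket-demo/" ]
        }
      ]
    }
  }
}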

Consider the following scenario.

First, the admin user admin1 tries to log in and run a command to access Amazon S3 data through EMRFS. The first role emrfs_auth_user_role_admin_user on the mapping list, which is a user role, is mapped and picked up. Then admin1 has access to the Amazon S3 locations that are defined in this role.

Then a user from the data engineer group (grp_data_engineering) tries to access a data bucket to run some jobs. When EMRFS sees that the user is a member of the grp_data_engineering group, the group role emrfs_auth_group_role_data_eng is assumed, and the user has proper access to Amazon S3 that is defined in the emrfs_auth_group_role_data_eng role.

Next comes a third user, who is not an admin and doesn’t belong to any of the groups. After failing evaluation against the top three entries, EMRFS evaluates whether the user is trying to access a certain Amazon S3 prefix defined in the last mapping entry. This type of mapping entry is called the prefix type. If the user is trying to access s3://emrfs-auth-default-bucket-demo/, then the prefix mapping is in effect, and the prefix role emrfs_auth_prefix_role_default_s3_prefix is assumed.
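The policy document for the prefix role isn’t shown above; a minimal sketch, assuming it only needs to grant access to the default bucket, might look like this:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::emrfs-auth-default-bucket-demo/*",
                "arn:aws:s3:::emrfs-auth-default-bucket-demo"
            ],
            "Action": [
                "s3:*"
            ]
        }
    ]
}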

If the user is not trying to access any of the Amazon S3 paths defined in the list (that is, the user fails the evaluation of all the entries), the user only has the permissions defined in EMR_EC2_RestrictedRole. This role is assumed by the EC2 instances in the cluster.

In this process, the mappings are evaluated in the order in which they are defined. The first role that matches is assumed, and the rest of the list is skipped.

Setting up an EMR cluster and mapping Active Directory users and groups

Now that we know how EMRFS authorization role mapping works, the next thing we need to think about is how we can use this feature in an easy and manageable way.

Active Directory setup

Many customers manage their users and groups using Microsoft Active Directory or other tools like OpenLDAP. In this post, we create the Active Directory on an Amazon EC2 instance running Windows Server and create the users and groups we will be using in the example below. After setting up Active Directory, we use the Amazon EMR Kerberos auto-join capability to establish a one-way trust from the KDC running on the EMR master node to the Active Directory domain on the EC2 instance. You can use your own directory service as long as it supports LDAP (Lightweight Directory Access Protocol).

To create and join Active Directory to Amazon EMR, follow the steps in the blog post Use Kerberos Authentication to Integrate Amazon EMR with Microsoft Active Directory.

After configuring Active Directory, you can create all the users and groups using the Active Directory tools and add users to the appropriate groups. In this example, we created the users admin1, dataeng1, and datascientist1 and the groups grp_data_engineering and grp_data_science, and then added the users to the right groups.

Join the EMR cluster to an Active Directory domain

For clusters with Kerberos, Amazon EMR now supports automated Active Directory domain joins. You can use the security configuration to configure the one-way trust from the KDC to the Active Directory domain. You also configure the EMRFS role mappings in the same security configuration.

The EMR security configuration contains both the Kerberos settings, with the trusted Active Directory domain EMRKRB.TEST.COM, and the EMRFS role mappings that we discussed earlier. The following is an example AWS CLI command that you can run to create it.
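A sketch of creating the security configuration from the AWS CLI, assuming the combined JSON (the Kerberos settings plus the AuthorizationConfiguration block sketched earlier) is saved in a hypothetical file named security-config.json:

aws emr create-security-configuration --name MyKerberosConfig \
--security-configuration file://security-config.json

The Kerberos portion of that file might look like the following, assuming the domain controller for emrkrb.test.com also serves as the admin server and KDC (adjust these values for your environment):

{
  "AuthenticationConfiguration": {
    "KerberosConfiguration": {
      "Provider": "ClusterDedicatedKdc",
      "ClusterDedicatedKdcConfiguration": {
        "TicketLifetimeInHours": 24,
        "CrossRealmTrustConfiguration": {
          "Realm": "EMRKRB.TEST.COM",
          "Domain": "emrkrb.test.com",
          "AdminServer": "emrkrb.test.com",
          "KdcServer": "emrkrb.test.com"
        }
      }
    }
  }
}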

Launching the EMR cluster and running the tests

Now you have configured Kerberos and EMRFS authorization for Amazon S3.

Additionally, you need to configure Hue with Active Directory using the Amazon EMR configuration API in order to log in using the AD users created before. The following is an example of Hue AD configuration.

[
  {
    "Classification":"hue-ini",
    "Properties":{

    },
    "Configurations":[
      {
        "Classification":"desktop",
        "Properties":{

        },
        "Configurations":[
          {
            "Classification":"ldap",
            "Properties":{

            },
            "Configurations":[
              {
                "Classification":"ldap_servers",
                "Properties":{

                },
                "Configurations":[
                  {
                    "Classification":"AWS",
                    "Properties":{
                      "base_dn":"DC=emrkrb,DC=test,DC=com",
                      "ldap_url":"ldap://emrkrb.test.com",
                      "search_bind_authentication":"false",
                      "bind_dn":"CN=adjoiner,CN=users,DC=emrkrb,DC=test,DC=com",
                      "bind_password":"Abc123456",
                      "create_users_on_login":"true",
                      "nt_domain":"emrkrb.test.com"
                    },
                    "Configurations":[

                    ]
                  }
                ]
              }
            ]
          },
          {
            "Classification":"auth",
            "Properties":{
              "backend":"desktop.auth.backend.LdapBackend"
            },
            "Configurations":[

            ]
          }
        ]
      }
    ]
  }
]

Note: In the preceding configuration JSON, change the values as required before pasting it into the software settings section in the Amazon EMR console.

Now let’s use this configuration and the security configuration you created before to launch the cluster.

In the Amazon EMR console, choose Create cluster. Then choose Go to advanced options. On the Step 1: Software and Steps page, under Edit software settings (optional), paste the configuration in the box.

The rest of the setup is the same as an ordinary cluster setup, except for the security options. In Step 4: Security, under Permissions, choose Custom, and then choose the EMR_EC2_RestrictedRole that you created before.

Choose the appropriate subnets (these should meet the base requirements for a successful Active Directory join; see the Amazon EMR Management Guide for more details), and choose the appropriate security groups to make sure that the cluster can talk to Active Directory. Choose an EC2 key pair so that you can log in and configure the cluster.

Most importantly, choose the security configuration that you created earlier to enable Kerberos and EMRFS authorization for Amazon S3.

You can use the following AWS CLI command to create a cluster.

aws emr create-cluster --name "TestEMRFSAuthorization" \
--release-label emr-5.10.0 \
--instance-type m3.xlarge \
--instance-count 3 \
--ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,KeyName=MyEC2KeyPair \
--service-role EMR_DefaultRole \
--security-configuration MyKerberosConfig \
--configurations file://hue-config.json \
--applications Name=Hadoop Name=Hive Name=Hue Name=Spark \
--kerberos-attributes Realm=EC2.INTERNAL,KdcAdminPassword=<YourClusterKDCAdminPassword>,ADDomainJoinUser=<YourADUserLogonName>,ADDomainJoinPassword=<YourADUserPassword>,CrossRealmTrustPrincipalPassword=<MatchADTrustPwd>

Note: If you create the cluster using CLI, you need to save the JSON configuration for Hue into a file named hue-config.json and place it on the server where you run the CLI command.

After the cluster gets into the Waiting state, try to connect to the cluster over SSH using an Active Directory user name and password.

ssh -l admin1@emrkrb.test.com <EMR IP or DNS name>

Quickly run two commands to show that the Active Directory join is successful:

  1. id [user name] shows the mapped AD users and groups in Linux.
  2. hdfs groups [user name] shows the mapped groups in Hadoop.

Both should return the current Active Directory user and group information if the setup is correct.

Now you can test the user mapping. Log in with the admin1 user, and run a Hadoop list directory command:

hadoop fs -ls s3://emrfs-auth-data-science-bucket-demo/

Now switch to a user from the data engineer group.

Retry the previous command as the data engineer user. Because this user doesn’t have access to the data science bucket, it should throw an Amazon S3 Access Denied exception.

When you list the Amazon S3 bucket that the data engineering group does have access to, the group mapping is triggered.

hadoop fs -ls s3://emrfs-auth-data-engineering-bucket-demo/

It successfully returns the listing results. Next we will test Apache Hive and then Apache Spark.

 

To run jobs successfully, you need to create a home directory under /user/<username> in HDFS for every user, for staging data. You can configure a step to create a home directory at cluster launch time for every user who has access to the cluster (a manual sketch of this follows below). In this example, we rely on Hue, because Hue creates the home directory in HDFS for a user at first login. For this to work, Hue needs to be integrated with the same Active Directory, as shown in the example configuration described earlier.
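A sketch of creating those home directories manually (for example, as an EMR step or from the master node), assuming the example users from this post:

# Run on the master node as a user with HDFS superuser rights (for example, via sudo -u hdfs).
# Creates a staging home directory for each example user and hands over ownership.
for u in admin1 dataeng1 datascientist1; do
  sudo -u hdfs hdfs dfs -mkdir -p /user/$u
  sudo -u hdfs hdfs dfs -chown $u /user/$u
  sudo -u hdfs hdfs dfs -chmod 700 /user/$u
done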

First, log in to Hue as a data engineer user, and open a Hive Notebook in Hue. Then run a query to create a new table pointing to the data engineer bucket, s3://emrfs-auth-data-engineering-bucket-demo/table1_data_eng/.
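A minimal sketch of such a query, with a hypothetical two-column schema (the point is only that LOCATION targets the data engineering bucket):

CREATE EXTERNAL TABLE table1_data_eng (
col1 STRING,
col2 INT
)
LOCATION 's3://emrfs-auth-data-engineering-bucket-demo/table1_data_eng/';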

You can see that the table was created successfully. Now try to create another table pointing to the data science group’s bucket, where the data engineer group doesn’t have access.

It failed and threw an Amazon S3 Access Denied error.

Now insert one line of data into the successfully created table.
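Continuing with the hypothetical schema above, the insert might look like this (Hive on EMR 5.10 supports INSERT ... VALUES; the values are placeholders):

INSERT INTO table1_data_eng VALUES ('sample_row', 1);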

Next, log out, switch to a data science group user, and create another table, test2_datasci_tb.

The creation is successful.

The last task is to test Spark (it requires the user’s home directory, which Hue created in the previous step).

Now let’s come back to the command line and run some Spark commands.

Log in to the master node as the datascientist1 user.

Start the SparkSQL interactive shell by typing spark-sql, and run the show tables command. It should list the tables that you created using Hive.

As a data science group user, try a SELECT on both tables. You will find that you can only select from the table defined in a location that your group has access to.
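A sketch of that session, using the table names from this walkthrough (table1_data_eng uses the hypothetical schema above):

spark-sql> SHOW TABLES;
spark-sql> SELECT * FROM test2_datasci_tb LIMIT 5;   -- succeeds for a data science group user
spark-sql> SELECT * FROM table1_data_eng LIMIT 5;    -- fails with an Amazon S3 Access Denied error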

Conclusion

EMRFS authorization for Amazon S3 enables you to have multiple roles on the same cluster, providing flexibility to configure a shared cluster for different teams to achieve better efficiency. The Active Directory integration and group mapping make it much easier for you to manage your users and groups, and provides better auditability in a multi-tenant environment.


Additional Reading

If you found this post useful, be sure to check out Use Kerberos Authentication to Integrate Amazon EMR with Microsoft Active Directory and Launching and Running an Amazon EMR Cluster inside a VPC.


About the Authors

Songzhi Liu is a Big Data Consultant with AWS Professional Services. He works closely with AWS customers to provide them with Big Data & Machine Learning solutions and best practices on the Amazon cloud.

Analyze Data with Presto and Airpal on Amazon EMR

Post Syndicated from Songzhi Liu original https://blogs.aws.amazon.com/bigdata/post/Tx1BF2DN6KRFI27/Analyze-Data-with-Presto-and-Airpal-on-Amazon-EMR

Songzhi Liu is a Professional Services Consultant with AWS

You can now launch Presto version 0.119 on Amazon EMR, allowing you to easily spin up a managed EMR cluster with the Presto query engine and run interactive analysis on data stored in Amazon S3. You can integrate with Spot instances, publish logs to an S3 bucket, and use EMR’s configure API to configure Presto. In this post, I’ll show you how to set up a Presto cluster and use Airpal to process data stored in S3.

What is Presto?

Presto is a distributed SQL query engine optimized for ad hoc analysis. It supports the ANSI SQL standard, including complex queries, aggregations, joins, and window functions. Presto can run on multiple data sources, including Amazon S3. Presto’s execution framework is fundamentally different from that of Hive/MapReduce: Presto has a custom query and execution engine where the stages of execution are pipelined, similar to a directed acyclic graph (DAG), and all processing occurs in memory to reduce disk I/O. This pipelined execution model can run multiple stages in parallel and streams data from one stage to another as the data becomes available. This reduces end-to-end latency and makes Presto a great tool for ad hoc data exploration over large datasets.

What is Airpal?

Airpal is a web-based query execution tool open-sourced by Airbnb that leverages Presto to facilitate data analysis. Airpal has many helpful features. For example, you can highlight syntax, export results to CSV for download, view query history, save queries, use a Table Finder to search for appropriate tables, and use Table Explorer to visualize the schema of a table. We have created an AWS CloudFormation script that makes it easy to set up Airpal on an Amazon EC2 instance on AWS.

For this blog post, we use Wikimedia’s page count data, which is publicly available at s3://support.elasticmapreduce/training/dataset/wikistats/. This data is in text file format. We will also convert the table to Parquet and ORC.

Spin up an EMR cluster with Hive and Presto installed

First, log in to the AWS console and navigate to the EMR console. Choose EMR-4.1.0 and Presto-Sandbox. Make sure you provide SSH keys so that you can log into the cluster.

Note: Write down the DNS name after creation is complete. You’ll need this for the next step.

 

Use AWS CloudFormation to deploy the Airpal server

Make sure you have a valid Key Pair for the region in which you want to deploy Airpal.

Navigate to AWS CloudFormation, click Create New Stack, name your stack, and choose Specify an Amazon S3 template URL.

Use the template in https://s3-external-1.amazonaws.com/emr.presto.airpal/scripts/deploy_airpal_env.json

Click Next and configure the parameters.

Important parameters you should configure:

PrestoCoordinatorURL: Use the DNS name you noted earlier, in the format http://<DNS name of the cluster>:<Presto port>. The default port for the Presto installation is 8889. Example: http://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:8889

AirpalPort: The port on which the Airpal server should run. The default is 8193. Adjust this according to your firewall settings and make sure the port isn’t blocked.

S3BootstrapBuckets: The S3 bucket that holds the bootstrap scripts. There is no need to change the default value of emr.presto.airpal.

InstallAirpal: The path to the installation script for the Airpal server. There is no need to change the default value of scripts/install_airpal.sh.

StartAirpal: The path to the startup script for the Airpal server. There is no need to change the default value of scripts/start_airpal.sh.

MyKeyPairName: Select a valid key pair that you have in this region. You’ll use it to log in to the master node.

Click Next and add a tag to the stack if needed. Select the check box for IAM policy and click Create.

Wait 5-10 minutes after the stack status changes to CREATE_COMPLETE. (The server configuration takes longer than the stack creation.)

Navigate to the EC2 console, select the Airpal server instance, and note its public IP address.

Open a browser and go to <PublicIP>:<AirpalPort> to reach Airpal. Make sure that port 8889 is allowed on the master security group for your EMR cluster.

Log in to the master node and run Hive scripts

Presto ships with several connectors. To query data from Amazon S3, you need to use the Hive connector. Presto uses Hive only to create the metadata; Presto’s execution engine is different from that of Hive. By default, when you install Presto on your cluster, EMR installs Hive as well. The metadata is stored in a database such as MySQL and is accessed through the Hive metastore service, which is also installed.

The dataset, around 7 GB, contains hit data for Wikipedia pages. The schema is as follows:

Language of the page

Title of the page

Number of hits

Retrieved page size

Define the schema

To define the schema:

Log in to the master node using the following command in the terminal:

ssh -i YourKeyPair.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

Replace YourKeyPair.pem with the location and name of your .pem file. Replace ec2-xx-xx-xx-xx.compute-1.amazonaws.com with the public DNS name of your EMR cluster.

Type “hive” in the command line to enter Hive interactive mode and run the following commands:

CREATE EXTERNAL TABLE wikistats (
language STRING,
page_title STRING,
hits BIGINT,
retrived_size BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LINES TERMINATED BY '\n'
LOCATION 's3://support.elasticmapreduce/training/datasets/wikistats/';

Now you have created a “wikistats” table stored as delimited text. You can also store this table in the Parquet format using the following command:

CREATE EXTERNAL TABLE wikistats_parq (
language STRING,
page_title STRING,
hits BIGINT,
retrived_size BIGINT
)
STORED AS PARQUET
LOCATION 's3://emr.presto.airpal/wikistats/parquet';

You can store it in the compressed ORC format using the following command:

CREATE EXTERNAL TABLE wikistats_orc (
language STRING,
page_title STRING,
hits BIGINT,
retrived_size BIGINT
)
STORED AS ORC
LOCATION 's3://emr.presto.airpal/wikistats/orc';

Now we have three tables holding the same data in three different formats.
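If you want to populate the Parquet and ORC tables yourself, point their LOCATION at a bucket you own rather than the public example bucket, and then convert from the text table; a sketch of that conversion in Hive:

-- Assumes wikistats_parq and wikistats_orc were created with LOCATION set to a bucket you own
INSERT OVERWRITE TABLE wikistats_parq SELECT * FROM wikistats;
INSERT OVERWRITE TABLE wikistats_orc SELECT * FROM wikistats;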

Try Presto in Airpal

Open a browser and go to http://<IP address of the EC2 instance>:8193.

You will use Presto queries to answer the questions below. Paste the following queries into the Airpal query editor.

What is the most frequently viewed page with page_title that contains “Amazon”?

SELECT language,
page_title,
SUM(hits) AS hits
FROM default.wikistats
WHERE language = 'en'
AND page_title LIKE '%Amazon%'
GROUP BY language,
page_title
ORDER BY hits DESC
LIMIT 10;

 

On average, which page is hit the most in English?

SELECT language, page_title, AVG(hits) AS avg_hits
FROM default.wikistats
WHERE language = 'en'
GROUP BY language, page_title
ORDER BY avg_hits DESC
LIMIT 10;

Try wikistats_orc and wikistats_parq with the same query. Do you see any difference in performance?
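For example, the same aggregation against the ORC-backed table (swap in wikistats_parq to compare Parquet):

SELECT language, page_title, AVG(hits) AS avg_hits
FROM default.wikistats_orc
WHERE language = 'en'
GROUP BY language, page_title
ORDER BY avg_hits DESC
LIMIT 10;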

Go back to Airpal and view the results. The top records are Main_Page, Special:, 404_error, and so on, which we don’t really care about. These entries are noise here, so you should filter them out in your query:

SELECT language, page_title, AVG(hits) AS avg_hits
FROM default.wikistats
WHERE language = 'en'
AND page_title NOT IN ('Main_Page', '404_error/')
AND page_title NOT LIKE '%Special%'
AND page_title NOT LIKE '%index%'
AND page_title NOT LIKE '%Search%'
AND NOT regexp_like(page_title, '%20')
GROUP BY language, page_title
ORDER BY avg_hits DESC
LIMIT 10;

Using the Presto CLI

You can also use the Presto CLI directly on the EMR cluster to query the data.

Log in to the master node using the following command in the terminal:

ssh -i YourKeyPair.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

Replace YourKeyPair.pem with the location and name of your pem file. Replace ec2-xx-xx-xx-xx.compute-1.amazonaws.com with the public DNS name of your EMR cluster.

Assuming you already defined the schema using Hive, start the Presto-CLI.

Run the following command:

$ presto-cli --catalog hive --schema default

Check to see if the table is still there.
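For example:

presto:default> SHOW TABLES;
presto:default> DESCRIBE wikistats;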

Try the same query you tried earlier.

SELECT language, page_title, AVG(hits) AS avg_hits
FROM default.wikistats
WHERE language = 'en'
AND page_title NOT IN ('Main_Page', '404_error/')
AND page_title NOT LIKE '%Special%'
AND page_title NOT LIKE '%index%'
AND page_title NOT LIKE '%Search%'
AND NOT regexp_like(page_title, '%20')
GROUP BY language, page_title
ORDER BY avg_hits DESC
LIMIT 10;

As you can see, you can also execute the query from the Presto CLI.

Summary

Presto is a distributed SQL query engine optimized for ad hoc analysis and data exploration. It supports the ANSI SQL standard, including complex queries, aggregations, joins, and window functions. In this post, I’ve shown you how easy it is to set up an EMR cluster with Presto 0.119, create metadata using Hive, and use either the Presto CLI or Airpal to run interactive queries.

If you have questions or suggestions, please leave a comment below.

——————————————-

Related

Large-Scale Machine Learning with Spark on Amazon EMR

——————————————–

Love to work on open source? Check out EMR’s careers page.

 
