Tag Archives: Jobs

Implement continuous integration and delivery of serverless AWS Glue ETL applications using AWS Developer Tools

Post Syndicated from Prasad Alle original https://aws.amazon.com/blogs/big-data/implement-continuous-integration-and-delivery-of-serverless-aws-glue-etl-applications-using-aws-developer-tools/

AWS Glue is an increasingly popular way to develop serverless ETL (extract, transform, and load) applications for big data and data lake workloads. Organizations that transform their ETL applications to cloud-based, serverless ETL architectures need a seamless, end-to-end continuous integration and continuous delivery (CI/CD) pipeline: from source code, to build, to deployment, to product delivery. Having a good CI/CD pipeline can help your organization discover bugs before they reach production and deliver updates more frequently. It can also help developers write quality code and automate the ETL job release management process, mitigate risk, and more.

AWS Glue is a fully managed data catalog and ETL service. It simplifies and automates the difficult and time-consuming tasks of data discovery, conversion, and job scheduling. AWS Glue crawls your data sources and constructs a data catalog using pre-built classifiers for popular data formats and data types, including CSV, Apache Parquet, JSON, and more.

When you are developing ETL applications using AWS Glue, you might come across some of the following CI/CD challenges:

  • Iterative development with unit tests
  • Continuous integration and build
  • Pushing the ETL pipeline to a test environment
  • Pushing the ETL pipeline to a production environment
  • Testing ETL applications using real data (live test)
  • Exploring and validating data

In this post, I walk you through a solution that implements a CI/CD pipeline for serverless AWS Glue ETL applications supported by AWS Developer Tools (including AWS CodePipeline, AWS CodeCommit, and AWS CodeBuild) and AWS CloudFormation.

Solution overview

The following diagram shows the pipeline workflow:

This solution uses AWS CodePipeline, which lets you orchestrate and automate the test and deploy stages for ETL application source code. The solution consists of a pipeline that contains the following stages:

1.) Source Control: In this stage, the AWS Glue ETL job source code and the AWS CloudFormation template file for deploying the ETL jobs are both committed to version control. I chose to use AWS CodeCommit for version control.

To get the ETL job source code and AWS CloudFormation template, download the gluedemoetl.zip file. This solution is developed based on a previous post, Build a Data Lake Foundation with AWS Glue and Amazon S3.

2.) LiveTest: In this stage, all resources—including AWS Glue crawlers, jobs, S3 buckets, roles, and other resources that are required for the solution—are provisioned, deployed, live tested, and cleaned up.

The LiveTest stage includes the following actions:

  • Deploy: In this action, all the resources that are required for this solution (crawlers, jobs, buckets, roles, and so on) are provisioned and deployed using an AWS CloudFormation template.
  • AutomatedLiveTest: In this action, all the AWS Glue crawlers and jobs are executed and data exploration and validation tests are performed. These validation tests include, but are not limited to, record counts in both raw tables and transformed tables in the data lake and any other business validations. I used AWS CodeBuild for this action.
  • LiveTestApproval: This action is included for the cases in which a pipeline administrator approval is required to deploy/promote the ETL applications to the next stage. The pipeline pauses in this action until an administrator manually approves the release.
  • LiveTestCleanup: In this action, all the LiveTest stage resources, including test crawlers, jobs, roles, and so on, are deleted using the AWS CloudFormation template. This action helps minimize cost by ensuring that the test resources exist only for the duration of the AutomatedLiveTest and LiveTestApproval

3.) DeployToProduction: In this stage, all the resources are deployed using the AWS CloudFormation template to the production environment.

Try it out

This code pipeline takes approximately 20 minutes to complete the LiveTest test stage (up to the LiveTest approval stage, in which manual approval is required).

To get started with this solution, choose Launch Stack:

This creates the CI/CD pipeline with all of its stages, as described earlier. It performs an initial commit of the sample AWS Glue ETL job source code to trigger the first release change.

In the AWS CloudFormation console, choose Create. After the template finishes creating resources, you see the pipeline name on the stack Outputs tab.

After that, open the CodePipeline console and select the newly created pipeline. Initially, your pipeline’s CodeCommit stage shows that the source action failed.

Allow a few minutes for your new pipeline to detect the initial commit applied by the CloudFormation stack creation. As soon as the commit is detected, your pipeline starts. You will see the successful stage completion status as soon as the CodeCommit source stage runs.

In the CodeCommit console, choose Code in the navigation pane to view the solution files.

Next, you can watch how the pipeline goes through the LiveTest stage of the deploy and AutomatedLiveTest actions, until it finally reaches the LiveTestApproval action.

At this point, if you check the AWS CloudFormation console, you can see that a new template has been deployed as part of the LiveTest deploy action.

At this point, make sure that the AWS Glue crawlers and the AWS Glue job ran successfully. Also check whether the corresponding databases and external tables have been created in the AWS Glue Data Catalog. Then verify that the data is validated using Amazon Athena, as shown following.

Open the AWS Glue console, and choose Databases in the navigation pane. You will see the following databases in the Data Catalog:

Open the Amazon Athena console, and run the following queries. Verify that the record counts are matching.

SELECT count(*) FROM "nycitytaxi_gluedemocicdtest"."data";
SELECT count(*) FROM "nytaxiparquet_gluedemocicdtest"."datalake";

The following shows the raw data:

The following shows the transformed data:

The pipeline pauses the action until the release is approved. After validating the data, manually approve the revision on the LiveTestApproval action on the CodePipeline console.

Add comments as needed, and choose Approve.

The LiveTestApproval stage now appears as Approved on the console.

After the revision is approved, the pipeline proceeds to use the AWS CloudFormation template to destroy the resources that were deployed in the LiveTest deploy action. This helps reduce cost and ensures a clean test environment on every deployment.

Production deployment is the final stage. In this stage, all the resources—AWS Glue crawlers, AWS Glue jobs, Amazon S3 buckets, roles, and so on—are provisioned and deployed to the production environment using the AWS CloudFormation template.

After successfully running the whole pipeline, feel free to experiment with it by changing the source code stored on AWS CodeCommit. For example, if you modify the AWS Glue ETL job to generate an error, it should make the AutomatedLiveTest action fail. Or if you change the AWS CloudFormation template to make its creation fail, it should affect the LiveTest deploy action. The objective of the pipeline is to guarantee that all changes that are deployed to production are guaranteed to work as expected.

Conclusion

In this post, you learned how easy it is to implement CI/CD for serverless AWS Glue ETL solutions with AWS developer tools like AWS CodePipeline and AWS CodeBuild at scale. Implementing such solutions can help you accelerate ETL development and testing at your organization.

If you have questions or suggestions, please comment below.

 


Additional Reading

If you found this post useful, be sure to check out Implement Continuous Integration and Delivery of Apache Spark Applications using AWS and Build a Data Lake Foundation with AWS Glue and Amazon S3.

 


About the Authors

Prasad Alle is a Senior Big Data Consultant with AWS Professional Services. He spends his time leading and building scalable, reliable Big data, Machine learning, Artificial Intelligence and IoT solutions for AWS Enterprise and Strategic customers. His interests extend to various technologies such as Advanced Edge Computing, Machine learning at Edge. In his spare time, he enjoys spending time with his family.

 
Luis Caro is a Big Data Consultant for AWS Professional Services. He works with our customers to provide guidance and technical assistance on big data projects, helping them improving the value of their solutions when using AWS.

 

 

 

How to retain system tables’ data spanning multiple Amazon Redshift clusters and run cross-cluster diagnostic queries

Post Syndicated from Karthik Sonti original https://aws.amazon.com/blogs/big-data/how-to-retain-system-tables-data-spanning-multiple-amazon-redshift-clusters-and-run-cross-cluster-diagnostic-queries/

Amazon Redshift is a data warehouse service that logs the history of the system in STL log tables. The STL log tables manage disk space by retaining only two to five days of log history, depending on log usage and available disk space.

To retain STL tables’ data for an extended period, you usually have to create a replica table for every system table. Then, for each you load the data from the system table into the replica at regular intervals. By maintaining replica tables for STL tables, you can run diagnostic queries on historical data from the STL tables. You then can derive insights from query execution times, query plans, and disk-spill patterns, and make better cluster-sizing decisions. However, refreshing replica tables with live data from STL tables at regular intervals requires schedulers such as Cron or AWS Data Pipeline. Also, these tables are specific to one cluster and they are not accessible after the cluster is terminated. This is especially true for transient Amazon Redshift clusters that last for only a finite period of ad hoc query execution.

In this blog post, I present a solution that exports system tables from multiple Amazon Redshift clusters into an Amazon S3 bucket. This solution is serverless, and you can schedule it as frequently as every five minutes. The AWS CloudFormation deployment template that I provide automates the solution setup in your environment. The system tables’ data in the Amazon S3 bucket is partitioned by cluster name and query execution date to enable efficient joins in cross-cluster diagnostic queries.

I also provide another CloudFormation template later in this post. This second template helps to automate the creation of tables in the AWS Glue Data Catalog for the system tables’ data stored in Amazon S3. After the system tables are exported to Amazon S3, you can run cross-cluster diagnostic queries on the system tables’ data and derive insights about query executions in each Amazon Redshift cluster. You can do this using Amazon QuickSight, Amazon Athena, Amazon EMR, or Amazon Redshift Spectrum.

You can find all the code examples in this post, including the CloudFormation templates, AWS Glue extract, transform, and load (ETL) scripts, and the resolution steps for common errors you might encounter in this GitHub repository.

Solution overview

The solution in this post uses AWS Glue to export system tables’ log data from Amazon Redshift clusters into Amazon S3. The AWS Glue ETL jobs are invoked at a scheduled interval by AWS Lambda. AWS Systems Manager, which provides secure, hierarchical storage for configuration data management and secrets management, maintains the details of Amazon Redshift clusters for which the solution is enabled. The last-fetched time stamp values for the respective cluster-table combination are maintained in an Amazon DynamoDB table.

The following diagram covers the key steps involved in this solution.

The solution as illustrated in the preceding diagram flows like this:

  1. The Lambda function, invoke_rs_stl_export_etl, is triggered at regular intervals, as controlled by Amazon CloudWatch. It’s triggered to look up the AWS Systems Manager parameter store to get the details of the Amazon Redshift clusters for which the system table export is enabled.
  2. The same Lambda function, based on the Amazon Redshift cluster details obtained in step 1, invokes the AWS Glue ETL job designated for the Amazon Redshift cluster. If an ETL job for the cluster is not found, the Lambda function creates one.
  3. The ETL job invoked for the Amazon Redshift cluster gets the cluster credentials from the parameter store. It gets from the DynamoDB table the last exported time stamp of when each of the system tables was exported from the respective Amazon Redshift cluster.
  4. The ETL job unloads the system tables’ data from the Amazon Redshift cluster into an Amazon S3 bucket.
  5. The ETL job updates the DynamoDB table with the last exported time stamp value for each system table exported from the Amazon Redshift cluster.
  6. The Amazon Redshift cluster system tables’ data is available in Amazon S3 and is partitioned by cluster name and date for running cross-cluster diagnostic queries.

Understanding the configuration data

This solution uses AWS Systems Manager parameter store to store the Amazon Redshift cluster credentials securely. The parameter store also securely stores other configuration information that the AWS Glue ETL job needs for extracting and storing system tables’ data in Amazon S3. Systems Manager comes with a default AWS Key Management Service (AWS KMS) key that it uses to encrypt the password component of the Amazon Redshift cluster credentials.

The following table explains the global parameters and cluster-specific parameters required in this solution. The global parameters are defined once and applicable at the overall solution level. The cluster-specific parameters are specific to an Amazon Redshift cluster and repeat for each cluster for which you enable this post’s solution. The CloudFormation template explained later in this post creates these parameters as part of the deployment process.

Parameter name Type Description
Global parametersdefined once and applied to all jobs
redshift_query_logs.global.s3_prefix String The Amazon S3 path where the query logs are exported. Under this path, each exported table is partitioned by cluster name and date.
redshift_query_logs.global.tempdir String The Amazon S3 path that AWS Glue ETL jobs use for temporarily staging the data.
redshift_query_logs.global.role> String The name of the role that the AWS Glue ETL jobs assume. Just the role name is sufficient. The complete Amazon Resource Name (ARN) is not required.
redshift_query_logs.global.enabled_cluster_list StringList A comma-separated list of cluster names for which system tables’ data export is enabled. This gives flexibility for a user to exclude certain clusters.
Cluster-specific parametersfor each cluster specified in the enabled_cluster_list parameter
redshift_query_logs.<<cluster_name>>.connection String The name of the AWS Glue Data Catalog connection to the Amazon Redshift cluster. For example, if the cluster name is product_warehouse, the entry is redshift_query_logs.product_warehouse.connection.
redshift_query_logs.<<cluster_name>>.user String The user name that AWS Glue uses to connect to the Amazon Redshift cluster.
redshift_query_logs.<<cluster_name>>.password Secure String The password that AWS Glue uses to connect the Amazon Redshift cluster’s encrypted-by key that is managed in AWS KMS.

For example, suppose that you have two Amazon Redshift clusters, product-warehouse and category-management, for which the solution described in this post is enabled. In this case, the parameters shown in the following screenshot are created by the solution deployment CloudFormation template in the AWS Systems Manager parameter store.

Solution deployment

To make it easier for you to get started, I created a CloudFormation template that automatically configures and deploys the solution—only one step is required after deployment.

Prerequisites

To deploy the solution, you must have one or more Amazon Redshift clusters in a private subnet. This subnet must have a network address translation (NAT) gateway or a NAT instance configured, and also a security group with a self-referencing inbound rule for all TCP ports. For more information about why AWS Glue ETL needs the configuration it does, described previously, see Connecting to a JDBC Data Store in a VPC in the AWS Glue documentation.

To start the deployment, launch the CloudFormation template:

CloudFormation stack parameters

The following table lists and describes the parameters for deploying the solution to export query logs from multiple Amazon Redshift clusters.

Property Default Description
S3Bucket mybucket The bucket this solution uses to store the exported query logs, stage code artifacts, and perform unloads from Amazon Redshift. For example, the mybucket/extract_rs_logs/data bucket is used for storing all the exported query logs for each system table partitioned by the cluster. The mybucket/extract_rs_logs/temp/ bucket is used for temporarily staging the unloaded data from Amazon Redshift. The mybucket/extract_rs_logs/code bucket is used for storing all the code artifacts required for Lambda and the AWS Glue ETL jobs.
ExportEnabledRedshiftClusters Requires Input A comma-separated list of cluster names from which the system table logs need to be exported.
DataStoreSecurityGroups Requires Input A list of security groups with an inbound rule to the Amazon Redshift clusters provided in the parameter, ExportEnabledClusters. These security groups should also have a self-referencing inbound rule on all TCP ports, as explained on Connecting to a JDBC Data Store in a VPC.

After you launch the template and create the stack, you see that the following resources have been created:

  1. AWS Glue connections for each Amazon Redshift cluster you provided in the CloudFormation stack parameter, ExportEnabledRedshiftClusters.
  2. All parameters required for this solution created in the parameter store.
  3. The Lambda function that invokes the AWS Glue ETL jobs for each configured Amazon Redshift cluster at a regular interval of five minutes.
  4. The DynamoDB table that captures the last exported time stamps for each exported cluster-table combination.
  5. The AWS Glue ETL jobs to export query logs from each Amazon Redshift cluster provided in the CloudFormation stack parameter, ExportEnabledRedshiftClusters.
  6. The IAM roles and policies required for the Lambda function and AWS Glue ETL jobs.

After the deployment

For each Amazon Redshift cluster for which you enabled the solution through the CloudFormation stack parameter, ExportEnabledRedshiftClusters, the automated deployment includes temporary credentials that you must update after the deployment:

  1. Go to the parameter store.
  2. Note the parameters <<cluster_name>>.user and redshift_query_logs.<<cluster_name>>.password that correspond to each Amazon Redshift cluster for which you enabled this solution. Edit these parameters to replace the placeholder values with the right credentials.

For example, if product-warehouse is one of the clusters for which you enabled system table export, you edit these two parameters with the right user name and password and choose Save parameter.

Querying the exported system tables

Within a few minutes after the solution deployment, you should see Amazon Redshift query logs being exported to the Amazon S3 location, <<S3Bucket_you_provided>>/extract_redshift_query_logs/data/. In that bucket, you should see the eight system tables partitioned by customer name and date: stl_alert_event_log, stl_dlltext, stl_explain, stl_query, stl_querytext, stl_scan, stl_utilitytext, and stl_wlm_query.

To run cross-cluster diagnostic queries on the exported system tables, create external tables in the AWS Glue Data Catalog. To make it easier for you to get started, I provide a CloudFormation template that creates an AWS Glue crawler, which crawls the exported system tables stored in Amazon S3 and builds the external tables in the AWS Glue Data Catalog.

Launch this CloudFormation template to create external tables that correspond to the Amazon Redshift system tables. S3Bucket is the only input parameter required for this stack deployment. Provide the same Amazon S3 bucket name where the system tables’ data is being exported. After you successfully create the stack, you can see the eight tables in the database, redshift_query_logs_db, as shown in the following screenshot.

Now, navigate to the Athena console to run cross-cluster diagnostic queries. The following screenshot shows a diagnostic query executed in Athena that retrieves query alerts logged across multiple Amazon Redshift clusters.

You can build the following example Amazon QuickSight dashboard by running cross-cluster diagnostic queries on Athena to identify the hourly query count and the key query alert events across multiple Amazon Redshift clusters.

How to extend the solution

You can extend this post’s solution in two ways:

  • Add any new Amazon Redshift clusters that you spin up after you deploy the solution.
  • Add other system tables or custom query results to the list of exports from an Amazon Redshift cluster.

Extend the solution to other Amazon Redshift clusters

To extend the solution to more Amazon Redshift clusters, add the three cluster-specific parameters in the AWS Systems Manager parameter store following the guidelines earlier in this post. Modify the redshift_query_logs.global.enabled_cluster_list parameter to append the new cluster to the comma-separated string.

Extend the solution to add other tables or custom queries to an Amazon Redshift cluster

The current solution ships with the export functionality for the following Amazon Redshift system tables:

  • stl_alert_event_log
  • stl_dlltext
  • stl_explain
  • stl_query
  • stl_querytext
  • stl_scan
  • stl_utilitytext
  • stl_wlm_query

You can easily add another system table or custom query by adding a few lines of code to the AWS Glue ETL job, <<cluster-name>_extract_rs_query_logs. For example, suppose that from the product-warehouse Amazon Redshift cluster you want to export orders greater than $2,000. To do so, add the following five lines of code to the AWS Glue ETL job product-warehouse_extract_rs_query_logs, where product-warehouse is your cluster name:

  1. Get the last-processed time-stamp value. The function creates a value if it doesn’t already exist.

salesLastProcessTSValue = functions.getLastProcessedTSValue(trackingEntry=”mydb.sales_2000",job_configs=job_configs)

  1. Run the custom query with the time stamp.

returnDF=functions.runQuery(query="select * from sales s join order o where o.order_amnt > 2000 and sale_timestamp > '{}'".format (salesLastProcessTSValue) ,tableName="mydb.sales_2000",job_configs=job_configs)

  1. Save the results to Amazon S3.

functions.saveToS3(dataframe=returnDF,s3Prefix=s3Prefix,tableName="mydb.sales_2000",partitionColumns=["sale_date"],job_configs=job_configs)

  1. Get the latest time-stamp value from the returned data frame in Step 2.

latestTimestampVal=functions.getMaxValue(returnDF,"sale_timestamp",job_configs)

  1. Update the last-processed time-stamp value in the DynamoDB table.

functions.updateLastProcessedTSValue(“mydb.sales_2000",latestTimestampVal[0],job_configs)

Conclusion

In this post, I demonstrate a serverless solution to retain the system tables’ log data across multiple Amazon Redshift clusters. By using this solution, you can incrementally export the data from system tables into Amazon S3. By performing this export, you can build cross-cluster diagnostic queries, build audit dashboards, and derive insights into capacity planning by using services such as Athena. I also demonstrate how you can extend this solution to other ad hoc query use cases or tables other than system tables by adding a few lines of code.


Additional Reading

If you found this post useful, be sure to check out Using Amazon Redshift Spectrum, Amazon Athena, and AWS Glue with Node.js in Production and Amazon Redshift – 2017 Recap.


About the Author

Karthik Sonti is a senior big data architect at Amazon Web Services. He helps AWS customers build big data and analytical solutions and provides guidance on architecture and best practices.

 

 

 

 

MPAA Aims to Prevent Piracy Leaks With New Security Program

Post Syndicated from Andy original https://torrentfreak.com/mpaa-aims-to-prevent-piracy-leaks-with-new-security-program-180403/

When movies and TV shows leak onto the Internet in advance of their intended release dates, it’s generally a time of celebration for pirates.

Grabbing a workprint or DVD screener of an Oscar nominee or a yet to be aired on TV show makes the Internet bubble with excitement. But for the studios and companies behind the products, it presents their worst nightmare.

Despite all the takedown efforts known to man, once content appears, there’s no putting the genie back into the bottle.

With this in mind, the solution doesn’t lie with reactionary efforts such as Internet disconnections, site-blocking and similar measures, but better hygiene while content is still in production or being prepared for distribution. It’s something the MPAA hopes to address with a brand new program designed to bring the security of third-party vendors up to scratch.

The Trusted Partner Network (TPN) is the brainchild of the MPAA and the Content Delivery & Security Association (CDSA), a worldwide forum advocating the innovative and responsible delivery and storage of entertainment content.

TPN is being touted as a global industry-wide film and television content protection initiative which will help companies prevent leaks, breaches, and hacks of their customers’ movies and television shows prior to their intended release.

“Content is now created by a growing ecosystem of third-party vendors, who collaborate with varying degrees of security,” TPN explains.

“This has escalated the security threat to the entertainment industry’s most prized asset, its content. The TPN program seeks to raise security awareness, preparedness, and capabilities within our industry.”

The TPN will establish a “single benchmark of minimum security preparedness” for vendors whose details will be available via centralized and global “trusted partner” database. The TPN will replace security assessments programs already in place at the MPAA and CDSA.

While content owners and vendors are still able to conduct their own security assessments on an “as-needed” basis, the aim is for the TPN to reduce the number of assessments carried out while assisting in identifying vulnerabilities. The pool of “trusted partners” is designed to help all involved understand and meet the challenges of leaks, whether that’s movie, TV show, or associated content.

While joining the TPN program is voluntary, there’s a strong suggestion that becoming involved in the program is in vendors’ best interests. Being able to carry the TPN logo will be an asset to doing business with others involved in the scheme, it’s suggested.

Once in, vendors will need to hire a TPN-approved assessor to carry out an initial audit of their supply chain and best practices, which in turn will need to be guided by the MPAA’s existing content security guidelines.

“Vendors will hire a Qualified Assessor from the TPN database and will schedule their assessment and manage the process via the secure online platform,” TPN says, noting that vendors will cover their own costs unless an assessment is carried out at the request of a content owner.

The TPN explains that members of the scheme aren’t passed or failed in respect of their security preparedness. However, there’s an expectation they will be expected to come up to scratch and prove that with a subsequent positive report from a TPN approved assessor. Assessors themselves will also be assessed via the TPN Qualified Assessor Program.

By imposing MPAA best practices upon partner companies, it’s hoped that some if not all of the major leaks that have plagued the industry over the past several years will be prevented in future. Whether that’s the usual DVD screener leaks, workprints, scripts or other content, it’s believed the TPN should be able to help in some way, although the former might be a more difficult nut to crack.

There’s no doubting that the problem TPN aims to address is serious. In 2017 alone, hackers and other individuals obtained and then leaked episodes of Orange is the New Black, unreleased ABC content, an episode of Game of Thrones sourced from India and scripts from the same show. Even blundering efforts managed to make their mark.

“Creating the films and television shows enjoyed by audiences around the world increasingly requires a network of specialized vendors and technicians,” says MPAA chairman and CEO Charles Rivkin.

“That’s why maintaining high security standards for all third-party operations — from script to screen — is such an important part of preventing the theft of creative works and ultimately protects jobs and the health of our vibrant creative economy.”

According to TPN, the first class of TPN Assessors was recruited and tested last month while beta-testing of key vendors will begin in April. The full program will roll out in June 2018.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN reviews, discounts, offers and coupons.

UK Urges Online Intermediaries to Tackle Piracy, Or Else

Post Syndicated from Ernesto original https://torrentfreak.com/uk-urges-online-intermediaries-to-tackle-piracy-or-else-180329/

In recent years the UK Government has been very proactive when it comes to intellectual property enforcement, supporting a broad range of anti-piracy initiatives.

The authorities have also pushed for cooperation between copyright holders and online intermediaries. Last year, this resulted in a ‘landmark’ agreement between the creative industries and search engines, to tackle online piracy.

In a new Industrial Strategy White Paper released this week the Government highlights this deal as a great success. However, it was only the start. More is needed to properly address the piracy problem.

“Online piracy continues to be a serious inhibitor to growth in the creative industries. Technologies like stream ripping and illicit streaming devices enable illegitimate access to content without rewarding its creators.

“Many rights holders are also concerned about how their works are exploited online, especially where they are used without generating substantial returns for content creators,” the Government adds.

The report outlines a broad strategy on how the Government and the creative industries can work together. This includes financial support but also concrete interventions regarding online intermediaries.

As with the search engines before, the Government plans to host a series of roundtables with copyright holders, social media companies, user upload platforms, digital advertising outfits, and online marketplaces. The goal of these meetings is to broker voluntary anti-piracy agreements.

The roundtables will be used to identify any significant piracy problems and develop ‘voluntary’ codes of practice to address these, including upload filters.

“These measures could include proactive steps to detect and remove illegal content, improving the effectiveness of notice and takedown arrangements, reducing incentives for illegal sites to engage in infringement online and reducing the burdens on rights holders in relation to protecting their content,” the Government writes.

While the envisioned codes of conduct are voluntary, the Government notes that if these roundtables fail to produce the desired outcome, new legislation may be put in place.

“[If this] fails to result in the agreement of an effective code by 31 December 2018, government will consider further legislative action to strengthen the UK copyright framework to ensure that the identified problems are addressed.”

This type of warning is not new. The UK Government used similar language when it tried to convince search engines to reach a voluntary anti-piracy agreement with copyright holders. This eventually paid off.

In addition to brokering voluntary codes, the UK Government says it will also continue to address the so-called “value gap” in both the UK and Europe.

At the same time, the Government also renewed its support for the ‘Get it Right’ campaign. It will make an additional £2 million available which, among other things, will be used to educate consumers on the dangers of copyright infringement and warn pirating subscribers.

The UK Government hopes that these and other incentives will eventually help the creative industry to flourish, so it created new jobs and benefit the UK economy as a whole.

“Together we can build on the UK’s position as a global leader and strengthen its advantage as a creative nation by increasing the number of opportunities and jobs in the creative industries across the country, improving their productivity, and enabling us to greatly expand our trading ambitions abroad.”

A copy of the white paper “Industrial Strategy: building a Britain fit for the future” is available here.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN reviews, discounts, offers and coupons.

Welcome Nathan – Our Solutions Engineer

Post Syndicated from Yev original https://www.backblaze.com/blog/welcome-nathan-our-solutions-engineer/

Backblaze is growing, and with it our need to cater to a lot of different use cases that our customers bring to us. We needed a Solutions Engineer to help out, and after a long search we’ve hired our first one! Lets learn a bit more about Nathan shall we?

What is your Backblaze Title?
Solutions Engineer. Our customers bring a thousand different use cases to both B1 and B2, and I’m here to help them figure out how best to make those use cases a reality. Also, any odd jobs that Nilay wants me to do.

Where are you originally from?
I am native to the San Francisco Bay Area, studying mathematics at UC Santa Cruz, and then computer science at California University of Hayward (which has since renamed itself California University of the East Hills. I observe that it’s still in Hayward).

What attracted you to Backblaze?
As a stable, growing company with huge growth and even bigger potential, the business model is attractive, and the team is outstanding. Add to that the strong commitment to transparency, and it’s a hard company to resist. We can store – and restore – data while offering superior reliability at an economic advantage to do-it-yourself, and that’s a great place to be.

What do you expect to learn while being at Backblaze?
Everything I need to, but principally how our customers choose to interact with web storage. Storage isn’t a solution per se, but it’s an important component of any persistent solution. I’m looking forward to working with all the different concepts our customers have to make use of storage.

Where else have you worked?
All sorts of places, but I’ll admit publicly to EMC, Gemalto, and my own little (failed, alas) startup, IC2N. I worked with low-level document imaging.

Where did you go to school?
UC Santa Cruz, BA Mathematics CU Hayward, Master of Science in Computer Science.

What’s your dream job?
Sipping tea in the California redwood forest. However, solutions engineer at Backblaze is a good second choice!

Favorite place you’ve traveled?
Ashland, Oregon, for the Oregon Shakespeare Festival and the marble caves (most caves form from limestone).

Favorite hobby?
Theater. Pathfinder. Writing. Baking cookies and cakes.

Of what achievement are you most proud?
Marrying the most wonderful man in the world.

Star Trek or Star Wars?
Star Trek’s utopian science fiction vision of humanity and science resonates a lot more strongly with me than the dystopian science fantasy of Star Wars.

Coke or Pepsi?
Neither. I’d much rather have a cup of jasmine tea.

Favorite food?
It varies, but I love Indian and Thai cuisine. Truly excellent Italian food is marvelous – wood fired pizza, if I had to pick only one, but the world would be a boring place with a single favorite food.

Why do you like certain things?
If I knew that, I’d be in marketing.

Anything else you’d like you’d like to tell us?
If you haven’t already encountered the amazing authors Patricia McKillip and Lois McMasters Bujold – go encounter them. Be happy.

There’s nothing wrong with a nice cup of tea and a long game of Pathfinder. Sign us up! Welcome to the team Nathan!

The post Welcome Nathan – Our Solutions Engineer appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Wanted: Office Administrator

Post Syndicated from Yev original https://www.backblaze.com/blog/wanted-office-administrator-2/

At inception, Backblaze was a consumer company. Thousands upon thousands of individuals came to our website and gave us $5/mo to keep their data safe. But, we didn’t sell business solutions. It took us years before we had a sales team. In the last couple of years, we’ve released products that businesses of all sizes love: Backblaze B2 Cloud Storage and Backblaze for Business Computer Backup. Those businesses want to integrate Backblaze into their infrastructure, so it’s time to expand our teams!

Company Description:
Founded in 2007, Backblaze started with a mission to make backup software elegant and provide complete peace of mind. Over the course of almost a decade, we have become a pioneer in robust, scalable low cost cloud backup. Recently, we launched B2 – robust and reliable object storage at just $0.005/gb/mo. Part of our differentiation is being able to offer the lowest price of any of the big players while still being profitable.

We’ve managed to nurture a team oriented culture with amazingly low turnover. We value our people and their families. Don’t forget to check out our “About Us” page to learn more about the people and some of our perks.

We have built a profitable, high growth business. While we love our investors, we have maintained control over the business. That means our corporate goals are simple – grow sustainably and profitably.

Some Backblaze Perks:

  • Competitive healthcare plans
  • Competitive compensation and 401k
  • All employees receive Option grants
  • Unlimited vacation days
  • Strong coffee
  • Fully stocked Micro kitchen
  • Catered breakfast and lunches
  • Awesome people who work on awesome projects
  • New Parent Childcare bonus
  • Normal work hours
  • Get to bring your pets into the office
  • San Mateo Office – located near Caltrain and Highways 101 & 280.

Want to know what you’ll be doing?

You will play a pivotal role at Backblaze! You will be the glue that binds people together in the office and one of the main engines that keeps our company running. This is an exciting opportunity to help shape the company culture of Backblaze by making the office a fun and welcoming place to work. As an Office Administrator, your priority is to help employees have what they need to feel happy, comfortable, and productive at work; whether it’s refilling snacks, collecting shipments, responding to maintenance requests, ordering office supplies, or assisting with fun social events, your contributions will be critical to our culture.

Office Administrator Responsibilities:

  • Maintain a clean, well-stocked and organized office
  • Greet visitors and callers, route and resolve information requests
  • Ensure conference rooms and kitchen areas are clean and stocked
  • Sign for all packages delivered to the office as well as forward relevant departments
  • Administrative duties as assigned

Facilities Coordinator Responsibilities:

  • Act as point of contact for building facilities and other office vendors and deliveries
  • Work with HR to ensure new hires are welcomed successfully at Backblaze – to include desk/equipment orders, seat planning, and general facilities preparation
  • Work with the “Fun Committee” to support office events and activities
  • Be available after hours as required for ongoing business success (events, building issues)

Jr. Buyer Responsibilities:

  • Assist with creating purchase orders and buying equipment
  • Compare costs and maintain vendor cards in Quickbooks
  • Assist with booking travel, hotel accommodations, and conference rooms
  • Maintain accurate records of purchases and tracking orders
  • Maintain office equipment, physical space, and maintenance schedules
  • Manage company calendar, snack, and meal orders

Qualifications:

  • 1 year experience in an Inventory/Shipping/Receiving/Admin role preferred
  • Proficiency with Microsoft Office applications, Google Apps, Quickbooks, Excel
  • Experience and skill at adhering to a budget
  • High attention to detail
  • Proven ability to prioritize within a multi-tasking environment; highly organized
  • Collaborative and communicative
  • Hands-on, “can do” attitude
  • Personable and approachable
  • Able to lift up to 50 lbs
  • Strong data entry

This position is located in San Mateo, California. Backblaze is an Equal Opportunity Employer.

If this all sounds like you:

  1. Send an email to [email protected] with the position in the subject line.
  2. Tell us a bit about your work history.
  3. Include your resume.

The post Wanted: Office Administrator appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Wanted: Vault Storage Engineer

Post Syndicated from Yev original https://www.backblaze.com/blog/wanted-vault-storage-engineer/

Want to work at a company that helps customers in 156 countries around the world protect the memories they hold dear? A company that stores over 500 petabytes of customers’ photos, music, documents and work files in a purpose-built cloud storage system?

Well here’s your chance. Backblaze is looking for a Vault Storage Engineer!

Company Description:

Founded in 2007, Backblaze started with a mission to make backup software elegant and provide complete peace of mind. Over the course of almost a decade, we have become a pioneer in robust, scalable low cost cloud backup. Recently, we launched B2 — robust and reliable object storage at just $0.005/gb/mo. Part of our differentiation is being able to offer the lowest price of any of the big players while still being profitable.

We’ve managed to nurture a team oriented culture with amazingly low turnover. We value our people and their families. Don’t forget to check out our “About Us” page to learn more about the people and some of our perks.

We have built a profitable, high growth business. While we love our investors, we have maintained control over the business. That means our corporate goals are simple – grow sustainably and profitably.

Some Backblaze Perks:

  • Competitive healthcare plans
  • Competitive compensation and 401k
  • All employees receive Option grants
  • Unlimited vacation days
  • Strong coffee
  • Fully stocked Micro kitchen
  • Catered breakfast and lunches
  • Awesome people who work on awesome projects
  • New Parent Childcare bonus
  • Normal work hours
  • Get to bring your pets into the office
  • San Mateo Office – located near Caltrain and Highways 101 & 280.

Want to know what you’ll be doing?

You will work on the core of the Backblaze: the vault cloud storage system (https://www.backblaze.com/blog/vault-cloud-storage-architecture/). The system accepts files uploaded from customers, stores them durably by distributing them across the data center, automatically handles drive failures, rebuilds data when drives are replaced, and maintains high availability for customers to download their files. There are significant enhancements in the works, and you’ll be a part of making them happen.

Must have a strong background in:

  • Computer Science
  • Multi-threaded programming
  • Distributed Systems
  • Java
  • Math (such as matrix algebra and statistics)
  • Building reliable, testable systems

Bonus points for:

  • Java
  • JavaScript
  • Python
  • Cassandra
  • SQL

Looking for an attitude of:

  • Passionate about building reliable clean interfaces and systems.
  • Likes to work closely with other engineers, support, and sales to help customers.
  • Customer Focused (!!) — always focus on the customer’s point of view and how to solve their problem!

Required for all Backblaze Employees:

  • Good attitude and willingness to do whatever it takes to get the job done
  • Strong desire to work for a small fast-paced company
  • Desire to learn and adapt to rapidly changing technologies and work environment
  • Rigorous adherence to best practices
  • Relentless attention to detail
  • Excellent interpersonal skills and good oral/written communication
  • Excellent troubleshooting and problem solving skills

This position is located in San Mateo, California but will also consider remote work as long as you’re no more than three time zones away and can come to San Mateo now and then.

Backblaze is an Equal Opportunity Employer.

Contact Us:
If this sounds like you, follow these steps:

  1. Send an email to jobscontact@backblaze.com with the position in the subject line.
  2. Include your resume.
  3. Tell us a bit about your programming experience.

The post Wanted: Vault Storage Engineer appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Needed: Software Engineering Director

Post Syndicated from Yev original https://www.backblaze.com/blog/needed-software-engineering-director/

Company Description:

Founded in 2007, Backblaze started with a mission to make backup software elegant and provide complete peace of mind. Over the course of almost a decade, we have become a pioneer in robust, scalable low cost cloud backup. Recently, we launched B2, robust and reliable object storage at just $0.005/gb/mo. We offer the lowest price of any of the big players and are still profitable.

Backblaze has a culture of openness. The hardware designs for our storage pods are open source. Key parts of the software, including the Reed-Solomon erasure coding are open-source. Backblaze is the only company that publishes hard drive reliability statistics.

We’ve managed to nurture a team-oriented culture with amazingly low turnover. We value our people and their families. The team is distributed across the U.S., but we work in Pacific Time, so work is limited to work time, leaving evenings and weekends open for personal and family time. Check out our “About Us” page to learn more about the people and some of our perks.

We have built a profitable, high growth business. While we love our investors, we have maintained control over the business. That means our corporate goals are simple – grow sustainably and profitably.

Our engineering team is 10 software engineers, and 2 quality assurance engineers. Most engineers are experienced, and a couple are more junior. The team will be growing as the company grows to meet the demand for our products; we plan to add at least 6 more engineers in 2018. The software includes the storage systems that run in the data center, the web APIs that clients access, the web site, and client programs that run on phones, tablets, and computers.

The Job:

As the Director of Engineering, you will be:

  • managing the software engineering team
  • ensuring consistent delivery of top-quality services to our customers
  • collaborating closely with the operations team
  • directing engineering execution to scale the business and build new services
  • transforming a self-directed, scrappy startup team into a mid-size engineering organization

A successful director will have the opportunity to grow into the role of VP of Engineering. Backblaze expects to continue our exponential growth of our storage services in the upcoming years, with matching growth in the engineering team..

This position is located in San Mateo, California.

Qualifications:

We are a looking for a director who:

  • has a good understanding of software engineering best practices
  • has experience scaling a large, distributed system
  • gets energized by creating an environment where engineers thrive
  • understands the trade-offs between building a solid foundation and shipping new features
  • has a track record of building effective teams

Required for all Backblaze Employees:

  • Good attitude and willingness to do whatever it takes to get the job done
  • Strong desire to work for a small fast-paced company
  • Desire to learn and adapt to rapidly changing technologies and work environment
  • Rigorous adherence to best practices
  • Relentless attention to detail
  • Excellent interpersonal skills and good oral/written communication
  • Excellent troubleshooting and problem solving skills

Some Backblaze Perks:

  • Competitive healthcare plans
  • Competitive compensation and 401k
  • All employees receive Option grants
  • Unlimited vacation days
  • Strong coffee
  • Fully stocked Micro kitchen
  • Catered breakfast and lunches
  • Awesome people who work on awesome projects
  • New Parent Childcare bonus
  • Normal work hours
  • Get to bring your pets into the office
  • San Mateo Office — located near Caltrain and Highways 101 & 280.

Contact Us:

If this sounds like you, follow these steps:

  1. Send an email to jobscontact@backblaze.com with the position in the subject line.
  2. Include your resume.
  3. Tell us a bit about your experience.

Backblaze is an Equal Opportunity Employer.

The post Needed: Software Engineering Director appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

New Amazon EC2 Spot pricing model: Simplified purchasing without bidding and fewer interruptions

Post Syndicated from Roshni Pary original https://aws.amazon.com/blogs/compute/new-amazon-ec2-spot-pricing/

Contributed by Deepthi Chelupati and Roshni Pary

Amazon EC2 Spot Instances offer spare compute capacity in the AWS Cloud at steep discounts. Customers—including Yelp, NASA JPL, FINRA, and Autodesk—use Spot Instances to reduce costs and get faster results. Spot Instances provide acceleration, scale, and deep cost savings to big data workloads, containerized applications such as web services, test/dev, and many types of HPC and batch jobs.

At re:Invent 2017, we launched a new pricing model that simplified the Spot purchasing experience. The new model gives you predictable prices that adjust slowly over days and weeks, with typical savings of 70-90% over On-Demand. With the previous pricing model, some of you had to invest time and effort to analyze historical prices to determine your bidding strategy and maximum bid price. Not anymore.

How does the new pricing model work?

You don’t have to bid for Spot Instances in the new pricing model, and you just pay the Spot price that’s in effect for the current hour for the instances that you launch. It’s that simple. Now you can request Spot capacity just like you would request On-Demand capacity, without having to spend time analyzing market prices or setting a maximum bid price.

Previously, Spot Instances were terminated in ascending order of bids, and the Spot price was set to the highest unfulfilled bid. The market prices fluctuated frequently because of this. In the new model, the Spot prices are more predictable, updated less frequently, and are determined by supply and demand for Amazon EC2 spare capacity, not bid prices. You can find the price that’s in effect for the current hour in the EC2 console.

As you can see from the above Spot Instance Pricing History graph (available in the EC2 console under Spot Requests), Spot prices were volatile before the pricing model update. However, after the pricing model update, prices are more predictable and change less frequently.

In the new model, you still have the option to further control costs by submitting a “maximum price” that you are willing to pay in the console when you request Spot Instances:

You can also set your maximum price in EC2 RunInstances or RequestSpotFleet API calls, or in command line requests:

$ aws ec2 run-instances --instance-market-options 
'{"MarketType":"Spot", "SpotOptions": {"SpotPrice": "0.12"}}' \
    --image-id ami-1a2b3c4d --count 1 --instance-type c4.2xlarge

The default maximum price is the On-Demand price and you can continue to set a maximum Spot price of up to 10x the On-Demand price. That means, if you have been running applications on Spot Instances and use the RequestSpotInstances or RequestSpotFleet operations, you can continue to do so. The new Spot pricing model is backward compatible and you do not need to make any changes to your existing applications.

Fewer interruptions

Spot Instances receive a two-minute interruption notice when these instances are about to be reclaimed by EC2, because EC2 needs the capacity back. We have significantly reduced the interruptions with the new pricing model. Now instances are not interrupted because of higher competing bids, and you can enjoy longer workload runtimes. The typical frequency of interruption for Spot Instances in the last 30 days was less than 5% on average.

To reduce the impact of interruptions and optimize Spot Instances, diversify and run your application across multiple capacity pools. Each instance family, each instance size, in each Availability Zone, in every Region is a separate Spot pool. You can use the RequestSpotFleet API operation to launch thousands of Spot Instances and diversify resources automatically. To further reduce the impact of interruptions, you can also set up Spot Instances and Spot Fleets to respond to an interruption notice by stopping or hibernating rather than terminating instances when capacity is no longer available.

Spot Instances are now available in 18 Regions and 51 Availability Zones, and offer 100s of instance options. We have eliminated bidding, simplified the pricing model, and have made it easy to get started with Amazon EC2 Spot Instances for you to take advantage of the largest pool of cost-effective compute capacity in the world. See the Spot Instances detail page for more information and create your Spot Instance here.

Needed: Senior Software Engineer

Post Syndicated from Yev original https://www.backblaze.com/blog/needed-senior-software-engineer/

Want to work at a company that helps customers in 156 countries around the world protect the memories they hold dear? A company that stores over 500 petabytes of customers’ photos, music, documents and work files in a purpose-built cloud storage system?

Well, here’s your chance. Backblaze is looking for a Sr. Software Engineer!

Company Description:

Founded in 2007, Backblaze started with a mission to make backup software elegant and provide complete peace of mind. Over the course of almost a decade, we have become a pioneer in robust, scalable low cost cloud backup. Recently, we launched B2 – robust and reliable object storage at just $0.005/gb/mo. Part of our differentiation is being able to offer the lowest price of any of the big players while still being profitable.

We’ve managed to nurture a team oriented culture with amazingly low turnover. We value our people and their families. Don’t forget to check out our “About Us” page to learn more about the people and some of our perks.

We have built a profitable, high growth business. While we love our investors, we have maintained control over the business. That means our corporate goals are simple – grow sustainably and profitably.

Some Backblaze Perks:

  • Competitive healthcare plans
  • Competitive compensation and 401k
  • All employees receive Option grants
  • Unlimited vacation days
  • Strong coffee
  • Fully stocked Micro kitchen
  • Catered breakfast and lunches
  • Awesome people who work on awesome projects
  • New Parent Childcare bonus
  • Normal work hours
  • Get to bring your pets into the office
  • San Mateo Office – located near Caltrain and Highways 101 & 280

Want to know what you’ll be doing?

You will work on the server side APIs that authenticate users when they log in, accept the backups, manage the data, and prepare restored data for customers. And you will help build new features as well as support tools to help chase down and diagnose customer issues.

Must be proficient in:

  • Java
  • Apache Tomcat
  • Large scale systems supporting thousands of servers and millions of customers
  • Cross platform (Linux/Macintosh/Windows) — don’t need to be an expert on all three, but cannot be afraid of any

Bonus points for:

  • Cassandra experience
  • JavaScript
  • ReactJS
  • Python
  • Struts
  • JSP’s

Looking for an attitude of:

  • Passionate about building friendly, easy to use Interfaces and APIs.
  • Likes to work closely with other engineers, support, and sales to help customers.
  • Believes the whole world needs backup, not just English speakers in the USA.
  • Customer Focused (!!) — always focus on the customer’s point of view and how to solve their problem!

Required for all Backblaze Employees:

  • Good attitude and willingness to do whatever it takes to get the job done
  • Strong desire to work for a small, fast-paced company
  • Desire to learn and adapt to rapidly changing technologies and work environment
  • Rigorous adherence to best practices
  • Relentless attention to detail
  • Excellent interpersonal skills and good oral/written communication
  • Excellent troubleshooting and problem solving skills

This position is located in San Mateo, California but will also consider remote work as long as you’re no more than three time zones away and can come to San Mateo now and then.

Backblaze is an Equal Opportunity Employer.

If this sounds like you —follow these steps:

  1. Send an email to [email protected] with the position in the subject line.
  2. Include your resume.
  3. Tell us a bit about your programming experience.

The post Needed: Senior Software Engineer appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Needed: Sales Development Representative!

Post Syndicated from Yev original https://www.backblaze.com/blog/needed-sales-development-representative/

At inception, Backblaze was a consumer company. Thousands upon thousands of individuals came to our website and gave us $5/mo to keep their data safe. But, we didn’t sell business solutions. It took us years before we had a sales team. In the last couple of years, we’ve released products that businesses of all sizes love: Backblaze B2 Cloud Storage and Backblaze for Business Computer Backup. Those businesses want to integrate Backblaze into their infrastructure, so it’s time to expand our sales team and hire our first dedicated outbound Sales Development Representative!

Company Description:
Founded in 2007, Backblaze started with a mission to make backup software elegant and provide complete peace of mind. Over the course of almost a decade, we have become a pioneer in robust, scalable low cost cloud backup. Recently, we launched B2 — robust and reliable object storage at just $0.005/gb/mo. Part of our differentiation is being able to offer the lowest price of any of the big players while still being profitable.

We’ve managed to nurture a team oriented culture with amazingly low turnover. We value our people and their families. Don’t forget to check out our “About Us” page to learn more about the people and some of our perks.

We have built a profitable, high growth business. While we love our investors, we have maintained control over the business. That means our corporate goals are simple — grow sustainably and profitably.

Some Backblaze Perks:

  • Competitive healthcare plans
  • Competitive compensation and 401k
  • All employees receive option grants
  • Unlimited vacation days
  • Strong coffee
  • Fully stocked Micro kitchen
  • Catered breakfast and lunches
  • Awesome people who work on awesome projects
  • New Parent Childcare bonus
  • Normal work hours
  • Get to bring your pets into the office
  • San Mateo Office — located near Caltrain and Highways 101 & 280

As our first Sales Development Representative (SDR), we are looking for someone who is organized, has high-energy and strong interpersonal communication skills. The ideal person will have a passion for sales, love to cold call and figure out new ways to get potential customers. Ideally the SDR will have 1-2 years experience working in a fast paced sales environment. We are looking for someone who knows how to manage their time and has top class communication skills. It’s critical that our SDR is able to learn quickly when using new tools.

Additional Responsibilities Include:

  • Generate qualified leads, set up demos and outbound opportunities by phone and email.
  • Work with our account managers to pass qualified leads and track in salesforce.com.
  • Report internally on prospecting performance and identify potential optimizations.
  • Continuously fine tune outbound messaging – both email and cold calls to drive results.
  • Update and leverage salesforce.com and other sales tools to better track business and drive efficiencies.

Qualifications:

  • Bachelor’s degree (B.A.)
  • Minimum of 1-2 years of sales experience.
  • Excellent written and verbal communication skills.
  • Proven ability to work in a fast-paced, dynamic and goal-oriented environment.
  • Maintain a high sense of urgency and entrepreneurial work ethic that is required to drive business outcomes, with exceptional attention to detail.
  • Positive“can do” attitude, passionate and able to show commitment.
  • Fearless yet cordial personality- not afraid to make cold calls and introductions yet personable enough to connect with potential Backblaze customers.
  • Articulate and good listening skills.
  • Ability to set and manage multiple priorities.

What’s it like working with the Sales team?

The Backblaze sales team collaborates. We help each other out by sharing ideas, templates, and our customer’s experiences. When we talk about our accomplishments, there is no “I did this,” only “we.” We are truly a team.

We are honest to each other and our customers and communicate openly. We aim to have fun by embracing crazy ideas and creative solutions. We try to think not outside the box, but with no boxes at all. Customers are the driving force behind the success of the company and we care deeply about their success.

If this all sounds like you:

  1. Send an email to jobscontact@backblaze.com with the position in the subject line.
  2. Tell us a bit about your sales experience.
  3. Include your resume.

The post Needed: Sales Development Representative! appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Needed: Associate Front End Developer

Post Syndicated from Yev original https://www.backblaze.com/blog/needed-associate-front-end-developer/

Want to work at a company that helps customers in over 150 countries around the world protect the memories they hold dear? Do you want to challenge yourself with a business that serves consumers, SMBs, Enterprise, and developers?

If all that sounds interesting, you might be interested to know that Backblaze is looking for an Associate Front End Developer!

Backblaze is a 10 year old company. Providing great customer experiences is the “secret sauce” that enables us to successfully compete against some of technology’s giants. We’ll finish the year at ~$20MM ARR and are a profitable business. This is an opportunity to have your work shine at scale in one of the fastest growing verticals in tech — Cloud Storage.

You will utilize HTML, ReactJS, CSS and jQuery to develop intuitive, elegant user experiences. As a member of our Front End Dev team, you will work closely with our web development, software design, and marketing teams.

On a day to day basis, you must be able to convert image mockups to HTML or ReactJS – There’s some production work that needs to get done. But you will also be responsible for helping build out new features, rethink old processes, and enabling third party systems to empower our marketing, sales, and support teams.

Our Associate Front End Developer must be proficient in:

  • HTML, CSS, Javascript (ES5)
  • jQuery, Bootstrap (with responsive targets)
  • Understanding of ensuring cross-browser compatibility and browser security for features
  • Basic SEO principles and ensuring that applications will adhere to them
  • Familiarity with ES2015+, ReactJS, unit testing
  • Learning about third party marketing and sales tools through reading documentation. Our systems include Google Tag Manager, Google Analytics, Salesforce, and Hubspot.
  • React Flux, Redux, SASS, Node experience is a plus

We’re looking for someone that is:

  • Passionate about building friendly, easy to use Interfaces and APIs.
  • Likes to work closely with other engineers, support, and marketing to help customers.
  • Is comfortable working independently on a mutually agreed upon prioritization queue (we don’t micromanage, we do make sure tasks are reasonably defined and scoped).
  • Diligent with quality control. Backblaze prides itself on giving our team autonomy to get work done, do the right thing for our customers, and keep a pace that is sustainable over the long run. As such, we expect everyone that checks in code that is stable. We also have a small QA team that operates as a secondary check when needed.

Backblaze Employees Have:

  • Good attitude and willingness to do whatever it takes to get the job done.
  • Strong desire to work for a small fast, paced company.
  • Desire to learn and adapt to rapidly changing technologies and work environment.
  • Comfort with well behaved pets in the office.

This position is located in San Mateo, California. Regular attendance in the office is expected.

Backblaze is an Equal Opportunity Employer and we offer competitive salary and benefits, including our no policy vacation policy.

If this sounds like you…
Send an email to: jobscontact@backblaze.com with:

  1. Associate Front End Dev in the subject line
  2. Your resume attached
  3. An overview of your relevant experience

The post Needed: Associate Front End Developer appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Wanted: Senior Systems Administrator

Post Syndicated from Yev original https://www.backblaze.com/blog/wanted-senior-systems-administrator/

Wanted: Senior Systems Administrator

We’re looking for someone who enjoys solving difficult problems, running down elusive tech gremlins, and improving our environment one server at a time. If you enjoy being stretched, learning new skills, and want to look forward to seeing your co-workers every day, then we want you!

Backblaze is a small (in headcount) cloud storage (and backup!) company with a big mission, bringing feature-rich and accessible services to the masses, even if they don’t have unlimited VC funding (because we don’t either)! We believe in a fun and positive work environment where people can learn and grow, and where a sense of community is not just a buzzword from a company handbook (though you might probably find it in there).

What You’ll Be Doing

  • Mastering your craft, becoming a subject matter expert, and acting as an escalation point for areas of expertise (this means responding to pages in your areas of ownership as well)
  • Leading projects across a range of IT operations disciplines
  • Developing a thorough understanding of the environment and the skills necessary to troubleshoot all systems and services
  • Collaborating closely with other teams (Engineering, Infrastructure, etc.) to build out new systems and improve existing ones
  • Participating in on-call rotation when necessary
  • Petting the office dogs when appropriate

What You Should Have

  • 5+ years of work as a Systems Administrator (or equivalent college degree)
  • Expert knowledge of Linux systems administration (Debian preferred)
  • Ability to work under pressure in a fast-paced startup environment
  • A passion for build and improving all manner of systems and services
  • Excellent problem solving, investigative, and troubleshooting skills
  • Strong interpersonal communication skills
  • Local enough to commute to San Mateo office

Highly Desirable Skills

  • Experience working at a technology/software startup
  • Configuration management and automation software (Ansible preferred)
  • Familiarity with server and storage system hardware and configurations
  • Understanding of Java servlet containers (Tomcat preferred)
  • Skill in administration of different software suites and cloud-based integrations (G Suite, PagerDuty, etc.)
  • Comprehension of standard web services and packages (WordPress, Apache, etc.)

Some Backblaze Perks

  • Generous healthcare plans
  • Competitive compensation and 401k
  • All employees receive Option grants
  • Unlimited vacation days
  • Strong coffee
  • Fully stocked Micro kitchens
  • Weekly catered breakfast and lunches
  • Awesome people who work on awesome projects
  • Childcare bonus (human children only)
  • Get to bring your (well behaved) pets into the office
  • Backblaze is an Equal Opportunity Employer and we offer competitive salary and benefits, including our no policy vacation policy

If this sounds like you — follow these steps:

  1. Send an email to jobscontact@backblaze.com with the position in the subject line.
  2. Include your resume.
  3. Tell us a bit about your experience and why you’re excited to work with Backblaze.

The post Wanted: Senior Systems Administrator appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Getting product security engineering right

Post Syndicated from Michal Zalewski original http://lcamtuf.blogspot.com/2018/02/getting-product-security-engineering.html

Product security is an interesting animal: it is a uniquely cross-disciplinary endeavor that spans policy, consulting,
process automation, in-depth software engineering, and cutting-edge vulnerability research. And in contrast to many
other specializations in our field of expertise – say, incident response or network security – we have virtually no
time-tested and coherent frameworks for setting it up within a company of any size.

In my previous post, I shared some thoughts
on nurturing technical organizations and cultivating the right kind of leadership within. Today, I figured it would
be fitting to follow up with several notes on what I learned about structuring product security work – and about actually
making the effort count.

The “comfort zone” trap

For security engineers, knowing your limits is a sought-after quality: there is nothing more dangerous than a security
expert who goes off script and starts dispensing authoritatively-sounding but bogus advice on a topic they know very
little about. But that same quality can be destructive when it prevents us from growing beyond our most familiar role: that of
a critic who pokes holes in other people’s designs.

The role of a resident security critic lends itself all too easily to a sense of supremacy: the mistaken
belief that our cognitive skills exceed the capabilities of the engineers and product managers who come to us for help
– and that the cool bugs we file are the ultimate proof of our special gift. We start taking pride in the mere act
of breaking somebody else’s software – and then write scathing but ineffectual critiques addressed to executives,
demanding that they either put a stop to a project or sign off on a risk. And hey, in the latter case, they better
brace for our triumphant “I told you so” at some later date.

Of course, escalations of this type have their place, but they need to be a very rare sight; when practiced routinely, they are a telltale
sign of a dysfunctional team. We might be failing to think up viable alternatives that are in tune with business or engineering needs; we might
be very unpersuasive, failing to communicate with other rational people in a language they understand; or it might be that our tolerance for risk
is badly out of whack with the rest of the company. Whatever the cause, I’ve seen high-level escalations where the security team
spoke of valiant efforts to resist inexplicably awful design decisions or data sharing setups; and where product leads in turn talked about
pressing business needs randomly blocked by obstinate security folks. Sometimes, simply having them compare their notes would be enough to arrive
at a technical solution – such as sharing a less sensitive subset of the data at hand.

To be effective, any product security program must be rooted in a partnership with the rest of the company, focused on helping them get stuff done
while eliminating or reducing security risks. To combat the toxic us-versus-them mentality, I found it helpful to have some team members with
software engineering backgrounds, even if it’s the ownership of a small open-source project or so. This can broaden our horizons, helping us see
that we all make the same mistakes – and that not every solution that sounds good on paper is usable once we code it up.

Getting off the treadmill

All security programs involve a good chunk of operational work. For product security, this can be a combination of product launch reviews, design consulting requests, incoming bug reports, or compliance-driven assessments of some sort. And curiously, such reactive work also has the property of gradually expanding to consume all the available resources on a team: next year is bound to bring even more review requests, even more regulatory hurdles, and even more incoming bugs to triage and fix.

Being more tractable, such routine tasks are also more readily enshrined in SDLs, SLAs, and all kinds of other official documents that are often mistaken for a mission statement that justifies the existence of our teams. Soon, instead of explaining to a developer why they should fix a particular problem right away, we end up pointing them to page 17 in our severity classification guideline, which defines that “severity 2” vulnerabilities need to be resolved within a month. Meanwhile, another policy may be telling them that they need to run a fuzzer or a web application scanner for a particular number of CPU-hours – no matter whether it makes sense or whether the job is set up right.

To run a product security program that scales sublinearly, stays abreast of future threats, and doesn’t erect bureaucratic speed bumps just for the sake of it, we need to recognize this inherent tendency for operational work to take over – and we need to reign it in. No matter what the last year’s policy says, we usually don’t need to be doing security reviews with a particular cadence or to a particular depth; if we need to scale them back 10% to staff a two-quarter project that fixes an important API and squashes an entire class of bugs, it’s a short-term risk we should feel empowered to take.

As noted in my earlier post, I find contingency planning to be a valuable tool in this regard: why not ask ourselves how the team would cope if the workload went up another 30%, but bad financial results precluded any team growth? It’s actually fun to think about such hypotheticals ahead of the time – and hey, if the ideas sound good, why not try them out today?

Living for a cause

It can be difficult to understand if our security efforts are structured and prioritized right; when faced with such uncertainty, it is natural to stick to the safe fundamentals – investing most of our resources into the very same things that everybody else in our industry appears to be focusing on today.

I think it’s important to combat this mindset – and if so, we might as well tackle it head on. Rather than focusing on tactical objectives and policy documents, try to write down a concise mission statement explaining why you are a team in the first place, what specific business outcomes you are aiming for, how do you prioritize it, and how you want it all to change in a year or two. It should be a fluid narrative that reads right and that everybody on your team can take pride in; my favorite way of starting the conversation is telling folks that we could always have a new VP tomorrow – and that the VP’s first order of business could be asking, “why do you have so many people here and how do I know they are doing the right thing?”. It’s a playful but realistic framing device that motivates people to get it done.

In general, a comprehensive product security program should probably start with the assumption that no matter how many resources we have at our disposal, we will never be able to stay in the loop on everything that’s happening across the company – and even if we did, we’re not going to be able to catch every single bug. It follows that one of our top priorities for the team should be making sure that bugs don’t happen very often; a scalable way of getting there is equipping engineers with intuitive and usable tools that make it easy to perform common tasks without having to worry about security at all. Examples include standardized, managed containers for production jobs; safe-by-default APIs, such as strict contextual autoescaping for XSS or type safety for SQL; security-conscious style guidelines; or plug-and-play libraries that take care of common crypto or ACL enforcement tasks.

Of course, not all problems can be addressed on framework level, and not every engineer will always reach for the right tools. Because of this, the next principle that I found to be worth focusing on is containment and mitigation: making sure that bugs are difficult to exploit when they happen, or that the damage is kept in check. The solutions in this space can range from low-level enhancements (say, hardened allocators or seccomp-bpf sandboxes) to client-facing features such as browser origin isolation or Content Security Policy.

The usual consulting, review, and outreach tasks are an important facet of a product security program, but probably shouldn’t be the sole focus of your team. It’s also best to avoid undue emphasis on vulnerability showmanship: while valuable in some contexts, it creates a hypercompetitive environment that may be hostile to less experienced team members – not to mention, squashing individual bugs offers very limited value if the same issue is likely to be reintroduced into the codebase the next day. I like to think of security reviews as a teaching opportunity instead: it’s a way to raise awareness, form partnerships with engineers, and help them develop lasting habits that reduce the incidence of bugs. Metrics to understand the impact of your work are important, too; if your engagements are seen mostly as a yet another layer of red tape, product teams will stop reaching out to you for advice.

The other tenet of a healthy product security effort requires us to recognize at a scale and given enough time, every defense mechanism is bound to fail – and so, we need ways to prevent bugs from turning into incidents. The efforts in this space may range from developing product-specific signals for the incident response and monitoring teams; to offering meaningful vulnerability reward programs and nourishing a healthy and respectful relationship with the research community; to organizing regular offensive exercises in hopes of spotting bugs before anybody else does.

Oh, one final note: an important feature of a healthy security program is the existence of multiple feedback loops that help you spot problems without the need to micromanage the organization and without being deathly afraid of taking chances. For example, the data coming from bug bounty programs, if analyzed correctly, offers a wonderful way to alert you to systemic problems in your codebase – and later on, to measure the impact of any remediation and hardening work.

Early Challenges: Making Critical Hires

Post Syndicated from Gleb Budman original https://www.backblaze.com/blog/early-challenges-making-critical-hires/

row of potential employee hires sitting waiting for an interview

In 2009, Google disclosed that they had 400 recruiters on staff working to hire nearly 10,000 people. Someday, that might be your challenge, but most companies in their early days are looking to hire a handful of people — the right people — each year. Assuming you are closer to startup stage than Google stage, let’s look at who you need to hire, when to hire them, where to find them (and how to help them find you), and how to get them to join your company.

Who Should Be Your First Hires

In later stage companies, the roles in the company have been well fleshed out, don’t change often, and each role can be segmented to focus on a specific area. A large company may have an entire department focused on just cubicle layout; at a smaller company you may not have a single person whose actual job encompasses all of facilities. At Backblaze, our CTO has a passion and knack for facilities and mostly led that charge. Also, the needs of a smaller company are quick to change. One of our first hires was a QA person, Sean, who ended up being 100% focused on data center infrastructure. In the early stage, things can shift quite a bit and you need people that are broadly capable, flexible, and most of all willing to pitch in where needed.

That said, there are times you may need an expert. At a previous company we hired Jon, a PhD in Bayesian statistics, because we needed algorithmic analysis for spam fighting. However, even that person was not only able and willing to do the math, but also code, and to not only focus on Bayesian statistics but explore a plethora of spam fighting options.

When To Hire

If you’ve raised a lot of cash and are willing to burn it with mistakes, you can guess at all the roles you might need and start hiring for them. No judgement: that’s a reasonable strategy if you’re cash-rich and time-poor.

If your cash is limited, try to see what you and your team are already doing and then hire people to take those jobs. It may sound counterintuitive, but if you’re already doing it presumably it needs to be done, you have a good sense of the type of skills required to do it, and you can bring someone on-board and get them up to speed quickly. That then frees you up to focus on tasks that can’t be done by someone else. At Backblaze, I ran marketing internally for years before hiring a VP of Marketing, making it easier for me to know what we needed. Once I was hiring, my primary goal was to find someone I could trust to take that role completely off of me so I could focus solely on my CEO duties

Where To Find the Right People

Finding great people is always difficult, particularly when the skillsets you’re looking for are highly in-demand by larger companies with lots of cash and cachet. You, however, have one massive advantage: you need to hire 5 people, not 5,000.

People You Worked With

The absolutely best people to hire are ones you’ve worked with before that you already know are good in a work situation. Consider your last job, the one before, and the one before that. A significant number of the people we recruited at Backblaze came from our previous startup MailFrontier. We knew what they could do and how they would fit into the culture, and they knew us and thus could quickly meld into the environment. If you didn’t have a previous job, consider people you went to school with or perhaps individuals with whom you’ve done projects previously.

People You Know

Hiring friends, family, and others can be risky, but should be considered. Sometimes a friend can be a “great buddy,” but is not able to do the job or isn’t a good fit for the organization. Having to let go of someone who is a friend or family member can be rough. Have the conversation up front with them about that possibility, so you have the ability to stay friends if the position doesn’t work out. Having said that, if you get along with someone as a friend, that’s one critical component of succeeding together at work. At Backblaze we’ve hired a number of people successfully that were friends of someone in the organization.

Friends Of People You Know

Your network is likely larger than you imagine. Your employees, investors, advisors, spouses, friends, and other folks all know people who might be a great fit for you. Make sure they know the roles you’re hiring for and ask them if they know anyone that would fit. Search LinkedIn for the titles you’re looking for and see who comes up; if they’re a 2nd degree connection, ask your connection for an introduction.

People You Know About

Sometimes the person you want isn’t someone anyone knows, but you may have read something they wrote, used a product they’ve built, or seen a video of a presentation they gave. Reach out. You may get a great hire: worst case, you’ll let them know they were appreciated, and make them aware of your organization.

Other Places to Find People

There are a million other places to find people, including job sites, community groups, Facebook/Twitter, GitHub, and more. Consider where the people you’re looking for are likely to congregate online and in person.

A Comment on Diversity

Hiring “People You Know” can often result in “Hiring People Like You” with the same workplace experiences, culture, background, and perceptions. Some studies have shown [1, 2, 3, 4] that homogeneous groups deliver faster, while heterogeneous groups are more creative. Also, “Hiring People Like You” often propagates the lack of women and minorities in tech and leadership positions in general. When looking for people you know, keep an eye to not discount people you know who don’t have the same cultural background as you.

Helping People To Find You

Reaching out proactively to people is the most direct way to find someone, but you want potential hires coming to you as well. To do this, they have to a) be aware of you, b) know you have a role they’re interested in, and c) think they would want to work there. Let’s tackle a) and b) first below.

Your Blog

I started writing our blog before we launched the product and talked about anything I found interesting related to our space. For several years now our team has owned the content on the blog and in 2017 over 1.5 million people read it. Each time we have a position open it’s published to the blog. If someone finds reading about backup and storage interesting, perhaps they’d want to dig in deeper from the inside. Many of the people we’ve recruited have mentioned reading the blog as either how they found us or as a factor in why they wanted to work here.
[BTW, this is Gleb’s 200th post on Backblaze’s blog. The first was in 2008. — Editor]

Your Email List

In addition to the emails our blog subscribers receive, we send regular emails to our customers, partners, and prospects. These are largely focused on content we think is directly useful or interesting for them. However, once every few months we include a small mention that we’re hiring, and the positions we’re looking for. Often a small blurb is all you need to capture people’s imaginations whether they might find the jobs interesting or can think of someone that might fit the bill.

Your Social Involvement

Whether it’s Twitter or Facebook, Hacker News or Slashdot, your potential hires are engaging in various communities. Being socially involved helps make people aware of you, reminds them of you when they’re considering a job, and paints a picture of what working with you and your company would be like. Adam was in a Reddit thread where we were discussing our Storage Pods, and that interaction was ultimately part of the reason he left Apple to come to Backblaze.

Convincing People To Join

Once you’ve found someone or they’ve found you, how do you convince them to join? They may be currently employed, have other offers, or have to relocate. Again, while the biggest companies have a number of advantages, you might have more unique advantages than you realize.

Why Should They Join You

Here are a set of items that you may be able to offer which larger organizations might not:

Role: Consider the strengths of the role. Perhaps it will have broader scope? More visibility at the executive level? No micromanagement? Ability to take risks? Option to create their own role?

Compensation: In addition to salary, will their options potentially be worth more since they’re getting in early? Can they trade-off salary for more options? Do they get option refreshes?

Benefits: In addition to healthcare, food, and 401(k) plans, are there unique benefits of your company? One company I knew took the entire team for a one-month working retreat abroad each year.

Location: Most people prefer to work close to home. If you’re located outside of the San Francisco Bay Area, you might be at a disadvantage for not being in the heart of tech. But if you find employees close to you you’ve got a huge advantage. Sometimes it’s micro; even in the Bay Area the difference of 5 miles can save 20 minutes each way every day. We located the Backblaze headquarters in San Mateo, a middle-ground that made it accessible to those coming from San Jose and San Francisco. We also chose a downtown location near a train, restaurants, and cafes: all to make it easier and more pleasant. Also, are you flexible in letting your employees work remotely? Our systems administrator Elliott is about to embark on a long-term cross-country journey working from an RV.

Environment: Open office, cubicle, cafe, work-from-home? Loud/quiet? Social or focused? 24×7 or work-life balance? Different environments appeal to different people.

Team: Who will they be working with? A company with 100,000 people might have 100 brilliant ones you’d want to work with, but ultimately we work with our core team. Who will your prospective hires be working with?

Market: Some people are passionate about gaming, others biotech, still others food. The market you’re targeting will get different people excited.

Product: Have an amazing product people love? Highlight that. If you’re lucky, your potential hire is already a fan.

Mission: Curing cancer, making people happy, and other company missions inspire people to strive to be part of the journey. Our mission is to make storing data astonishingly easy and low-cost. If you care about data, information, knowledge, and progress, our mission helps drive all of them.

Culture: I left this for last, but believe it’s the most important. What is the culture of your company? Finding people who want to work in the culture of your organization is critical. If they like the culture, they’ll fit and continue it. We’ve worked hard to build a culture that’s collaborative, friendly, supportive, and open; one in which people like coming to work. For example, the five founders started with (and still have) the same compensation and equity. That started a culture of “we’re all in this together.” Build a culture that will attract the people you want, and convey what the culture is.

Writing The Job Description

Most job descriptions focus on the all the requirements the candidate must meet. While important to communicate, the job description should first sell the job. Why would the appropriate candidate want the job? Then share some of the requirements you think are critical. Remember that people read not just what you say but how you say it. Try to write in a way that conveys what it is like to actually be at the company. Ahin, our VP of Marketing, said the job description itself was one of the things that attracted him to the company.

Orchestrating Interviews

Much can be said about interviewing well. I’m just going to say this: make sure that everyone who is interviewing knows that their job is not only to evaluate the candidate, but give them a sense of the culture, and sell them on the company. At Backblaze, we often have one person interview core prospects solely for company/culture fit.

Onboarding

Hiring success shouldn’t be defined by finding and hiring the right person, but instead by the right person being successful and happy within the organization. Ensure someone (usually their manager) provides them guidance on what they should be concentrating on doing during their first day, first week, and thereafter. Giving new employees opportunities and guidance so that they can achieve early wins and feel socially integrated into the company does wonders for bringing people on board smoothly

In Closing

Our Director of Production Systems, Chris, said to me the other day that he looks for companies where he can work on “interesting problems with nice people.” I’m hoping you’ll find your own version of that and find this post useful in looking for your early and critical hires.

Of course, I’d be remiss if I didn’t say, if you know of anyone looking for a place with “interesting problems with nice people,” Backblaze is hiring. 😉

The post Early Challenges: Making Critical Hires appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Build a Multi-Tenant Amazon EMR Cluster with Kerberos, Microsoft Active Directory Integration and EMRFS Authorization

Post Syndicated from Songzhi Liu original https://aws.amazon.com/blogs/big-data/build-a-multi-tenant-amazon-emr-cluster-with-kerberos-microsoft-active-directory-integration-and-emrfs-authorization/

One of the challenges faced by our customers—especially those in highly regulated industries—is balancing the need for security with flexibility. In this post, we cover how to enable multi-tenancy and increase security by using EMRFS (EMR File System) authorization, the Amazon S3 storage-level authorization on Amazon EMR.

Amazon EMR is an easy, fast, and scalable analytics platform enabling large-scale data processing. EMRFS authorization provides Amazon S3 storage-level authorization by configuring EMRFS with multiple IAM roles. With this functionality enabled, different users and groups can share the same cluster and assume their own IAM roles respectively.

Simply put, on Amazon EMR, we can now have an Amazon EC2 role per user assumed at run time instead of one general EC2 role at the cluster level. When the user is trying to access Amazon S3 resources, Amazon EMR evaluates against a predefined mappings list in EMRFS configurations and picks up the right role for the user.

In this post, we will discuss what EMRFS authorization is (Amazon S3 storage-level access control) and show how to configure the role mappings with detailed examples. You will then have the desired permissions in a multi-tenant environment. We also demo Amazon S3 access from HDFS command line, Apache Hive on Hue, and Apache Spark.

EMRFS authorization for Amazon S3

There are two prerequisites for using this feature:

  1. Users must be authenticated, because EMRFS needs to map the current user/group/prefix to a predefined user/group/prefix. There are several authentication options. In this post, we launch a Kerberos-enabled cluster that manages the Key Distribution Center (KDC) on the master node, and enable a one-way trust from the KDC to a Microsoft Active Directory domain.
  2. The application must support accessing Amazon S3 via Applications that have their own S3FileSystem APIs (for example, Presto) are not supported at this time.

EMRFS supports three types of mapping entries: user, group, and Amazon S3 prefix. Let’s use an example to show how this works.

Assume that you have the following three identities in your organization, and they are defined in the Active Directory:

To enable all these groups and users to share the EMR cluster, you need to define the following IAM roles:

In this case, you create a separate Amazon EC2 role that doesn’t give any permission to Amazon S3. Let’s call the role the base role (the EC2 role attached to the EMR cluster), which in this example is named EMR_EC2_RestrictedRole. Then, you define all the Amazon S3 permissions for each specific user or group in their own roles. The restricted role serves as the fallback role when the user doesn’t belong to any user/group, nor does the user try to access any listed Amazon S3 prefixes defined on the list.

Important: For all other roles, like emrfs_auth_group_role_data_eng, you need to add the base role (EMR_EC2_RestrictedRole) as the trusted entity so that it can assume other roles. See the following example:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::511586466501:role/EMR_EC2_RestrictedRole"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

The following is an example policy for the admin user role (emrfs_auth_user_role_admin_user):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": "*"
        }
    ]
}

We are assuming the admin user has access to all buckets in this example.

The following is an example policy for the data science group role (emrfs_auth_group_role_data_sci):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::emrfs-auth-data-science-bucket-demo/*",
                "arn:aws:s3:::emrfs-auth-data-science-bucket-demo"
            ],
            "Action": [
                "s3:*"
            ]
        }
    ]
}

This role grants all Amazon S3 permissions to the emrfs-auth-data-science-bucket-demo bucket and all the objects in it. Similarly, the policy for the role emrfs_auth_group_role_data_eng is shown below:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::emrfs-auth-data-engineering-bucket-demo/*",
                "arn:aws:s3:::emrfs-auth-data-engineering-bucket-demo"
            ],
            "Action": [
                "s3:*"
            ]
        }
    ]
}

Example role mappings configuration

To configure EMRFS authorization, you use EMR security configuration. Here is the configuration we use in this post

Consider the following scenario.

First, the admin user admin1 tries to log in and run a command to access Amazon S3 data through EMRFS. The first role emrfs_auth_user_role_admin_user on the mapping list, which is a user role, is mapped and picked up. Then admin1 has access to the Amazon S3 locations that are defined in this role.

Then a user from the data engineer group (grp_data_engineering) tries to access a data bucket to run some jobs. When EMRFS sees that the user is a member of the grp_data_engineering group, the group role emrfs_auth_group_role_data_eng is assumed, and the user has proper access to Amazon S3 that is defined in the emrfs_auth_group_role_data_eng role.

Next, the third user comes, who is not an admin and doesn’t belong to any of the groups. After failing evaluation of the top three entries, EMRFS evaluates whether the user is trying to access a certain Amazon S3 prefix defined in the last mapping entry. This type of mapping entry is called the prefix type. If the user is trying to access s3://emrfs-auth-default-bucket-demo/, then the prefix mapping is in effect, and the prefix role emrfs_auth_prefix_role_default_s3_prefix is assumed.

If the user is not trying to access any of the Amazon S3 paths that are defined on the list—which means it failed the evaluation of all the entries—it only has the permissions defined in the EMR_EC2RestrictedRole. This role is assumed by the EC2 instances in the cluster.

In this process, all the mappings defined are evaluated in the defined order, and the first role that is mapped is assumed, and the rest of the list is skipped.

Setting up an EMR cluster and mapping Active Directory users and groups

Now that we know how EMRFS authorization role mapping works, the next thing we need to think about is how we can use this feature in an easy and manageable way.

Active Directory setup

Many customers manage their users and groups using Microsoft Active Directory or other tools like OpenLDAP. In this post, we create the Active Directory on an Amazon EC2 instance running Windows Server and create the users and groups we will be using in the example below. After setting up Active Directory, we use the Amazon EMR Kerberos auto-join capability to establish a one-way trust from the KDC running on the EMR master node to the Active Directory domain on the EC2 instance. You can use your own directory services as long as it talks to the LDAP (Lightweight Directory Access Protocol).

To create and join Active Directory to Amazon EMR, follow the steps in the blog post Use Kerberos Authentication to Integrate Amazon EMR with Microsoft Active Directory.

After configuring Active Directory, you can create all the users and groups using the Active Directory tools and add users to appropriate groups. In this example, we created users like admin1, dataeng1, datascientist1, grp_data_engineering, and grp_data_science, and then add the users to the right groups.

Join the EMR cluster to an Active Directory domain

For clusters with Kerberos, Amazon EMR now supports automated Active Directory domain joins. You can use the security configuration to configure the one-way trust from the KDC to the Active Directory domain. You also configure the EMRFS role mappings in the same security configuration.

The following is an example of the EMR security configuration with a trusted Active Directory domain EMRKRB.TEST.COM and the EMRFS role mappings as we discussed earlier:

The EMRFS role mapping configuration is shown in this example:

We will also provide an example AWS CLI command that you can run.

Launching the EMR cluster and running the tests

Now you have configured Kerberos and EMRFS authorization for Amazon S3.

Additionally, you need to configure Hue with Active Directory using the Amazon EMR configuration API in order to log in using the AD users created before. The following is an example of Hue AD configuration.

[
  {
    "Classification":"hue-ini",
    "Properties":{

    },
    "Configurations":[
      {
        "Classification":"desktop",
        "Properties":{

        },
        "Configurations":[
          {
            "Classification":"ldap",
            "Properties":{

            },
            "Configurations":[
              {
                "Classification":"ldap_servers",
                "Properties":{

                },
                "Configurations":[
                  {
                    "Classification":"AWS",
                    "Properties":{
                      "base_dn":"DC=emrkrb,DC=test,DC=com",
                      "ldap_url":"ldap://emrkrb.test.com",
                      "search_bind_authentication":"false",
                      "bind_dn":"CN=adjoiner,CN=users,DC=emrkrb,DC=test,DC=com",
                      "bind_password":"Abc123456",
                      "create_users_on_login":"true",
                      "nt_domain":"emrkrb.test.com"
                    },
                    "Configurations":[

                    ]
                  }
                ]
              }
            ]
          },
          {
            "Classification":"auth",
            "Properties":{
              "backend":"desktop.auth.backend.LdapBackend"
            },
            "Configurations":[

            ]
          }
        ]
      }
    ]
  }

Note: In the preceding configuration JSON file, change the values as required before pasting it into the software setting section in the Amazon EMR console.

Now let’s use this configuration and the security configuration you created before to launch the cluster.

In the Amazon EMR console, choose Create cluster. Then choose Go to advanced options. On the Step1: Software and Steps page, under Edit software settings (optional), paste the configuration in the box.

The rest of the setup is the same as an ordinary cluster setup, except in the Security Options section. In Step 4: Security, under Permissions, choose Custom, and then choose the RestrictedRole that you created before.

Choose the appropriate subnets (these should meet the base requirement in order for a successful Active Directory join—see the Amazon EMR Management Guide for more details), and choose the appropriate security groups to make sure it talks to the Active Directory. Choose a key so that you can log in and configure the cluster.

Most importantly, choose the security configuration that you created earlier to enable Kerberos and EMRFS authorization for Amazon S3.

You can use the following AWS CLI command to create a cluster.

aws emr create-cluster --name "TestEMRFSAuthorization" \ 
--release-label emr-5.10.0 \ --instance-type m3.xlarge \ 
--instance-count 3 \ 
--ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,KeyName=MyEC2KeyPair \ --service-role EMR_DefaultRole \ 
--security-configuration MyKerberosConfig \ 
--configurations file://hue-config.json \
--applications Name=Hadoop Name=Hive Name=Hue Name=Spark \ 
--kerberos-attributes Realm=EC2.INTERNAL, \ KdcAdminPassword=<YourClusterKDCAdminPassword>, \ ADDomainJoinUser=<YourADUserLogonName>,ADDomainJoinPassword=<YourADUserPassword>, \ 
CrossRealmTrustPrincipalPassword=<MatchADTrustPwd>

Note: If you create the cluster using CLI, you need to save the JSON configuration for Hue into a file named hue-config.json and place it on the server where you run the CLI command.

After the cluster gets into the Waiting state, try to connect by using SSH into the cluster using the Active Directory user name and password.

ssh -l [email protected] <EMR IP or DNS name>

Quickly run two commands to show that the Active Directory join is successful:

  1. id [user name] shows the mapped AD users and groups in Linux.
  2. hdfs groups [user name] shows the mapped group in Hadoop.

Both should return the current Active Directory user and group information if the setup is correct.

Now, you can test the user mapping first. Log in with the admin1 user, and run a Hadoop list directory command:

hadoop fs -ls s3://emrfs-auth-data-science-bucket-demo/

Now switch to a user from the data engineer group.

Retry the previous command to access the admin’s bucket. It should throw an Amazon S3 Access Denied exception.

When you try listing the Amazon S3 bucket that a data engineer group member has accessed, it triggers the group mapping.

hadoop fs -ls s3://emrfs-auth-data-engineering-bucket-demo/

It successfully returns the listing results. Next we will test Apache Hive and then Apache Spark.

 

To run jobs successfully, you need to create a home directory for every user in HDFS for staging data under /user/<username>. Users can configure a step to create a home directory at cluster launch time for every user who has access to the cluster. In this example, you use Hue since Hue will create the home directory in HDFS for the user at the first login. Here Hue also needs to be integrated with the same Active Directory as explained in the example configuration described earlier.

First, log in to Hue as a data engineer user, and open a Hive Notebook in Hue. Then run a query to create a new table pointing to the data engineer bucket, s3://emrfs-auth-data-engineering-bucket-demo/table1_data_eng/.

You can see that the table was created successfully. Now try to create another table pointing to the data science group’s bucket, where the data engineer group doesn’t have access.

It failed and threw an Amazon S3 Access Denied error.

Now insert one line of data into the successfully create table.

Next, log out, switch to a data science group user, and create another table, test2_datasci_tb.

The creation is successful.

The last task is to test Spark (it requires the user directory, but Hue created one in the previous step).

Now let’s come back to the command line and run some Spark commands.

Login to the master node using the datascientist1 user:

Start the SparkSQL interactive shell by typing spark-sql, and run the show tables command. It should list the tables that you created using Hive.

As a data science group user, try select on both tables. You will find that you can only select the table defined in the location that your group has access to.

Conclusion

EMRFS authorization for Amazon S3 enables you to have multiple roles on the same cluster, providing flexibility to configure a shared cluster for different teams to achieve better efficiency. The Active Directory integration and group mapping make it much easier for you to manage your users and groups, and provides better auditability in a multi-tenant environment.


Additional Reading

If you found this post useful, be sure to check out Use Kerberos Authentication to Integrate Amazon EMR with Microsoft Active Directory and Launching and Running an Amazon EMR Cluster inside a VPC.


About the Authors

Songzhi Liu is a Big Data Consultant with AWS Professional Services. He works closely with AWS customers to provide them Big Data & Machine Learning solutions and best practices on the Amazon cloud.

 

 

 

 

Success at Apache: A Newbie’s Narrative

Post Syndicated from mikesefanov original https://yahooeng.tumblr.com/post/170536010891

yahoodevelopers:

Kuhu Shukla (bottom center) and team at the 2017 DataWorks Summit


By Kuhu Shukla

This post first appeared here on the Apache Software Foundation blog as part of ASF’s “Success at Apache” monthly blog series.

As I sit at my desk on a rather frosty morning with my coffee, looking up new JIRAs from the previous day in the Apache Tez project, I feel rather pleased. The latest community release vote is complete, the bug fixes that we so badly needed are in and the new release that we tested out internally on our many thousand strong cluster is looking good. Today I am looking at a new stack trace from a different Apache project process and it is hard to miss how much of the exceptional code I get to look at every day comes from people all around the globe. A contributor leaves a JIRA comment before he goes on to pick up his kid from soccer practice while someone else wakes up to find that her effort on a bug fix for the past two months has finally come to fruition through a binding +1.

Yahoo – which joined AOL, HuffPost, Tumblr, Engadget, and many more brands to form the Verizon subsidiary Oath last year – has been at the frontier of open source adoption and contribution since before I was in high school. So while I have no historical trajectories to share, I do have a story on how I found myself in an epic journey of migrating all of Yahoo jobs from Apache MapReduce to Apache Tez, a then-new DAG based execution engine.

Oath grid infrastructure is through and through driven by Apache technologies be it storage through HDFS, resource management through YARN, job execution frameworks with Tez and user interface engines such as Hive, Hue, Pig, Sqoop, Spark, Storm. Our grid solution is specifically tailored to Oath’s business-critical data pipeline needs using the polymorphic technologies hosted, developed and maintained by the Apache community.

On the third day of my job at Yahoo in 2015, I received a YouTube link on An Introduction to Apache Tez. I watched it carefully trying to keep up with all the questions I had and recognized a few names from my academic readings of Yarn ACM papers. I continued to ramp up on YARN and HDFS, the foundational Apache technologies Oath heavily contributes to even today. For the first few weeks I spent time picking out my favorite (necessary) mailing lists to subscribe to and getting started on setting up on a pseudo-distributed Hadoop cluster. I continued to find my footing with newbie contributions and being ever more careful with whitespaces in my patches. One thing was clear – Tez was the next big thing for us. By the time I could truly call myself a contributor in the Hadoop community nearly 80-90% of the Yahoo jobs were now running with Tez. But just like hiking up the Grand Canyon, the last 20% is where all the pain was. Being a part of the solution to this challenge was a happy prospect and thankfully contributing to Tez became a goal in my next quarter.

The next sprint planning meeting ended with me getting my first major Tez assignment – progress reporting. The progress reporting in Tez was non-existent – “Just needs an API fix,”  I thought. Like almost all bugs in this ecosystem, it was not easy. How do you define progress? How is it different for different kinds of outputs in a graph? The questions were many.

I, however, did not have to go far to get answers. The Tez community actively came to a newbie’s rescue, finding answers and posing important questions. I started attending the bi-weekly Tez community sync up calls and asking existing contributors and committers for course correction. Suddenly the team was much bigger, the goals much more chiseled. This was new to anyone like me who came from the networking industry, where the most open part of the code are the RFCs and the implementation details are often hidden. These meetings served as a clean room for our coding ideas and experiments. Ideas were shared, to the extent of which data structure we should pick and what a future user of Tez would take from it. In between the usual status updates and extensive knowledge transfers were made.

Oath uses Apache Pig and Apache Hive extensively and most of the urgent requirements and requests came from Pig and Hive developers and users. Each issue led to a community JIRA and as we started running Tez at Oath scale, new feature ideas and bugs around performance and resource utilization materialized. Every year most of the Hadoop team at Oath travels to the Hadoop Summit where we meet our cohorts from the Apache community and we stand for hours discussing the state of the art and what is next for the project. One such discussion set the course for the next year and a half for me.

We needed an innovative way to shuffle data. Frameworks like MapReduce and Tez have a shuffle phase in their processing lifecycle wherein the data from upstream producers is made available to downstream consumers. Even though Apache Tez was designed with a feature set corresponding to optimization requirements in Pig and Hive, the Shuffle Handler Service was retrofitted from MapReduce at the time of the project’s inception. With several thousands of jobs on our clusters leveraging these features in Tez, the Shuffle Handler Service became a clear performance bottleneck. So as we stood talking about our experience with Tez with our friends from the community, we decided to implement a new Shuffle Handler for Tez. All the conversation points were tracked now through an umbrella JIRA TEZ-3334 and the to-do list was long. I picked a few JIRAs and as I started reading through I realized, this is all new code I get to contribute to and review. There might be a better way to put this, but to be honest it was just a lot of fun! All the whiteboards were full, the team took walks post lunch and discussed how to go about defining the API. Countless hours were spent debugging hangs while fetching data and looking at stack traces and Wireshark captures from our test runs. Six months in and we had the feature on our sandbox clusters. There were moments ranging from sheer frustration to absolute exhilaration with high fives as we continued to address review comments and fixing big and small issues with this evolving feature.

As much as owning your code is valued everywhere in the software community, I would never go on to say “I did this!” In fact, “we did!” It is this strong sense of shared ownership and fluid team structure that makes the open source experience at Apache truly rewarding. This is just one example. A lot of the work that was done in Tez was leveraged by the Hive and Pig community and cross Apache product community interaction made the work ever more interesting and challenging. Triaging and fixing issues with the Tez rollout led us to hit a 100% migration score last year and we also rolled the Tez Shuffle Handler Service out to our research clusters. As of last year we have run around 100 million Tez DAGs with a total of 50 billion tasks over almost 38,000 nodes.

In 2018 as I move on to explore Hadoop 3.0 as our future release, I hope that if someone outside the Apache community is reading this, it will inspire and intrigue them to contribute to a project of their choice. As an astronomy aficionado, going from a newbie Apache contributor to a newbie Apache committer was very much like looking through my telescope - it has endless possibilities and challenges you to be your best.

About the Author:

Kuhu Shukla is a software engineer at Oath and did her Masters in Computer Science at North Carolina State University. She works on the Big Data Platforms team on Apache Tez, YARN and HDFS with a lot of talented Apache PMCs and Committers in Champaign, Illinois. A recent Apache Tez Committer herself she continues to contribute to YARN and HDFS and spoke at the 2017 Dataworks Hadoop Summit on “Tez Shuffle Handler: Shuffling At Scale With Apache Hadoop”. Prior to that she worked on Juniper Networks’ router and switch configuration APIs. She likes to participate in open source conferences and women in tech events. In her spare time she loves singing Indian classical and jazz, laughing, whale watching, hiking and peering through her Dobsonian telescope.