Amazon EMR empowers many customers to build big data processing applications quickly and cost-effectively, using popular distributed frameworks such as Apache Spark, Apache HBase, Presto, and Apache Flink. For organizations that are crafting their analytical applications on Amazon EMR, there is a growing need to keep their data assets organized in an automated fashion. Because datasets tend to grow exponentially, using cataloging tools is essential to automating data discovery and organizing data assets.
AWS Glue Data Catalog provides this essential capability, allowing you to automatically discover and catalog metadata about your data stores in a central repository. Since Amazon EMR 5.8.0, customers have been using the AWS Glue Data Catalog as a metadata store for Apache Hive and Spark SQL applications that are running on Amazon EMR. Starting with Amazon EMR 5.10.0, you can catalog datasets using AWS Glue and run queries using Presto on Amazon EMR from the Hue (Hadoop User Experience) and Apache Zeppelin UIs.
You might wonder what scenarios warrant using Presto running on Amazon EMR and when to choose Amazon Athena (which uses Presto as the query engine under the hood). It is important to note that both are excellent tools for querying massive amounts of data and addressing different needs and use cases.
Amazon Athena provides the easiest way to run interactive queries for data in Amazon S3 without needing to set up or manage any servers. Presto running on Amazon EMR gives you much more flexibility in how you configure and run your queries, providing the ability to federate to other data sources if needed. For example, you might have a use case that requires LDAP authentication for clients such as the Presto CLI or JDBC/ODBC drivers. Or you might have a workflow where you need to join data between different systems like MySQL/Amazon Redshift/Apache Cassandra and Hive. In these examples, Presto running on Amazon EMR is the right tool to use because it can be configured to enable LDAP authentication in addition to the desired database connectors at cluster launch.
Now, let’s look at how metadata management for Presto works with AWS Glue.
Using an AWS Glue crawler to discover datasets
The AWS Glue Data Catalog is a reference to the location, schema, and runtime metrics of your datasets. To create this reference metadata, AWS Glue needs to crawl your datasets. In this exercise, we use an AWS Glue crawler to populate tables in the Data Catalog for the NYC taxi rides dataset.
The following are the steps for adding a crawler:
Sign in to the AWS Management Console, and open the AWS Glue console. In the navigation pane, choose Crawlers. Then choose Add crawler.
On the Add a data store page, specify the location of the NYC taxi rides dataset.
In the next step, choose an existing IAM role if one is available, or create a new role. Then choose Next.
On the scheduling page, for Frequency, choose Run on demand.
On the Configure the crawler’s output page, choose Add database. Specify blog-db as the database name. (You can specify a name of your choice, but be sure to choose the correct database name when running queries.)
Follow the remaining steps using the default values to create a crawler.
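If you prefer to script the console steps above, the following is a minimal sketch using boto3, assuming your AWS credentials are already configured. The crawler name, IAM role ARN, and S3 path are hypothetical placeholders (the dataset location is not spelled out in this post), so substitute your own values.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Assemble keyword arguments for the Glue CreateCrawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # No "Schedule" key is set, so the crawler runs only on demand.
    }


def create_and_start_crawler(request):
    """Create the crawler and start one run (requires AWS credentials)."""
    import boto3  # imported here so the helper above works without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical values: replace the role ARN and bucket with your own.
request = build_crawler_request(
    name="blog-taxi-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="blog-db",
    s3_path="s3://example-bucket/nyc-taxi/",
)
```

Calling create_and_start_crawler(request) would then create the crawler and trigger a single run, which corresponds to choosing Run on demand in the console.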
When the crawler displays the Ready state, navigate to Databases. (Choose blog-db from the list of databases, or search for it by specifying it as a filter, as shown in the following screenshot.) Then choose Tables. You should see the three tables created by the crawler, as follows.
After you’ve set up the Amazon EMR cluster with Presto, the AWS Glue Data Catalog is available through a default “hive” catalog. To change between the Hive and Glue metastores, you have to manually update hive.properties and restart the Presto server. Connect to the master node on your EMR cluster using SSH, and run the Presto CLI to start running queries interactively.
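As an alternative to editing hive.properties by hand, the metastore choice can be expressed as a configuration object supplied at cluster launch. The sketch below assumes the EMR configuration classification presto-connector-hive and the property hive.metastore.glue.datacatalog.enabled; treat both names as assumptions and confirm them against the EMR documentation for your release.

```python
import json


def presto_glue_configuration(enabled=True):
    """Build an EMR configuration list that toggles the Glue metastore.

    The classification and property names are assumptions to verify
    against the EMR release you are running.
    """
    return [
        {
            "Classification": "presto-connector-hive",
            "Properties": {
                "hive.metastore.glue.datacatalog.enabled": (
                    "true" if enabled else "false"
                )
            },
        }
    ]


# Print the JSON you would supply as cluster configuration at launch.
print(json.dumps(presto_glue_configuration(), indent=2))
```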
$ presto-cli --catalog hive
Begin with a simple query to sample a few rows:
presto> SELECT * FROM "blog-db".taxi LIMIT 10;
The query shows a few sample rows as follows:
Query the average fare for trips at each hour of the day and for each day of the month on the Parquet version of the taxi dataset.
presto> SELECT EXTRACT (HOUR FROM pickup_datetime) AS hour, avg(fare_amount) AS average_fare FROM "blog-db".taxi_parquet GROUP BY 1 ORDER BY 1;
The following image shows the results:
More interestingly, you can compute the number of trips that gave tips in the 10 percent, 15 percent, or higher percentage range:
presto> -- Tip Percent Category
-- Note: the ELSE labels and the source table name below are assumptions
-- filled in to make the query runnable; adjust them to your own schema.
SELECT TipPrctCtgry
, COUNT (DISTINCT TripID) TripCt
FROM
(SELECT TripID
, (CASE
WHEN fare_prct < 0.7 THEN 'FL70'
WHEN fare_prct < 0.8 THEN 'FL80'
WHEN fare_prct < 0.9 THEN 'FL90'
ELSE 'FL100'
END) FarePrctCtgry
, (CASE
WHEN tip_prct < 0.1 THEN 'TL10'
WHEN tip_prct < 0.15 THEN 'TL15'
WHEN tip_prct < 0.2 THEN 'TL20'
ELSE 'TG20'
END) TipPrctCtgry
FROM
(SELECT TripID
, (fare_amount / total_amount) as fare_prct
, (extra / total_amount) as extra_prct
, (mta_tax / total_amount) as mta_taxprct
, (tolls_amount / total_amount) as tolls_prct
, (tip_amount / total_amount) as tip_prct
, (improvement_surcharge / total_amount) as imprv_suchrgprct
FROM
(SELECT *
, (cast(pickup_longitude AS VARCHAR(100)) || '_' || cast(pickup_latitude AS VARCHAR(100))) as TripID
FROM "blog-db".taxi_parquet
WHERE total_amount > 0
) as t
) as t
) as t
GROUP BY TipPrctCtgry;
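To make the bucketing easier to follow, here is a small Python sketch of the logic that the CASE expression applies to tip_prct. The thresholds and the TL10, TL15, and TL20 labels come from the query; the label for tips of 20 percent or more ('TG20' here) is an assumption.

```python
from collections import Counter


def tip_category(tip_amount, total_amount):
    """Mirror the CASE expression that buckets a trip by tip percentage."""
    if total_amount <= 0:
        # The query filters these rows out with WHERE total_amount > 0.
        raise ValueError("total_amount must be positive")
    tip_prct = tip_amount / total_amount
    if tip_prct < 0.10:
        return "TL10"
    if tip_prct < 0.15:
        return "TL15"
    if tip_prct < 0.20:
        return "TL20"
    return "TG20"  # assumed label for tips of 20 percent or more


# Hypothetical sample rows of (tip_amount, total_amount).
trips = [(1.0, 20.0), (2.5, 20.0), (3.5, 20.0), (5.0, 20.0), (0.0, 12.0)]
counts = Counter(tip_category(tip, total) for tip, total in trips)
```

The counts object plays the role of the GROUP BY TipPrctCtgry result, except that the query counts distinct TripID values rather than rows.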
The results are as follows:
While the preceding query is running, navigate to the web interface for Presto on Amazon EMR at http://master-public-dns-name:8889/. Here you can look into the query metrics, such as active worker nodes, number of rows read per second, reserved memory, and parallelism.
Running queries in the Presto Editor on Hue
If you installed Hue with your Amazon EMR launch, you can also run queries on Hue’s Presto Editor. On the Amazon EMR Cluster console, choose Enable Web Connection, and follow the instructions to access the web interfaces for Hue and Zeppelin.
After the web connection is enabled, choose the Hue link to open the web interface. At the login screen, if you are the administrator logging in for the first time, type a user name and password to create your Hue superuser account. Then choose Create account. Otherwise, type your user name and password and choose Create account, or type the credentials provided by your administrator.
Choose the Presto Editor from the menu. You can run Presto queries against your tables in the AWS Glue Data Catalog.
Having a shared data catalog for applications on Amazon EMR alleviates a myriad of data-related challenges that organizations face today—including discovery, governance, auditability, and collaboration. In this post, we explored how the AWS Glue Data Catalog addresses discoverability and manageability for table metadata for Presto on Amazon EMR. Go ahead, give this a try, and share your experience with us!
Radhika Ravirala is a Solutions Architect at Amazon Web Services where she helps customers craft distributed big data applications on the AWS platform. Prior to her cloud journey, she worked as a software engineer and designer for technology companies in Silicon Valley. She holds an M.S. in computer science from San Jose State University.
I often encounter people experiencing frustration as they attempt to scale their e-commerce or WordPress site—particularly around the cost and complexity related to scaling. When I talk to customers about their scaling plans, they often mention phrases such as horizontal scaling and microservices, but usually people aren’t sure about how to dive in and effectively scale their sites.
Now let’s talk about different scaling options. For instance, if your current workload is in a traditional data center, you can leverage the cloud for your on-premises solution. This way you can scale to achieve greater efficiency at lower cost. It’s not necessary to set up a whole powerhouse to light a few bulbs. If your workload is already in the cloud, you can use one of the available out-of-the-box options.
Designing your API in microservices and adding horizontal scaling might seem like the best choice, unless your web application is already running in an on-premises environment and you’ll need to quickly scale it because of unexpected large spikes in web traffic.
So how do you handle this situation? Take things one step at a time when scaling, and you may find horizontal scaling isn’t the right choice after all.
For example, assume you have a tech news website where you did an early-look review of an upcoming and highly anticipated smartphone launch, and the review went viral. The review, a blog post on your website, includes both video and pictures. Comments are enabled for the post and readers can also rate it. If your website is hosted on a traditional Linux server with a LAMP stack, you may find yourself with immediate scaling problems.
Let’s dig into the details of the current scenario:
Where are images and videos stored?
How many read/write requests are received per second? Per minute?
What is the level of security required?
Are these synchronous or asynchronous requests?
We’ll also want to consider the following if your website has a transactional load like e-commerce or banking:
How is the website handling sessions?
Do you have any compliance requirements, such as the Payment Card Industry Data Security Standard (PCI DSS), if your website is using its own payment gateway?
How are you recording customer behavior data and fulfilling your analytics needs?
What are your load balancing considerations (scaling, caching, session maintenance, etc.)?
So, if we take this one step at a time:
Step 1: Ease server load. We need to quickly handle spikes in traffic generated by activity on the blog post, so let’s reduce server load by moving images and videos to a third-party content delivery network (CDN). AWS provides Amazon CloudFront as a CDN solution, which is highly scalable with built-in security to verify origin access identity and handle DDoS attacks. CloudFront can direct traffic to your on-premises or cloud-hosted server with its 113 Points of Presence (102 Edge Locations and 11 Regional Edge Caches) in 56 cities across 24 countries, which provides efficient caching.
Step 2: Reduce read load by adding more read replicas. MySQL provides mirror replication for databases. Oracle has its own plug-in for replication, and Amazon RDS provides up to five read replicas, which can span Regions. Amazon Aurora can have up to 15 read replicas, with Aurora Auto Scaling support. If a workload is highly variable, consider Amazon Aurora Serverless to achieve high efficiency and reduced cost. While most mirror technologies do asynchronous replication, Amazon RDS can provide synchronous Multi-AZ replication, which is good for disaster recovery but not for scalability. Asynchronous replication to a mirror instance means replicated data can sometimes be stale if network bandwidth is low, so you need to plan and design your application accordingly.
I recommend that you always use a read replica for any reporting needs, and try to move non-critical GET services to a read replica to reduce the load on the master database. In this case, comments associated with a blog post can be fetched from a read replica, because that feature can tolerate some delay if asynchronous replication lags.
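The read-replica recommendation can be sketched as a small router that sends lag-tolerant reads to replicas and everything else to the primary. This is only an illustration: the class name and endpoint names are hypothetical, and a real application would hand these endpoints to its database driver.

```python
import itertools


class ReplicaRouter:
    """Pick a database endpoint: replicas for lag-tolerant reads,
    the primary for writes and for reads that must be fresh."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Rotate through replicas in round-robin order.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def endpoint_for(self, sql, needs_fresh_data=False):
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and not needs_fresh_data and self._replicas is not None:
            return next(self._replicas)
        return self.primary


# Hypothetical endpoint names.
router = ReplicaRouter("primary.example", ["replica-1.example", "replica-2.example"])
comments_endpoint = router.endpoint_for("SELECT * FROM comments WHERE post_id = 42")
write_endpoint = router.endpoint_for("INSERT INTO comments (post_id, body) VALUES (42, 'Nice')")
```

Loading blog comments goes to a replica because slightly stale data is acceptable there, while the insert always goes to the primary.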
Step 3: Reduce write requests. This can be achieved by introducing a queue to process messages asynchronously. Amazon Simple Queue Service (Amazon SQS) is a highly scalable queue that can handle any kind of work-message load. You can process data, like ratings and reviews, or calculate a Deal Quality Score (DQS), using batch processing via an SQS queue. If your workload is in AWS, I recommend using a job-observer pattern by setting up Auto Scaling to automatically increase or decrease the number of batch servers, using the number of SQS messages, reported through Amazon CloudWatch, as the trigger. For on-premises workloads, you can use the AWS SDK to create an Amazon SQS queue that holds messages until they’re processed by your stack. Or you can use Amazon SNS to fan out your message processing in parallel for different purposes, like adding a watermark to an image or generating a thumbnail.
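The scaling decision at the heart of the job-observer pattern is simple arithmetic on queue depth. The sketch below is a hypothetical helper, not an AWS API; in practice the queue depth would come from a CloudWatch metric such as ApproximateNumberOfMessagesVisible, and Auto Scaling would apply the result.

```python
import math


def desired_batch_servers(queue_depth, messages_per_server,
                          min_servers=1, max_servers=10):
    """Return how many batch servers to run for a given queue backlog.

    messages_per_server is the backlog one server is expected to absorb;
    the result is clamped to the [min_servers, max_servers] range.
    """
    if messages_per_server <= 0:
        raise ValueError("messages_per_server must be positive")
    needed = math.ceil(queue_depth / messages_per_server)
    return max(min_servers, min(max_servers, needed))


# Hypothetical numbers: 250 queued messages, 100 messages per server.
servers = desired_batch_servers(queue_depth=250, messages_per_server=100)
```

Clamping to a minimum and maximum keeps at least one worker available for new messages and caps cost during extreme spikes.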
Step 4: Introduce a more robust caching engine. You can use Amazon ElastiCache for Memcached or Redis to reduce write requests. Memcached and Redis have different use cases: if you can afford to lose your cache and rebuild it from your database, use Memcached. If you are looking for more robust data persistence and complex data structures, use Redis. In AWS, these are managed services, which means AWS takes care of the operational workload for you. You can also deploy Memcached or Redis on your on-premises instances or use a hybrid approach.
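One common way to put such a cache in front of a database is the cache-aside pattern, sketched below. This is an assumption about how you might wire things together rather than something ElastiCache prescribes; a plain dictionary stands in for Memcached or Redis, and load_post is a hypothetical database lookup.

```python
class CacheAside:
    """Cache-aside lookup: check the cache first, fall back to the
    database on a miss, then store the result for later requests."""

    def __init__(self, load_from_db):
        self._load_from_db = load_from_db
        self._cache = {}  # stand-in for a Memcached or Redis client
        self.db_reads = 0

    def get(self, key):
        if key in self._cache:
            return self._cache[key]
        self.db_reads += 1
        value = self._load_from_db(key)
        self._cache[key] = value
        return value

    def invalidate(self, key):
        """Drop a cached entry after the underlying row changes."""
        self._cache.pop(key, None)


def load_post(post_id):
    # Hypothetical database lookup.
    return {"id": post_id, "title": "Early-look smartphone review"}


posts = CacheAside(load_post)
first = posts.get(1)   # miss: reads the database
second = posts.get(1)  # hit: served from the cache
```

Because the cache can always be rebuilt from the database, losing it costs only extra reads, which is the situation where the text above suggests Memcached.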
Step 5: Scale your server. If there are still issues, it’s time to scale your server. For the greatest cost-effectiveness and unlimited scalability, I suggest always using horizontal scaling. However, for use cases like databases, vertical scaling may be a better choice until you are comfortable with sharding, or you can use Amazon Aurora Serverless for variable workloads. It is wise to use Auto Scaling to manage your workload effectively for horizontal scaling. To achieve that, you also need to persist the session outside the web servers. Amazon DynamoDB can handle session persistence across instances.
If your server is on premises, consider creating a multisite architecture, which will help you achieve quick scalability as required and provide a good disaster recovery solution. You can pick and choose individual services like Amazon Route 53, AWS CloudFormation, Amazon SQS, Amazon SNS, Amazon RDS, etc. depending on your needs.
Your multisite architecture will look like the following diagram:
In this architecture, you can run your regular workload on premises, and use your AWS workload as required for scalability and disaster recovery. Using Route 53, you can direct a precise percentage of users to an AWS workload.
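With weighted routing in Route 53, each record receives a share of traffic equal to its weight divided by the sum of all weights. The following sketch only illustrates that arithmetic; the record names and weights are hypothetical.

```python
def traffic_shares(weights):
    """Convert record weights into the fraction of traffic each receives."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical split: keep most users on premises, send a tenth to AWS.
shares = traffic_shares({"on_premises": 90, "aws": 10})
```

Raising the AWS weight during a traffic spike shifts more users to the cloud environment without changing the on-premises setup.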
If you decide to move all of your workloads to AWS, the recommended multi-AZ architecture would look like the following:
In this architecture, you are using a multi-AZ distributed workload for high availability. You can have a multi-region setup and use Route 53 to distribute your workload between AWS Regions. CloudFront helps you scale and distribute static content via an S3 bucket, and DynamoDB maintains your application state so that Auto Scaling can apply horizontal scaling without loss of session data. At the database layer, RDS with multi-AZ standby provides high availability, and read replicas help achieve scalability.
This is a high-level strategy to help you think through the scalability of your workload by using AWS even if your workload is on premises and not in the cloud…yet.
I highly recommend creating a hybrid, multisite model by placing a replica of your on-premises environment in a public cloud like the AWS Cloud, and using the Amazon Route 53 DNS service and Elastic Load Balancing to route traffic between on-premises and cloud environments. AWS now supports load balancing between AWS and on-premises environments to help you scale your cloud environment quickly, whenever required, and scale it back down by applying Auto Scaling and placing a threshold on your on-premises traffic using Route 53.
Thanks to the very talented sooperdavid, creator of some of the wonderful animations known as RealLifeDoodles, Thomas Pesquet and Astro Pi Ed have been turned into one of the cutest videos on the internet.
Taking often comical video clips, those with a know-how and skill level that outweighs my own in spades add faces and emotions to inanimate objects, creating what the social media world refers to as a Real Life Doodle. From disappointed exercise balls to cannibalistic piles of leaves, these video clips are both cute and sometimes, though thankfully not always, a little heartbreaking.
Our own RealLifeDoodle
A few months back, when Programme Manager Dave Honess, better known to many as SpaceDave, sent me these Astro Pi videos to upload to YouTube, a small plan hatched in my brain. For in the midst of the video, and pointed out to me by SpaceDave – “I kind of love the way he just lets the unit drop out of shot” – was the most adorable sight as poor Ed drifted off into the great unknown of the ISS. Finding that I have this odd ability to consider many inanimate objects as ‘cute’, I wanted to see whether we could turn poor Ed into a RealLifeDoodle.
Heading to the Reddit RealLifeDoodle subreddit, I sent moderator sooperdavid a private message, asking if he’d be so kind as to bring our beloved Ed to life.
Unless you’re new to the world of the Raspberry Pi blog (in which case, welcome!), you’ll probably know about the Astro Pi Challenge. But for those who are unaware, let me break it down for you.
In 2015, two weeks before British ESA Astronaut Tim Peake journeyed to the International Space Station, two Raspberry Pis were sent up to await his arrival. Clad in 6063-grade aluminium flight cases and fitted with their own Sense HATs and camera modules, the Astro Pis Ed and Izzy were ready to receive the winning codes from school children in the UK. The following year, this time maintained by French ESA Astronaut Thomas Pesquet, children from every ESA member country got involved to send even more code to the ISS.
If you launch your startup and no one knows, did you actually launch? As mentioned in my last post, our initial launch target was to get 1,000 people to use our service. But how do you get even 1,000 people to sign up for your service when no one knows who you are?
There are a variety of methods to attract your first 1,000 customers, but launching with the press is my favorite. I’ll explain why and how to do it below.
Paths to Attract Your First 1,000 Customers
Social following: If you have a massive social following, those people are a reasonable target for what you’re offering. In particular if your relationship with them is one where they would buy something you recommend, this can be one of the easiest ways to get your initial customers. However, building this type of following is non-trivial and often is done over several years.
Paid advertising: The advantage of paid ads is you have control over when they are presented and what they say. The primary disadvantage is they tend to be expensive, especially before you have your positioning, messaging, and funnel nailed.
Viral: There are certainly examples of companies that launched with a hugely viral video, blog post, or promotion. While fantastic if it happens, even if you do everything right, the likelihood of massive virality is minuscule and the conversion rate is often low.
Press: As I said, this is my favorite. You don’t need to pay a PR agency and can go from nothing to launched in a couple weeks. Press not only provides awareness and customers, but credibility and SEO benefits as well.
How to Pitch the Press
It’s easy: Have a compelling story, find the right journalists, make their life easy, pitch, and follow up. Of course, each one of those has some nuance, so let’s dig in.
Have a compelling story
When you’ve been working for months on your startup, it’s easy to get lost in the minutiae when talking to others. Stories that a journalist will write about need to be something their readers will care about. Knowing what story to tell and how to tell it is part science and part art. Here’s how you can get there:
The basics of your story
Ask yourself the following questions, and write down the answers:
What are we doing? What product or service are we offering?
Why? What problem are we solving?
What is interesting or unique? Either about what we’re doing, how we’re doing it, or for whom we’re doing it.
“But my story isn’t that exciting”
Neither was announcing a data backup company, believe me. Look for angles that make it compelling. Here are some:
Did someone on your team do something major before? (build a successful company/product, create some innovation, market something we all know, etc.)
Do you have an interesting investor or board member?
Is there a personal story that drove you to start this company?
Are you starting it in a unique place?
Did you come upon the idea in a unique way?
Can you share something people want to know that’s not usually shared?
Are you partnered with a well-known company?
…is there something interesting/entertaining/odd/shocking/touching/etc.?
It doesn’t get much less exciting than, “We’re launching a company that will back up your data.” But there were still a lot of compelling stories:
Founded by serial entrepreneurs, bootstrapped a capital-intensive company, committed to each other for a year without salary.
Challenging the way that every backup company before was set up by not asking customers to pick and choose files to back up.
Designing our own storage system.
For the initial launch, we focused on “unlimited for $5/month” and statistics from a survey we ran with Harris Interactive that said that 94% of people did not regularly back up their data.
It’s an old adage that “Everyone has a story.” Regardless of what you’re doing, there is always something interesting to share. Dig for that.
Once you’ve captured what you think the interesting story is, you’ve got to boil it down. Yes, you need the elevator pitch, but this is shorter…it’s the headline pitch. Write the headline that you would love to see a journalist write.
Now comes the part where you have to be really honest with yourself: if you weren’t involved, would you care?
The “Techmeme Test”
One way I try to ground myself is what I call the “Techmeme Test”. Techmeme lists the top tech articles. Read the headlines. Imagine the headline you wrote in the middle of the page. If you weren’t involved, would you click on it? Is it more or less compelling than the others? Much of tech news is dominated by the largest companies. If you want to get written about, your story should be more compelling. If not, go back above and explore your story some more.
Embargoes, exclusives and calls-to-action
Journalists write about news. Thus, if you’ve already announced something and are then pitching a journalist to cover it, unless you’re giving her something significant that hasn’t been said, it’s no longer news. As a result, there are ‘embargoes’ and ‘exclusives’.
Embargoes: An embargo simply means that you are sharing news with a journalist that they need to keep private until a certain date and time.
If you’re Apple, this may be a formal and legal document. In our case, it’s as simple as saying, “Please keep embargoed until 4/13/17 at 8am California time.” in the pitch. Some sites explicitly will not keep embargoes; for example The Information will only break news. If you want to launch something later, do not share information with journalists at these sites. If you are only working with a single journalist for a story, and your announcement time is flexible, you can jointly work out a date and time to announce. However, if you have a fixed launch time or are working with a few journalists, embargoes are key.
Exclusives: An exclusive means you’re giving something specifically to that journalist. Most journalists love an exclusive as it means readers have to come to them for the story. One option is to give a journalist an exclusive on the entire story. If it is your dream journalist, this may make sense. Another option, however, is to give exclusivity on certain pieces. For example, for your launch you could give an exclusive on funding detail & a VC interview to a more finance-focused journalist and insight into the tech & a CTO interview to a more tech-focused journalist.
Call-to-Action: With our launch we gave TechCrunch, Ars Technica, and SimplyHelp URLs that gave the first few hundred of their readers access to the private beta. Once those first few hundred users from each site downloaded, the beta would be turned off.
Thus, we used a combination of embargoes, exclusives, and a call-to-action during our initial launch to be able to brief journalists on the news before it went live, give them something they could announce as exclusive, and provide a time-sensitive call-to-action to the readers so that they would actually sign up and not just read and go away.
How to Find the Most Authoritative Sites / Authors
“If a press release is published and no one sees it, was it published?” Perhaps the time existed when sending a press release out over the wire meant journalists would read it and write about it. That time has long been forgotten. Over 1,000 unread press releases are published every day. If you want your compelling story to be covered, you need to find the handful of journalists that will care.
Determine the publications
Find the publications that cover the type of story you want to share. If you’re in tech, Techmeme has a leaderboard of publications ranked by leadership and presence. This list will tell you which publications are likely to have influence. Visit the sites and see if your type of story appears on their site. But once you’ve determined the publication, do NOT send a pitch to their “[email protected]” or “[email protected]” email addresses. In all the times I’ve done that, I have never had a single response. Those email addresses are likely on every PR, press release, and spam list and unlikely to get read. Instead…
Determine the journalists
Once you’ve determined which publications cover your area, check which journalists are doing the writing. Skim the articles and search for keywords and competitor names.
Identify one primary journalist at the publication that you would love to have cover you, and secondary ones if there are a few good options. If you’re not sure which one should be the primary, consider a few tests:
Do they truly seem to care about the space?
Do they write interesting/compelling stories that ‘get it’?
Do they appear on the Techmeme leaderboard?
Do their articles get liked/tweeted/shared and commented on?
Do they have a significant social presence?
In addition to Techmeme, or if you aren’t in the tech space, Google will become a must-have tool for finding the right journalists to pitch. Below the search box you will find a number of tabs. Click on Tools and change the Any time setting to Custom range. I like to use the past six months to ensure I find authors that are actively writing about my market. I start with the All results. This will return a combination of product sites and articles depending upon your search term.
Scan for articles and click on the link to see if the article is on topic. If it is, find the author’s name. Often, if you click on the author name, it will take you to a bio page that includes their Twitter, LinkedIn, and/or Facebook profile. Many times you will find their email address in the bio. Collect all the information and add it to your outreach spreadsheet. It’s always a good idea to comment on the article to start building awareness of your name. Another good idea is to Tweet or Like the article.
Next click on the News tab and set the same search parameters. You will get a different set of results. Repeat the same steps. Between the two searches you will have a list of authors that actively write for the websites that Google considers the most authoritative on your market.
How to find the most socially shared authors
Your next step is to find the writers whose articles get shared the most socially. Go to Buzzsumo and click on the Most Shared tab. Enter search terms for your market as well as competitor names. Again I like to use the past 6 months as the time range. You will get a list of articles that have been shared the most across Facebook, LinkedIn, Twitter, Pinterest, and Google+. In addition to finding the most shared articles and their authors you can also see some of the Twitter users that shared the article. Many of those Twitter users are big influencers in your market so it’s smart to start following and interacting with them as well as the authors.
How to Find Author Email Addresses
Some journalists publish their contact info right on the stories. For those that don’t, a bit of googling will often get you the email. For example, TechCrunch wrote a story a few years ago where they published all of their email addresses, which was in response to a new service that charges a small fee to provide journalist email addresses. Sometimes their Twitter pages will link to a personal site, on which they share an email address.
Of course, all is not lost if you don’t find an email in the bio. There are two good services for finding emails, https://app.voilanorbert.com/ and https://hunter.io/. For Voila Norbert, enter the author name and the website you found their article on. The majority of the time you search for an author on a major publication, Norbert will return an accurate email address. If it doesn’t, try Hunter.io.
On Hunter.io, enter the domain name and click on Personal Only. Then scroll through the results to find the author’s email. I’ve found Norbert to be more accurate overall, but between the two you will find most major authors’ email addresses.
Email, by the way, is not necessarily the best way to engage a journalist. Many are avid Twitter users. Follow them and engage – that means read/retweet/favorite their tweets; reply to their questions, and generally be helpful BEFORE you pitch them. Later when you email them, you won’t be just a random email address.
Now that you have all these email addresses (possibly thousands if you purchased a list) – do NOT spam. It is incredibly tempting to think “I could try to figure out which of these folks would be interested, but if I just email all of them, I’ll save myself time and be more likely to get some of them to respond.” Don’t do it.
First, you’ll want to tailor your pitch to the individual. Second, it’s a small world and you’ll be known as someone who spams – reputation is golden. Also, don’t call journalists. Unless you know them or they’ve said they’re open to calls, you’re most likely to just annoy them.
Build a relationship
Play the long game. You may be focusing just on the launch and hoping to get this one story covered, but if you don’t quickly flame out, you will have many more opportunities to tell interesting stories that you’ll want the press to cover. Be honest and don’t exaggerate.
When you have 500 users it’s tempting to say, “We’ve got thousands!” Don’t. The good journalists will see through it and it’ll likely come back to bite you later. If you don’t know something, say “I don’t know but let me find out for you.” Most journalists want to write interesting stories that their readers will appreciate. Help them do that. Build deeper relationships with 5 – 10 journalists, rather than spamming thousands.
It doesn’t need to be complicated, but keep a spreadsheet that includes the name, publication, and contact info of the journalists you care about. Then, use it to keep track of who you’ve pitched, who’s responded, whether you’ve sent them the materials they need, and whether they intend to write/have written.
Make their life easy
Journalists have a million PR people emailing them, are actively engaging with readers on Twitter and in the comments, are tracking their metrics, are working their sources…and all the while needing to publish new articles. They’re busy. Make their life easy and they’re more likely to engage with yours.
Get to know them
Before sending them a pitch, know what they’ve written in the space. If you tell them how your story relates to ones they’ve written, it’ll help them put the story in context, and enable them to possibly link back to a story they wrote before.
Prepare your materials
Journalists will need somewhere to get more info (prepare a fact sheet), a URL to link to, and at least one image (ideally a few to choose from.) A fact sheet gives bite-sized snippets of information they may need about your startup or product: what it is, how big the market is, what’s the pricing, who’s on the team, etc. The URL is where their reader will get the product or more information from you. It doesn’t have to be live when you’re pitching, but you should be able to tell what the URL will be. The images are ones that they could embed in the article: a product screenshot, a CEO or team photo, an infographic. Scan the types of images included in their articles. Don’t send any of these in your pitch, but have them ready. Studies, stats, customer/partner/investor quotes are also good to have.
A pitch has to be short and compelling.
Think back to the headline you want. Is it really compelling? Can you shorten it to a subject line? Include what’s happening and when. For Mike Arrington at TechCrunch, our first subject line was “Startup doing an ‘online time machine’”. Later I would include, “launching June 6th”.
For John Timmer at Ars Technica, it was “Demographics data re: your 4/17 article”. Why? Because he wrote an article titled “WiFi popular with the young people; backups, not so much”. Since we had run a demographics survey on backups, I figured as a science editor he’d be interested in this additional data.
A few key things about the body of the email. It should be short and to the point, no more than a few sentences. Here was my actual, original pitch email to John:
We’re launching Backblaze next week which provides a Time Machine-online type of service. As part of doing some research I read your article about backups not being popular with young people and that you had wished Accenture would have given you demographics. In prep for our invite-only launch I sponsored Harris Interactive to get demographic data on who’s doing backups and if all goes well, I should have that data on Friday.
Next week starts Backup Awareness Month (and yes, probably Clean Your House Month and Brush Your Teeth Month)…but nonetheless…good time to remind readers to backup with a bit of data?
Would you be interested in seeing/talking about the data when I get it?
Would you be interested in getting a sneak peak at Backblaze? (I could give you some invite codes for your readers as well.)
CEO and Co-Founder
Automatic, Secure, High-Performance Online Backup
The Good: It said what we’re doing, why this relates to him and his readers, provides him information he had asked for in an article, ties to something timely, is clearly tailored for him, is pitched by the CEO and Co-Founder, and provides my cell.
The Bad: It’s too long.
I got better later. Here’s an example:
Subject: Does temperature affect hard drive life?
Hi Peter, there has been much debate about whether temperature affects how long a hard drive lasts. Following up on the Backblaze analyses of how long do drives last & which drives last the longest (that you wrote about) we’ve now analyzed the impact of heat on the nearly 40,000 hard drives we have and found that…
We’re going to publish the results this Monday, 5/12 at 5am California-time. Want a sneak peak of the analysis?
A common question is “When should I launch?” What day, what time? I prefer to launch on Tuesday at 8am California-time. Launching earlier in the week gives breathing room for the news to live longer. While your launch may be a single article posted and that’s that, if it ends up a larger success, earlier in the week allows other journalists (including ones who are in other countries) to build on the story. Monday announcements can be tough because the journalists generally need to have their stories finished by Friday, and while ideally everything is buttoned up beforehand, startups sometimes use the weekend as overflow before a launch.
The 8am California-time is because it allows articles to be published at the beginning of the day West Coast and around lunch-time East Coast. Later and you risk it being past publishing time for the day. We used to launch at 5am in order to be morning for the East Coast. That did not seem to have a significant benefit in coverage or impact, but it did mean that the entire internal team needed to be up at 3am or 4am. Sometimes that’s critical, but I prefer to not burn the team out when it’s not.
Finally, try to stay clear of holidays, major announcements and large conferences. If Apple is coming out with their next iPhone, many of the tech journalists will be busy at least a couple days prior and possibly a week after. Not always obvious, but if you can, find times that are otherwise going to be slow for news.
There is a fine line between persistence and annoyance. I once had a journalist write me after we had an announcement that was covered by the press, “Why didn’t you let me know?! I would have written about that!” I had sent him three emails about the upcoming announcement to which he never responded.
Ugh. However, my takeaway from this isn’t that I should send 10 emails to every journalist. It’s that sometimes these things happen.
My general rule is 3 emails. If I’ve identified a specific journalist that I think would be interested and have a pitch crafted for her, I’ll send her the email ideally 2 weeks prior to the announcement. I’ll follow-up a week later, and one more time 2 days prior. If she ever says, “I’m not interested in this topic,” I note it and don’t email her on that topic again.
If a journalist wrote, I read the article and engage in the comments (or someone on our team, such as our social guy, @YevP does). We’ll often promote the story through our social channels and email our employees who may choose to share the story as well. This helps us, but also helps the journalist get their story broader reach. Again, the goal is to build a relationship with the journalists in your space. If there’s something relevant to your customers that the journalist wrote, you’re providing a service to your customers AND helping the journalist get the word out about the article.
At times the stories also end up shared on sites such as Hacker News, Reddit, Slashdot, or become active conversations on Twitter. Again, we try to engage there and respond to questions (when we do, we are always clear that we’re from Backblaze.)
And finally, I’ll often send a short thank you to the journalist.
Getting Your First 1,000 Customers With Press
As I mentioned at the beginning, there is more than one way to get your first 1,000 customers. My favorite is working with the press to share your story. If you figure out your compelling story, find the right journalists, make their life easy, pitch and follow-up, you stand a high likelihood of getting coverage and customers. Better yet, that coverage will provide credibility for your company, and if done right, will establish you as a resource for the press for the future.
Like any muscle, this process takes working out. The first time may feel a bit daunting, but just take the steps one at a time. As you do this a few times, the process will get easier, and you’ll know who to reach out to and be able to quickly determine what stories will be compelling.
There are two opposing models of how the Internet has changed protest movements. The first is that the Internet has made protesters mightier than ever. This comes from the successful revolutions in Tunisia (2010-11), Egypt (2011), and Ukraine (2013). The second is that it has made them more ineffectual. Derided as “slacktivism” or “clicktivism,” the ease of action without commitment can result in movements like Occupy petering out in the US without any obvious effects. Of course, the reality is more nuanced, and Zeynep Tufekci teases that out in her new book Twitter and Tear Gas.
Tufekci is a rare interdisciplinary figure. As a sociologist, programmer, and ethnographer, she studies how technology shapes society and drives social change. She has a dual appointment in both the School of Information Science and the Department of Sociology at University of North Carolina at Chapel Hill, and is a Faculty Associate at the Berkman Klein Center for Internet and Society at Harvard University. Her regular New York Times column on the social impacts of technology is a must-read.
Modern Internet-fueled protest movements are the subjects of Twitter and Tear Gas. As an observer, writer, and participant, Tufekci examines how modern protest movements have been changed by the Internet — and what that means for protests going forward. Her book combines her own ethnographic research and her usual deft analysis, with the research of others and some big data analysis from social media outlets. The result is a book that is both insightful and entertaining, and whose lessons are much broader than the book’s central topic.
“The Power and Fragility of Networked Protest” is the book’s subtitle. The power of the Internet as a tool for protest is obvious: it gives people newfound abilities to quickly organize and scale. But, according to Tufekci, it’s a mistake to judge modern protests using the same criteria we used to judge pre-Internet protests. The 1963 March on Washington might have culminated in hundreds of thousands of people listening to Martin Luther King Jr. deliver his “I Have a Dream” speech, but it was the culmination of a multi-year protest effort and the result of six months of careful planning made possible by that sustained effort. The 2011 protests in Cairo came together in mere days because they could be loosely coordinated on Facebook and Twitter.
That’s the power. Tufekci describes the fragility by analogy. Nepalese Sherpas assist Mt. Everest climbers by carrying supplies, laying out ropes and ladders, and so on. This means that people with limited training and experience can make the ascent, which is no less dangerous — to sometimes disastrous results. Says Tufekci: “The Internet similarly allows networked movements to grow dramatically and rapidly, but without prior building of formal or informal organizational and other collective capacities that could prepare them for the inevitable challenges they will face and give them the ability to respond to what comes next.” That makes them less able to respond to government counters, change their tactics — a phenomenon Tufekci calls “tactical freeze” — make movement-wide decisions, and survive over the long haul.
Tufekci isn’t arguing that modern protests are necessarily less effective, but that they’re different. Effective movements need to understand these differences, and leverage these new advantages while minimizing the disadvantages.
To that end, she develops a taxonomy for talking about social movements. Protests are an example of a “signal” that corresponds to one of several underlying “capacities.” There’s narrative capacity: the ability to change the conversation, as Black Lives Matter did with police violence and Occupy did with wealth inequality. There’s disruptive capacity: the ability to stop business as usual. An early Internet example is the 1999 WTO protests in Seattle. And finally, there’s electoral or institutional capacity: the ability to vote, lobby, fund raise, and so on. Because of various “affordances” of modern Internet technologies, particularly social media, the same signal — a protest of a given size — reflects different underlying capacities.
This taxonomy also informs government reactions to protest movements. Smart responses target attention as a resource. The Chinese government responded to 2015 protesters in Hong Kong by not engaging with them at all, denying them camera-phone videos that would go viral and attract the world’s attention. Instead, they pulled their police back and waited for the movement to die from lack of attention.
If this all sounds dry and academic, it’s not. Twitter and Tear Gas is infused with a richness of detail stemming from her personal participation in the 2013 Gezi Park protests in Turkey, as well as personal on-the-ground interviews with protesters throughout the Middle East (particularly Egypt and her native Turkey), Zapatistas in Mexico, WTO protesters in Seattle, Occupy participants worldwide, and others. Tufekci writes with a warmth and respect for the humans that are part of these powerful social movements, gently intertwining her own story with the stories of others, big data, and theory. She is adept at writing for a general audience, and despite being published by the intimidating Yale University Press, her book is more mass-market than academic. What rigor is there is presented in a way that carries readers along rather than distracting.
The synthesist in me wishes Tufekci would take some additional steps, taking the trends she describes outside of the narrow world of political protest and applying them more broadly to social change. Her taxonomy is an important contribution to the more-general discussion of how the Internet affects society. Furthermore, her insights on the networked public sphere have applications for understanding technology-driven social change in general. These are hard conversations for society to have. We largely prefer to allow technology to blindly steer society or, in some ways worse, leave it to unfettered for-profit corporations. When you’re reading Twitter and Tear Gas, keep current and near-term future technological issues such as ubiquitous surveillance, algorithmic discrimination, and automation and employment in mind. You’ll come away with new insights.
Tufekci twice quotes historian Melvin Kranzberg from 1985: “Technology is neither good nor bad; nor is it neutral.” This foreshadows her central message. For better or worse, the technologies that power the networked public sphere have changed the nature of political protest as well as government reactions to and suppressions of such protest.
I have long characterized our technological future as a battle between the quick and the strong. The quick — dissidents, hackers, criminals, marginalized groups — are the first to make use of a new technology to magnify their power. The strong are slower, but have more raw power to magnify. So while protesters are the first to use Facebook to organize, the governments eventually figure out how to use Facebook to track protesters. It’s still an open question who will gain the upper hand in the long term, but Tufekci’s book helps us understand the dynamics at work.