Tag Archives: bash

Get ready to write — Workers KV is now in GA!

Post Syndicated from Ashcon Partovi original https://blog.cloudflare.com/workers-kv-is-ga/

Get ready to write — Workers KV is now in GA!

Today, we’re excited to announce Workers KV is entering general availability and is ready for production use!

Get ready to write — Workers KV is now in GA!

What is Workers KV?

Workers KV is a highly distributed, eventually consistent, key-value store that spans Cloudflare’s global edge. It allows you to store billions of key-value pairs and read them with ultra-low latency anywhere in the world. Now you can build entire applications with the performance of a CDN static cache.

Why did we build it?

Workers is a platform that lets you run JavaScript on Cloudflare’s global edge of 175+ data centers. With only a few lines of code, you can route HTTP requests, modify responses, or even create new responses without an origin server.

// A Worker that handles a single redirect,
// such a humble beginning...
addEventListener("fetch", event => {
  event.respondWith(handleOneRedirect(event.request))
})

async function handleOneRedirect(request) {
  let url = new URL(request.url)
  let device = request.headers.get("CF-Device-Type")
  // If the device is mobile, add a prefix to the hostname.
  // (eg. example.com becomes mobile.example.com)
  if (device === "mobile") {
    url.hostname = "mobile." + url.hostname
    return Response.redirect(url, 302)
  }
  // Otherwise, send request to the original hostname.
  return await fetch(request)
}

Customers quickly came to us with use cases that required a way to store persistent data. Following our example above, it’s easy to handle a single redirect, but what if you want to handle billions of them? You would have to hard-code them into your Workers script, fit it all in under 1 MB, and re-deploy it every time you wanted to make a change — yikes! That’s why we built Workers KV.

// A Worker that can handle billions of redirects,
// now that's more like it!
addEventListener("fetch", event => {
  event.respondWith(handleBillionsOfRedirects(event.request))
})

async function handleBillionsOfRedirects(request) {
  let prefix = "/redirect"
  let url = new URL(request.url)
  // Check if the URL is a special redirect.
  // (eg. example.com/redirect/<random-hash>)
  if (url.pathname.startsWith(prefix)) {
    // REDIRECTS is a custom variable that you define,
    // it binds to a Workers KV "namespace." (aka. a storage bucket)
    let redirect = await REDIRECTS.get(url.pathname.replace(prefix, ""))
    if (redirect) {
      url.pathname = redirect
      return Response.redirect(url, 302)
    }
  }
  // Otherwise, send request to the original path.
  return await fetch(request)
}

With only a few changes from our previous example, we scaled from one redirect to billions − that’s just a taste of what you can build with Workers KV.

How does it work?

Distributed data stores are often modeled using the CAP Theorem, which states that distributed systems can only pick between 2 out of the 3 following guarantees:

  • Consistency – is my data the same everywhere?
  • Availability – is my data accessible all the time?
  • Partition tolerance – is my data stored in multiple locations?

Get ready to write — Workers KV is now in GA!
Diagram of the choices and tradeoffs of the CAP Theorem.

Workers KV chooses to guarantee Availability and Partition tolerance. This combination is known as eventual consistency, which presents Workers KV with two unique competitive advantages:

  • Reads are ultra fast (median of 12 ms) since its powered by our caching technology.
  • Data is available across 175+ edge data centers and resilient to regional outages.

Although, there are tradeoffs to eventual consistency. If two clients write different values to the same key at the same time, the last client to write eventually “wins” and its value becomes globally consistent. This also means that if a client writes to a key and that same client reads that same key, the values may be inconsistent for a short amount of time.

To help visualize this scenario, here’s a real-life example amongst three friends:

  • Suppose Matthew, Michelle, and Lee are planning their weekly lunch.
  • Matthew decides they’re going out for sushi.
  • Matthew tells Michelle their sushi plans, Michelle agrees.
  • Lee, not knowing the plans, tells Michelle they’re actually having pizza.

An hour later, Michelle and Lee are waiting at the pizza parlor while Matthew is sitting alone at the sushi restaurant — what went wrong? We can chalk this up to eventual consistency, because after waiting for a few minutes, Matthew looks at his updated calendar and eventually finds the new truth, they’re going out for pizza instead.

While it may take minutes in real-life, Workers KV is much faster. It can achieve global consistency in less than 60 seconds. Additionally, when a Worker writes to a key, then immediately reads that same key, it can expect the values to be consistent if both operations came from the same location.

When should I use it?

Now that you understand the benefits and tradeoffs of using eventual consistency, how do you determine if it’s the right storage solution for your application? Simply put, if you want global availability with ultra-fast reads, Workers KV is right for you.

However, if your application is frequently writing to the same key, there is an additional consideration. We call it “the Matthew question”: Are you okay with the Matthews of the world occasionally going to the wrong restaurant?

You can imagine use cases (like our redirect Worker example) where this doesn’t make any material difference. But if you decide to keep track of a user’s bank account balance, you would not want the possibility of two balances existing at once, since they could purchase something with money they’ve already spent.

What can I build with it?

Here are a few examples of applications that have been built with KV:

  • Mass redirects – handle billions of HTTP redirects.
  • User authentication – validate user requests to your API.
  • Translation keys – dynamically localize your web pages.
  • Configuration data – manage who can access your origin.
  • Step functions – sync state data between multiple APIs functions.
  • Edge file store – host large amounts of small files.

We’ve highlighted several of those use cases in our previous blog post. We also have some more in-depth code walkthroughs, including a recently published blog post on how to build an online To-do list with Workers KV.

Get ready to write — Workers KV is now in GA!

What’s new since beta?

By far, our most common request was to make it easier to write data to Workers KV. That’s why we’re releasing three new ways to make that experience even better:

1. Bulk Writes

If you want to import your existing data into Workers KV, you don’t want to go through the hassle of sending an HTTP request for every key-value pair. That’s why we added a bulk endpoint to the Cloudflare API. Now you can upload up to 10,000 pairs (up to 100 MB of data) in a single PUT request.

curl "https://api.cloudflare.com/client/v4/accounts/ \
     $ACCOUNT_ID/storage/kv/namespaces/$NAMESPACE_ID/bulk" \
  -X PUT \
  -H "X-Auth-Key: $CLOUDFLARE_AUTH_KEY" \
  -H "X-Auth-Email: $CLOUDFLARE_AUTH_EMAIL" \
  -d '[
    {"key": "built_by",    value: "kyle, alex, charlie, andrew, and brett"},
    {"key": "reviewed_by", value: "joaquin"},
    {"key": "approved_by", value: "steve"}
  ]'

Let’s walk through an example use case: you want to off-load your website translation to Workers. Since you’re reading translation keys frequently and only occasionally updating them, this application works well with the eventual consistency model of Workers KV.

In this example, we hook into Crowdin, a popular platform to manage translation data. This Worker responds to a /translate endpoint, downloads all your translation keys, and bulk writes them to Workers KV so you can read it later on our edge:

addEventListener("fetch", event => {
  if (event.request.url.pathname === "/translate") {
    event.respondWith(uploadTranslations())
  }
})

async function uploadTranslations() {
  // Ask crowdin for all of our translations.
  var response = await fetch(
    "https://api.crowdin.com/api/project" +
    "/:ci_project_id/download/all.zip?key=:ci_secret_key")
  // If crowdin is responding, parse the response into
  // a single json with all of our translations.
  if (response.ok) {
    var translations = await zipToJson(response)
    return await bulkWrite(translations)
  }
  // Return the errored response from crowdin.
  return response
}

async function bulkWrite(keyValuePairs) {
  return fetch(
    "https://api.cloudflare.com/client/v4/accounts" +
    "/:cf_account_id/storage/kv/namespaces/:cf_namespace_id/bulk",
    {
      method: "PUT",
      headers: {
        "Content-Type": "application/json",
        "X-Auth-Key": ":cf_auth_key",
        "X-Auth-Email": ":cf_email"
      },
      body: JSON.stringify(keyValuePairs)
    }
  )
}

async function zipToJson(response) {
  // ... omitted for brevity ...
  // (eg. https://stuk.github.io/jszip)
  return [
    {key: "hello.EN", value: "Hello World"},
    {key: "hello.ES", value: "Hola Mundo"}
  ]
}

Now, when you want to translate a page, all you have to do is read from Workers KV:

async function translate(keys, lang) {
  // You bind your translations namespace to the TRANSLATIONS variable.
  return Promise.all(keys.map(key => TRANSLATIONS.get(key + "." + lang)))
}

2. Expiring Keys

By default, key-value pairs stored in Workers KV last forever. However, sometimes you want your data to auto-delete after a certain amount of time. That’s why we’re introducing the expiration and expirationTtloptions for write operations.

// Key expires 60 seconds from now.
NAMESPACE.put("myKey", "myValue", {expirationTtl: 60})

// Key expires if the UNIX epoch is in the past.
NAMESPACE.put("myKey", "myValue", {expiration: 1247788800})

# You can also set keys to expire from the Cloudflare API.
curl "https://api.cloudflare.com/client/v4/accounts/ \
     $ACCOUNT_ID/storage/kv/namespaces/$NAMESPACE_ID/ \
     values/$KEY?expiration_ttl=$EXPIRATION_IN_SECONDS"
  -X PUT \
  -H "X-Auth-Key: $CLOUDFLARE_AUTH_KEY" \
  -H "X-Auth-Email: $CLOUDFLARE_AUTH_EMAIL" \
  -d "$VALUE"

Let’s say you want to block users that have been flagged as inappropriate from your website, but only for a week. With an expiring key, you can set the expire time and not have to worry about deleting it later.

In this example, we assume users and IP addresses are one of the same. If your application has authentication, you could use access tokens as the key identifier.

addEventListener("fetch", event => {
  var url = new URL(event.request.url)
  // An internal API that blocks a new user IP.
  // (eg. example.com/block/1.2.3.4)
  if (url.pathname.startsWith("/block")) {
    var ip = url.pathname.split("/").pop()
    event.respondWith(blockIp(ip))
  } else {
    // Other requests check if the IP is blocked.
   event.respondWith(handleRequest(event.request))
  }
})

async function blockIp(ip) {
  // Values are allowed to be empty in KV,
  // we don't need to store any extra information anyway.
  await BLOCKED.put(ip, "", {expirationTtl: 60*60*24*7})
  return new Response("ok")
}

async function handleRequest(request) {
  var ip = request.headers.get("CF-Connecting-IP")
  if (ip) {
    var blocked = await BLOCKED.get(ip)
    // If we detect an IP and its blocked, respond with a 403 error.
    if (blocked) {
      return new Response({status: 403, statusText: "You are blocked!"})
    }
  }
  // Otherwise, passthrough the original request.
  return fetch(request)
}

3. Larger Values

We’ve increased our size limit on values from 64 kB to 2 MB. This is quite useful if you need to store buffer-based or file data in Workers KV.

Get ready to write — Workers KV is now in GA!

Consider this scenario: you want to let your users upload their favorite GIF to their profile without having to store these GIFs as binaries in your database or managing another cloud storage bucket.

Workers KV is a great fit for this use case! You can create a Workers KV namespace for your users’ GIFs that is fast and reliable wherever your customers are located.

In this example, users upload a link to their favorite GIF, then a Worker downloads it and stores it to Workers KV.

addEventListener("fetch", event => {
  var url = event.request.url
  var arg = request.url.split("/").pop()
  // User sends a URI encoded link to the GIF they wish to upload.
  // (eg. example.com/api/upload_gif/<encoded-uri>)
  if (url.pathname.startsWith("/api/upload_gif")) {
    event.respondWith(uploadGif(arg))
    // Profile contains link to view the GIF.
    // (eg. example.com/api/view_gif/<username>)
  } else if (url.pathname.startsWith("/api/view_gif")) {
    event.respondWith(getGif(arg))
  }
})

async function uploadGif(url) {
  // Fetch the GIF from the Internet.
  var gif = await fetch(decodeURIComponent(url))
  var buffer = await gif.arrayBuffer()
  // Upload the GIF as a buffer to Workers KV.
  await GIFS.put(user.name, buffer)
  return gif
}

async function getGif(username) {
  var gif = await GIFS.get(username, "arrayBuffer")
  // If the user has set one, respond with the GIF.
  if (gif) {
    return new Response(gif, {headers: {"Content-Type": "image/gif"}})
  } else {
    return new Response({status: 404, statusText: "User has no GIF!"})
  }
}

Lastly, we want to thank all of our beta customers. It was your valuable feedback that led us to develop these changes to Workers KV. Make sure to stay in touch with us, we’re always looking ahead for what’s next and we love hearing from you!

Pricing

We’re also ready to announce our GA pricing. If you’re one of our Enterprise customers, your pricing obviously remains unchanged.

  • $0.50 / GB of data stored, 1 GB included
  • $0.50 / million reads, 10 million included
  • $5 / million write, list, and delete operations, 1 million included

During the beta period, we learned customers don’t want to just read values at our edge, they want to write values from our edge too. Since there is high demand for these edge operations, which are more costly, we have started charging non-read operations per month.

Limits

As mentioned earlier, we increased our value size limit from 64 kB to 2 MB. We’ve also removed our cap on the number of keys per namespace — it’s now unlimited. Here are our GA limits:

  • Up to 20 namespaces per account, each with unlimited keys
  • Keys of up to 512 bytes and values of up to 2 MB
  • Unlimited writes per second for different keys
  • One write per second for the same key
  • Unlimited reads per second per key

Try it out now!

Now open to all customers, you can start using Workers KV today from your Cloudflare dashboard under the Workers tab. You can also look at our updated documentation.

We’re really excited to see what you all can build with Workers KV!

Storing Encrypted Credentials In Git

Post Syndicated from Bozho original https://techblog.bozho.net/storing-encrypted-credentials-in-git/

We all know that we should not commit any passwords or keys to the repo with our code (no matter if public or private). Yet, thousands of production passwords can be found on GitHub (and probably thousands more in internal company repositories). Some have tried to fix that by removing the passwords (once they learned it’s not a good idea to store them publicly), but passwords have remained in the git history.

Knowing what not to do is the first and very important step. But how do we store production credentials. Database credentials, system secrets (e.g. for HMACs), access keys for 3rd party services like payment providers or social networks. There doesn’t seem to be an agreed upon solution.

I’ve previously argued with the 12-factor app recommendation to use environment variables – if you have a few that might be okay, but when the number of variables grow (as in any real application), it becomes impractical. And you can set environment variables via a bash script, but you’d have to store it somewhere. And in fact, even separate environment variables should be stored somewhere.

This somewhere could be a local directory (risky), a shared storage, e.g. FTP or S3 bucket with limited access, or a separate git repository. I think I prefer the git repository as it allows versioning (Note: S3 also does, but is provider-specific). So you can store all your environment-specific properties files with all their credentials and environment-specific configurations in a git repo with limited access (only Ops people). And that’s not bad, as long as it’s not the same repo as the source code.

Such a repo would look like this:

project
└─── production
|   |   application.properites
|   |   keystore.jks
└─── staging
|   |   application.properites
|   |   keystore.jks
└─── on-premise-client1
|   |   application.properites
|   |   keystore.jks
└─── on-premise-client2
|   |   application.properites
|   |   keystore.jks

Since many companies are using GitHub or BitBucket for their repositories, storing production credentials on a public provider may still be risky. That’s why it’s a good idea to encrypt the files in the repository. A good way to do it is via git-crypt. It is “transparent” encryption because it supports diff and encryption and decryption on the fly. Once you set it up, you continue working with the repo as if it’s not encrypted. There’s even a fork that works on Windows.

You simply run git-crypt init (after you’ve put the git-crypt binary on your OS Path), which generates a key. Then you specify your .gitattributes, e.g. like that:

secretfile filter=git-crypt diff=git-crypt
*.key filter=git-crypt diff=git-crypt
*.properties filter=git-crypt diff=git-crypt
*.jks filter=git-crypt diff=git-crypt

And you’re done. Well, almost. If this is a fresh repo, everything is good. If it is an existing repo, you’d have to clean up your history which contains the unencrypted files. Following these steps will get you there, with one addition – before calling git commit, you should call git-crypt status -f so that the existing files are actually encrypted.

You’re almost done. We should somehow share and backup the keys. For the sharing part, it’s not a big issue to have a team of 2-3 Ops people share the same key, but you could also use the GPG option of git-crypt (as documented in the README). What’s left is to backup your secret key (that’s generated in the .git/git-crypt directory). You can store it (password-protected) in some other storage, be it a company shared folder, Dropbox/Google Drive, or even your email. Just make sure your computer is not the only place where it’s present and that it’s protected. I don’t think key rotation is necessary, but you can devise some rotation procedure.

git-crypt authors claim to shine when it comes to encrypting just a few files in an otherwise public repo. And recommend looking at git-remote-gcrypt. But as often there are non-sensitive parts of environment-specific configurations, you may not want to encrypt everything. And I think it’s perfectly fine to use git-crypt even in a separate repo scenario. And even though encryption is an okay approach to protect credentials in your source code repo, it’s still not necessarily a good idea to have the environment configurations in the same repo. Especially given that different people/teams manage these credentials. Even in small companies, maybe not all members have production access.

The outstanding questions in this case is – how do you sync the properties with code changes. Sometimes the code adds new properties that should be reflected in the environment configurations. There are two scenarios here – first, properties that could vary across environments, but can have default values (e.g. scheduled job periods), and second, properties that require explicit configuration (e.g. database credentials). The former can have the default values bundled in the code repo and therefore in the release artifact, allowing external files to override them. The latter should be announced to the people who do the deployment so that they can set the proper values.

The whole process of having versioned environment-speific configurations is actually quite simple and logical, even with the encryption added to the picture. And I think it’s a good security practice we should try to follow.

The post Storing Encrypted Credentials In Git appeared first on Bozho's tech blog.

Security updates for Friday

Post Syndicated from ris original https://lwn.net/Articles/755667/rss

Security updates have been issued by Arch Linux (bind, libofx, and thunderbird), Debian (thunderbird, xdg-utils, and xen), Fedora (procps-ng), Mageia (gnupg2, mbedtls, pdns, and pdns-recursor), openSUSE (bash, GraphicsMagick, icu, and kernel), Oracle (thunderbird), Red Hat (java-1.7.1-ibm, java-1.8.0-ibm, and thunderbird), Scientific Linux (thunderbird), and Ubuntu (curl).

Security updates for Thursday

Post Syndicated from ris original https://lwn.net/Articles/755540/rss

Security updates have been issued by Debian (imagemagick), Fedora (curl, glibc, kernel, and thunderbird-enigmail), openSUSE (enigmail, knot, and python), Oracle (procps-ng), Red Hat (librelp, procps-ng, redhat-virtualization-host, rhev-hypervisor7, and unboundid-ldapsdk), Scientific Linux (procps-ng), SUSE (bash, ceph, icu, kvm, and qemu), and Ubuntu (procps and spice, spice-protocol).

Airbash – Fully Automated WPA PSK Handshake Capture Script

Post Syndicated from Darknet original https://www.darknet.org.uk/2018/05/airbash-fully-automated-wpa-psk-handshake-capture-script/?utm_source=rss&utm_medium=social&utm_campaign=darknetfeed

Airbash – Fully Automated WPA PSK Handshake Capture Script

Airbash is a POSIX-compliant, fully automated WPA PSK handshake capture script aimed at penetration testing. It is compatible with Bash and Android Shell (tested on Kali Linux and Cyanogenmod 10.2) and uses aircrack-ng to scan for clients that are currently connected to access points (AP).

Those clients are then deauthenticated in order to capture the handshake when attempting to reconnect to the AP. Verification of a captured handshake is done using aircrack-ng.

Read the rest of Airbash – Fully Automated WPA PSK Handshake Capture Script now! Only available at Darknet.

Bad Software Is Our Fault

Post Syndicated from Bozho original https://techblog.bozho.net/bad-software-is-our-fault/

Bad software is everywhere. One can even claim that every software is bad. Cool companies, tech giants, established companies, all produce bad software. And no, yours is not an exception.

Who’s to blame for bad software? It’s all complicated and many factors are intertwined – there’s business requirements, there’s organizational context, there’s lack of sufficient skilled developers, there’s the inherent complexity of software development, there’s leaky abstractions, reliance on 3rd party software, consequences of wrong business and purchase decisions, time limitations, flawed business analysis, etc. So yes, despite the catchy title, I’m aware it’s actually complicated.

But in every “it’s complicated” scenario, there’s always one or two factors that are decisive. All of them contribute somehow, but the major drivers are usually a handful of things. And in the case of base software, I think it’s the fault of technical people. Developers, architects, ops.

We don’t seem to care about best practices. And I’ll do some nasty generalizations here, but bear with me. We can spend hours arguing about tabs vs spaces, curly bracket on new line, git merge vs rebase, which IDE is better, which framework is better and other largely irrelevant stuff. But we tend to ignore the important aspects that span beyond the code itself. The context in which the code lives, the non-functional requirements – robustness, security, resilience, etc.

We don’t seem to get security. Even trivial stuff such as user authentication is almost always implemented wrong. These days Twitter and GitHub realized they have been logging plain-text passwords, for example, but that’s just the tip of the iceberg. Too often we ignore the security implications.

“But the business didn’t request the security features”, one may say. The business never requested 2-factor authentication, encryption at rest, PKI, secure (or any) audit trail, log masking, crypto shredding, etc., etc. Because the business doesn’t know these things – we do and we have to put them on the backlog and fight for them to be implemented. Each organization has its specifics and tech people can influence the backlog in different ways, but almost everywhere we can put things there and prioritize them.

The other aspect is testing. We should all be well aware by now that automated testing is mandatory. We have all the tools in the world for unit, functional, integration, performance and whatnot testing, and yet many software projects lack the necessary test coverage to be able to change stuff without accidentally breaking things. “But testing takes time, we don’t have it”. We are perfectly aware that testing saves time, as we’ve all had those “not again!” recurring bugs. And yet we think of all sorts of excuses – “let the QAs test it”, we have to ship that now, we’ll test it later”, “this is too trivial to be tested”, etc.

And you may say it’s not our job. We don’t define what has do be done, we just do it. We don’t define the budget, the scope, the features. We just write whatever has been decided. And that’s plain wrong. It’s not our job to make money out of our code, and it’s not our job to define what customers need, but apart from that everything is our job. The way the software is structured, the security aspects and security features, the stability of the code base, the way the software behaves in different environments. The non-functional requirements are our job, and putting them on the backlog is our job.

You’ve probably heard that every software becomes “legacy” after 6 months. And that’s because of us, our sloppiness, our inability to mitigate external factors and constraints. Too often we create a mess through “just doing our job”.

And of course that’s a generalization. I happen to know a lot of great professionals who don’t make these mistakes, who strive for excellence and implement things the right way. But our industry as a whole doesn’t. Our industry as a whole produces bad software. And it’s our fault, as developers – as the only people who know why a certain piece of software is bad.

In a talk of his, Bob Martin warns us of the risks of our sloppiness. We have been building websites so far, but we are more and more building stuff that interacts with the real world, directly and indirectly. Ultimately, lives may depend on our software (like the recent unfortunate death caused by a self-driving car). And I’ll agree with Uncle Bob that it’s high time we self-regulate as an industry, before some technically incompetent politician decides to do that.

How, I don’t know. We’ll have to think more about it. But I’m pretty sure it’s our fault that software is bad, and no amount of blaming the management, the budget, the timing, the tools or the process can eliminate our responsibility.

Why do I insist on bashing my fellow software engineers? Because if we start looking at software development with more responsibility; with the fact that if it fails, it’s our fault, then we’re more likely to get out of our current bug-ridden, security-flawed, fragile software hole and really become the experts of the future.

The post Bad Software Is Our Fault appeared first on Bozho's tech blog.

Analyze data in Amazon DynamoDB using Amazon SageMaker for real-time prediction

Post Syndicated from YongSeong Lee original https://aws.amazon.com/blogs/big-data/analyze-data-in-amazon-dynamodb-using-amazon-sagemaker-for-real-time-prediction/

Many companies across the globe use Amazon DynamoDB to store and query historical user-interaction data. DynamoDB is a fast NoSQL database used by applications that need consistent, single-digit millisecond latency.

Often, customers want to turn their valuable data in DynamoDB into insights by analyzing a copy of their table stored in Amazon S3. Doing this separates their analytical queries from their low-latency critical paths. This data can be the primary source for understanding customers’ past behavior, predicting future behavior, and generating downstream business value. Customers often turn to DynamoDB because of its great scalability and high availability. After a successful launch, many customers want to use the data in DynamoDB to predict future behaviors or provide personalized recommendations.

DynamoDB is a good fit for low-latency reads and writes, but it’s not practical to scan all data in a DynamoDB database to train a model. In this post, I demonstrate how you can use DynamoDB table data copied to Amazon S3 by AWS Data Pipeline to predict customer behavior. I also demonstrate how you can use this data to provide personalized recommendations for customers using Amazon SageMaker. You can also run ad hoc queries using Amazon Athena against the data. DynamoDB recently released on-demand backups to create full table backups with no performance impact. However, it’s not suitable for our purposes in this post, so I chose AWS Data Pipeline instead to create managed backups are accessible from other services.

To do this, I describe how to read the DynamoDB backup file format in Data Pipeline. I also describe how to convert the objects in S3 to a CSV format that Amazon SageMaker can read. In addition, I show how to schedule regular exports and transformations using Data Pipeline. The sample data used in this post is from Bank Marketing Data Set of UCI.

The solution that I describe provides the following benefits:

  • Separates analytical queries from production traffic on your DynamoDB table, preserving your DynamoDB read capacity units (RCUs) for important production requests
  • Automatically updates your model to get real-time predictions
  • Optimizes for performance (so it doesn’t compete with DynamoDB RCUs after the export) and for cost (using data you already have)
  • Makes it easier for developers of all skill levels to use Amazon SageMaker

All code and data set in this post are available in this .zip file.

Solution architecture

The following diagram shows the overall architecture of the solution.

The steps that data follows through the architecture are as follows:

  1. Data Pipeline regularly copies the full contents of a DynamoDB table as JSON into an S3
  2. Exported JSON files are converted to comma-separated value (CSV) format to use as a data source for Amazon SageMaker.
  3. Amazon SageMaker renews the model artifact and update the endpoint.
  4. The converted CSV is available for ad hoc queries with Amazon Athena.
  5. Data Pipeline controls this flow and repeats the cycle based on the schedule defined by customer requirements.

Building the auto-updating model

This section discusses details about how to read the DynamoDB exported data in Data Pipeline and build automated workflows for real-time prediction with a regularly updated model.

Download sample scripts and data

Before you begin, take the following steps:

  1. Download sample scripts in this .zip file.
  2. Unzip the src.zip file.
  3. Find the automation_script.sh file and edit it for your environment. For example, you need to replace 's3://<your bucket>/<datasource path>/' with your own S3 path to the data source for Amazon ML. In the script, the text enclosed by angle brackets—< and >—should be replaced with your own path.
  4. Upload the json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar file to your S3 path so that the ADD jar command in Apache Hive can refer to it.

For this solution, the banking.csv  should be imported into a DynamoDB table.

Export a DynamoDB table

To export the DynamoDB table to S3, open the Data Pipeline console and choose the Export DynamoDB table to S3 template. In this template, Data Pipeline creates an Amazon EMR cluster and performs an export in the EMRActivity activity. Set proper intervals for backups according to your business requirements.

One core node(m3.xlarge) provides the default capacity for the EMR cluster and should be suitable for the solution in this post. Leave the option to resize the cluster before running enabled in the TableBackupActivity activity to let Data Pipeline scale the cluster to match the table size. The process of converting to CSV format and renewing models happens in this EMR cluster.

For a more in-depth look at how to export data from DynamoDB, see Export Data from DynamoDB in the Data Pipeline documentation.

Add the script to an existing pipeline

After you export your DynamoDB table, you add an additional EMR step to EMRActivity by following these steps:

  1. Open the Data Pipeline console and choose the ID for the pipeline that you want to add the script to.
  2. For Actions, choose Edit.
  3. In the editing console, choose the Activities category and add an EMR step using the custom script downloaded in the previous section, as shown below.

Paste the following command into the new step after the data ­­upload step:

s3://#{myDDBRegion}.elasticmapreduce/libs/script-runner/script-runner.jar,s3://<your bucket name>/automation_script.sh,#{output.directoryPath},#{myDDBRegion}

The element #{output.directoryPath} references the S3 path where the data pipeline exports DynamoDB data as JSON. The path should be passed to the script as an argument.

The bash script has two goals, converting data formats and renewing the Amazon SageMaker model. Subsequent sections discuss the contents of the automation script.

Automation script: Convert JSON data to CSV with Hive

We use Apache Hive to transform the data into a new format. The Hive QL script to create an external table and transform the data is included in the custom script that you added to the Data Pipeline definition.

When you run the Hive scripts, do so with the -e option. Also, define the Hive table with the 'org.openx.data.jsonserde.JsonSerDe' row format to parse and read JSON format. The SQL creates a Hive EXTERNAL table, and it reads the DynamoDB backup data on the S3 path passed to it by Data Pipeline.

Note: You should create the table with the “EXTERNAL” keyword to avoid the backup data being accidentally deleted from S3 if you drop the table.

The full automation script for converting follows. Add your own bucket name and data source path in the highlighted areas.

#!/bin/bash
hive -e "
ADD jar s3://<your bucket name>/json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar ; 
DROP TABLE IF EXISTS blog_backup_data ;
CREATE EXTERNAL TABLE blog_backup_data (
 customer_id map<string,string>,
 age map<string,string>, job map<string,string>, 
 marital map<string,string>,education map<string,string>, 
 default map<string,string>, housing map<string,string>,
 loan map<string,string>, contact map<string,string>, 
 month map<string,string>, day_of_week map<string,string>, 
 duration map<string,string>, campaign map<string,string>,
 pdays map<string,string>, previous map<string,string>, 
 poutcome map<string,string>, emp_var_rate map<string,string>, 
 cons_price_idx map<string,string>, cons_conf_idx map<string,string>,
 euribor3m map<string,string>, nr_employed map<string,string>, 
 y map<string,string> ) 
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 
LOCATION '$1/';

INSERT OVERWRITE DIRECTORY 's3://<your bucket name>/<datasource path>/' 
SELECT concat( customer_id['s'],',', 
 age['n'],',', job['s'],',', 
 marital['s'],',', education['s'],',', default['s'],',', 
 housing['s'],',', loan['s'],',', contact['s'],',', 
 month['s'],',', day_of_week['s'],',', duration['n'],',', 
 campaign['n'],',',pdays['n'],',',previous['n'],',', 
 poutcome['s'],',', emp_var_rate['n'],',', cons_price_idx['n'],',',
 cons_conf_idx['n'],',', euribor3m['n'],',', nr_employed['n'],',', y['n'] ) 
FROM blog_backup_data
WHERE customer_id['s'] > 0 ; 

After creating an external table, you need to read data. You then use the INSERT OVERWRITE DIRECTORY ~ SELECT command to write CSV data to the S3 path that you designated as the data source for Amazon SageMaker.

Depending on your requirements, you can eliminate or process the columns in the SELECT clause in this step to optimize data analysis. For example, you might remove some columns that have unpredictable correlations with the target value because keeping the wrong columns might expose your model to “overfitting” during the training. In this post, customer_id  columns is removed. Overfitting can make your prediction weak. More information about overfitting can be found in the topic Model Fit: Underfitting vs. Overfitting in the Amazon ML documentation.

Automation script: Renew the Amazon SageMaker model

After the CSV data is replaced and ready to use, create a new model artifact for Amazon SageMaker with the updated dataset on S3.  For renewing model artifact, you must create a new training job.  Training jobs can be run using the AWS SDK ( for example, Amazon SageMaker boto3 ) or the Amazon SageMaker Python SDK that can be installed with “pip install sagemaker” command as well as the AWS CLI for Amazon SageMaker described in this post.

In addition, consider how to smoothly renew your existing model without service impact, because your model is called by applications in real time. To do this, you need to create a new endpoint configuration first and update a current endpoint with the endpoint configuration that is just created.

#!/bin/bash
## Define variable 
REGION=$2
DTTIME=`date +%Y-%m-%d-%H-%M-%S`
ROLE="<your AmazonSageMaker-ExecutionRole>" 


# Select containers image based on region.  
case "$REGION" in
"us-west-2" )
    IMAGE="174872318107.dkr.ecr.us-west-2.amazonaws.com/linear-learner:latest"
    ;;
"us-east-1" )
    IMAGE="382416733822.dkr.ecr.us-east-1.amazonaws.com/linear-learner:latest" 
    ;;
"us-east-2" )
    IMAGE="404615174143.dkr.ecr.us-east-2.amazonaws.com/linear-learner:latest" 
    ;;
"eu-west-1" )
    IMAGE="438346466558.dkr.ecr.eu-west-1.amazonaws.com/linear-learner:latest" 
    ;;
 *)
    echo "Invalid Region Name"
    exit 1 ;  
esac

# Start training job and creating model artifact 
TRAINING_JOB_NAME=TRAIN-${DTTIME} 
S3OUTPUT="s3://<your bucket name>/model/" 
INSTANCETYPE="ml.m4.xlarge"
INSTANCECOUNT=1
VOLUMESIZE=5 
aws sagemaker create-training-job --training-job-name ${TRAINING_JOB_NAME} --region ${REGION}  --algorithm-specification TrainingImage=${IMAGE},TrainingInputMode=File --role-arn ${ROLE}  --input-data-config '[{ "ChannelName": "train", "DataSource": { "S3DataSource": { "S3DataType": "S3Prefix", "S3Uri": "s3://<your bucket name>/<datasource path>/", "S3DataDistributionType": "FullyReplicated" } }, "ContentType": "text/csv", "CompressionType": "None" , "RecordWrapperType": "None"  }]'  --output-data-config S3OutputPath=${S3OUTPUT} --resource-config  InstanceType=${INSTANCETYPE},InstanceCount=${INSTANCECOUNT},VolumeSizeInGB=${VOLUMESIZE} --stopping-condition MaxRuntimeInSeconds=120 --hyper-parameters feature_dim=20,predictor_type=binary_classifier  

# Wait until job completed 
aws sagemaker wait training-job-completed-or-stopped --training-job-name ${TRAINING_JOB_NAME}  --region ${REGION}

# Get newly created model artifact and create model
MODELARTIFACT=`aws sagemaker describe-training-job --training-job-name ${TRAINING_JOB_NAME} --region ${REGION}  --query 'ModelArtifacts.S3ModelArtifacts' --output text `
MODELNAME=MODEL-${DTTIME}
aws sagemaker create-model --region ${REGION} --model-name ${MODELNAME}  --primary-container Image=${IMAGE},ModelDataUrl=${MODELARTIFACT}  --execution-role-arn ${ROLE}

# create a new endpoint configuration 
CONFIGNAME=CONFIG-${DTTIME}
aws sagemaker  create-endpoint-config --region ${REGION} --endpoint-config-name ${CONFIGNAME}  --production-variants  VariantName=Users,ModelName=${MODELNAME},InitialInstanceCount=1,InstanceType=ml.m4.xlarge

# create or update the endpoint
STATUS=`aws sagemaker describe-endpoint --endpoint-name  ServiceEndpoint --query 'EndpointStatus' --output text --region ${REGION} `
if [[ $STATUS -ne "InService" ]] ;
then
    aws sagemaker  create-endpoint --endpoint-name  ServiceEndpoint  --endpoint-config-name ${CONFIGNAME} --region ${REGION}    
else
    aws sagemaker  update-endpoint --endpoint-name  ServiceEndpoint  --endpoint-config-name ${CONFIGNAME} --region ${REGION}
fi

Grant permission

Before you execute the script, you must grant proper permission to Data Pipeline. Data Pipeline uses the DataPipelineDefaultResourceRole role by default. I added the following policy to DataPipelineDefaultResourceRole to allow Data Pipeline to create, delete, and update the Amazon SageMaker model and data source in the script.

{
 "Version": "2012-10-17",
 "Statement": [
 {
 "Effect": "Allow",
 "Action": [
 "sagemaker:CreateTrainingJob",
 "sagemaker:DescribeTrainingJob",
 "sagemaker:CreateModel",
 "sagemaker:CreateEndpointConfig",
 "sagemaker:DescribeEndpoint",
 "sagemaker:CreateEndpoint",
 "sagemaker:UpdateEndpoint",
 "iam:PassRole"
 ],
 "Resource": "*"
 }
 ]
}

Use real-time prediction

After you deploy a model into production using Amazon SageMaker hosting services, your client applications use this API to get inferences from the model hosted at the specified endpoint. This approach is useful for interactive web, mobile, or desktop applications.

Following, I provide a simple Python code example that queries against Amazon SageMaker endpoint URL with its name (“ServiceEndpoint”) and then uses them for real-time prediction.

=== Python sample for real-time prediction ===

#!/usr/bin/env python
import boto3
import json 

client = boto3.client('sagemaker-runtime', region_name ='<your region>' )
new_customer_info = '34,10,2,4,1,2,1,1,6,3,190,1,3,4,3,-1.7,94.055,-39.8,0.715,4991.6'
response = client.invoke_endpoint(
    EndpointName='ServiceEndpoint',
    Body=new_customer_info, 
    ContentType='text/csv'
)
result = json.loads(response['Body'].read().decode())
print(result)
--- output(response) ---
{u'predictions': [{u'score': 0.7528127431869507, u'predicted_label': 1.0}]}

Solution summary

The solution takes the following steps:

  1. Data Pipeline exports DynamoDB table data into S3. The original JSON data should be kept to recover the table in the rare event that this is needed. Data Pipeline then converts JSON to CSV so that Amazon SageMaker can read the data.Note: You should select only meaningful attributes when you convert CSV. For example, if you judge that the “campaign” attribute is not correlated, you can eliminate this attribute from the CSV.
  2. Train the Amazon SageMaker model with the new data source.
  3. When a new customer comes to your site, you can judge how likely it is for this customer to subscribe to your new product based on “predictedScores” provided by Amazon SageMaker.
  4. If the new user subscribes your new product, your application must update the attribute “y” to the value 1 (for yes). This updated data is provided for the next model renewal as a new data source. It serves to improve the accuracy of your prediction. With each new entry, your application can become smarter and deliver better predictions.

Running ad hoc queries using Amazon Athena

Amazon Athena is a serverless query service that makes it easy to analyze large amounts of data stored in Amazon S3 using standard SQL. Athena is useful for examining data and collecting statistics or informative summaries about data. You can also use the powerful analytic functions of Presto, as described in the topic Aggregate Functions of Presto in the Presto documentation.

With the Data Pipeline scheduled activity, recent CSV data is always located in S3 so that you can run ad hoc queries against the data using Amazon Athena. I show this with example SQL statements following. For an in-depth description of this process, see the post Interactive SQL Queries for Data in Amazon S3 on the AWS News Blog. 

Creating an Amazon Athena table and running it

Simply, you can create an EXTERNAL table for the CSV data on S3 in Amazon Athena Management Console.

=== Table Creation ===
CREATE EXTERNAL TABLE datasource (
 age int, 
 job string, 
 marital string , 
 education string, 
 default string, 
 housing string, 
 loan string, 
 contact string, 
 month string, 
 day_of_week string, 
 duration int, 
 campaign int, 
 pdays int , 
 previous int , 
 poutcome string, 
 emp_var_rate double, 
 cons_price_idx double,
 cons_conf_idx double, 
 euribor3m double, 
 nr_employed double, 
 y int 
)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',' ESCAPED BY '\\' LINES TERMINATED BY '\n' 
LOCATION 's3://<your bucket name>/<datasource path>/';

The following query calculates the correlation coefficient between the target attribute and other attributes using Amazon Athena.

=== Sample Query ===

SELECT corr(age,y) AS correlation_age_and_target, 
 corr(duration,y) AS correlation_duration_and_target, 
 corr(campaign,y) AS correlation_campaign_and_target,
 corr(contact,y) AS correlation_contact_and_target
FROM ( SELECT age , duration , campaign , y , 
 CASE WHEN contact = 'telephone' THEN 1 ELSE 0 END AS contact 
 FROM datasource 
 ) datasource ;

Conclusion

In this post, I introduce an example of how to analyze data in DynamoDB by using table data in Amazon S3 to optimize DynamoDB table read capacity. You can then use the analyzed data as a new data source to train an Amazon SageMaker model for accurate real-time prediction. In addition, you can run ad hoc queries against the data on S3 using Amazon Athena. I also present how to automate these procedures by using Data Pipeline.

You can adapt this example to your specific use case at hand, and hopefully this post helps you accelerate your development. You can find more examples and use cases for Amazon SageMaker in the video AWS 2017: Introducing Amazon SageMaker on the AWS website.

 


Additional Reading

If you found this post useful, be sure to check out Serving Real-Time Machine Learning Predictions on Amazon EMR and Analyzing Data in S3 using Amazon Athena.

 


About the Author

Yong Seong Lee is a Cloud Support Engineer for AWS Big Data Services. He is interested in every technology related to data/databases and helping customers who have difficulties in using AWS services. His motto is “Enjoy life, be curious and have maximum experience.”

 

 

Weekly roundup: Potpourri 2

Post Syndicated from Eevee original https://eev.ee/dev/2018/01/23/weekly-roundup-potpourri-2/

  • blog: I wrote a birthday post, as is tradition. I finally finished writing Game Night 2, a full month after we actually played those games.

  • art: I put together an art improvement chart for last year, after skipping doing it in July, tut tut. Kind of a weird rollercoaster!

    I worked a teeny bit on two one-off comics I guess but they aren’t reeeally getting anywhere fast. Comics are hard.

    I made a banner for Strawberry Jam 2 which I think came out fantastically!

  • games: I launched Strawberry Jam 2, a month-long February game jam about making horny games. I will probably be making a horny game for it.

  • idchoppers: I took another crack at dilation. Some meager progress, maybe. I think I’m now porting bad academic C++ to Rust to get the algorithm I want, and I can’t help but wonder if I could just make up something of my own faster than this.

  • fox flux: I did a bunch of brainstorming and consolidated a bunch of notes from like four different places, which feels like work but also feels like it doesn’t actually move the project forward.

  • anise!!: Ah, yes, this fell a bit by the wayside. Some map work, some attempts at a 3D effect for a particular thing without much luck (though I found a workaround in the last couple days).

  • computers: I relieved myself of some 200 browser tabs, which feels fantastic, though I’ve since opened like 80 more. Alas. I also tried to put together a firejail profile for running mystery games from the internet, and I got like 90% of the way there, but it turns out there’s basically no way to stop an X application from reading all keyboard input.

    (Yes, I know about that, and I tried it. Yes, that too.)

I’ve got a small pile of little projects that are vaguely urgent, so as much as I’d love to bash my head against idchoppers for a solid week, I’m gonna try to focus on getting a couple half-done things full-done. And maybe try to find time for art regularly so I don’t fall out of practice? Huff puff.

SSHfix.sh – the small tool I use to enable SSH public/private key login

Post Syndicated from Delian Delchev original http://deliantech.blogspot.com/2017/11/sshfixsh-small-tool-i-use-to-enable-ssh.html

I am just dropping that here. This is sshfix.sh – a small tool I use to enable SSH login to a remote host.

I use it the same way I use ssh:

./sshfix.sh [email protected]

The code:

#!/bin/sh
[ -f ~/.ssh/id_rsa.pub ] || ssh-keygen -t rsa -b 2048; ssh $* “(mkdir -p ~/.ssh; echo \”$(cat ~/.ssh/id_rsa.pub)\” >> ~/.ssh/authorized_keys)”

[$] Bash the kernel maintainers

Post Syndicated from corbet original https://lwn.net/Articles/738222/rss

Laurent Pinchart ran a session at the 2017 Embedded Linux Conference Europe
entitled “Bash the kernel maintainers”; the idea was to get feedback from
developers on their experience working with the kernel community. A few
days later, the Maintainers Summit held a free-flowing discussion on the
issues that were brought up in that session. Some changes may result from
this discussion, but it also showed how hard it can be to change how kernel
subsystem maintainers work.

Browser hacking for 280 character tweets

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/09/browser-hacking-for-280-character-tweets.html

Twitter has raised the limit to 280 characters for a select number of people. However, they left open a hole, allowing anybody to make large tweets with a little bit of hacking. The hacking skills needed are basic hacking skills, which I thought I’d write up in a blog post.


Specifically, the skills you will exercise are:

  • basic command-line shell
  • basic HTTP requests
  • basic browser DOM editing

The short instructions

The basic instructions were found in tweets like the following:
These instructions are clear to the average hacker, but of course, a bit difficult for those learning hacking, hence this post.

The command-line

The basics of most hacking start with knowledge of the command-line. This is the “Terminal” app under macOS or cmd.exe under Windows. Almost always when you see hacking dramatized in the movies, they are using the command-line.
In the beginning, the command-line is all computers had. To do anything on a computer, you had to type a “command” telling it what to do. What we see as the modern graphical screen is a layer on top of the command-line, one that translates clicks of the mouse into the raw commands.
On most systems, the command-line is known as “bash”. This is what you’ll find on Linux and macOS. Windows historically has had a different command-line that uses slightly different syntax, though in the last couple years, they’ve also supported “bash”. You’ll have to install it first, such as by following these instructions.
You’ll see me use command that may not be yet installed on your “bash” command-line, like nc and curl. You’ll need to run a command to install them, such as:
sudo apt-get install nc curl
The thing to remember about the command-line is that the mouse doesn’t work. You can’t click to move the cursor as you normally do in applications. That’s because the command-line predates the mouse by decades. Instead, you have to use arrow keys.
I’m not going to spend much effort discussing the command-line, as a complete explanation is beyond the scope of this document. Instead, I’m assuming the reader either already knows it, or will learn-from-example as we go along.

Web requests

The basics of how the web works are really simple. A request to a web server is just a small packet of text, such as the following, which does a search on Google for the search-term “penguin” (presumably, you are interested in knowing more about penguins):
GET /search?q=penguin HTTP/1.0
Host: www.google.com
User-Agent: human
The command we are sending to the server is GET, meaning get a page. We are accessing the URL /search, which on Google’s website, is how you do a search. We are then sending the parameter q with the value penguin. We also declare that we are using version 1.0 of the HTTP (hyper-text transfer protocol).
Following the first line there are a number of additional headers. In one header, we declare the Host name that we are accessing. Web servers can contain many different websites, with different names, so this header is usually imporant.
We also add the User-Agent header. The “user-agent” means the “browser” that you use, like Edge, Chrome, Firefox, or Safari. It allows servers to send content optimized for different browsers. Since we are sending web requests without a browser here, we are joking around saying human.
Here’s what happens when we use the nc program to send this to a google web server:
The first part is us typing, until we hit the [enter] key to create a blank line. After that point is the response from the Google server. We get back a result code (OK), followed by more headers from the server, and finally the contents of the webpage, which goes on from many screens. (We’ll talk about what web pages look like below).
Note that a lot of HTTP headers are optional and really have little influence on what’s going on. They are just junk added to web requests. For example, we see Google report a P3P header is some relic of 2002 that nobody uses anymore, as far as I can tell. Indeed, if you follow the URL in the P3P header, Google pretty much says exactly that.
I point this out because the request I show above is a simplified one. In practice, most requests contain a lot more headers, especially Cookie headers. We’ll see that later when making requests.

Using cURL instead

Sending the raw HTTP request to the server, and getting raw HTTP/HTML back, is annoying. The better way of doing this is with the tool known as cURL, or plainly, just curl. You may be familiar with the older command-line tools wget. cURL is similar, but more flexible.
To use curl for the experiment above, we’d do something like the following. We are saving the web page to “penguin.html” instead of just spewing it on the screen.
Underneath, cURL builds an HTTP header just like the one we showed above, and sends it to the server, getting the response back.

Web-pages

Now let’s talk about web pages. When you look at the web page we got back from Google while searching for “penguin”, you’ll see that it’s intimidatingly complex. I mean, it intimidates me. But it all starts from some basic principles, so we’ll look at some simpler examples.
The following is text of a simple web page:
<html>
<body>
<h1>Test</h1>
<p>This is a simple web page</p>
</body>
</html>
This is HTML, “hyper-text markup language”. As it’s name implies, we “markup” text, such as declaring the first text as a level-1 header (H1), and the following text as a paragraph (P).
In a web browser, this gets rendered as something that looks like the following. Notice how a header is formatted differently from a paragraph. Also notice that web browsers can use local files as well as make remote requests to web servers:
You can right-mouse click on the page and do a “View Source”. This will show the raw source behind the web page:
Web pages don’t just contain marked-up text. They contain two other important features, style information that dictates how things appear, and script that does all the live things that web pages do, from which we build web apps.
So let’s add a little bit of style and scripting to our web page. First, let’s view the source we’ll be adding:
In our header (H1) field, we’ve added the attribute to the markup giving this an id of mytitle. In the style section above, we give that element a color of blue, and tell it to align to the center.
Then, in our script section, we’ve told it that when somebody clicks on the element “mytitle”, it should send an “alert” message of “hello”.
This is what our web page now looks like, with the center blue title:
When we click on the title, we get a popup alert:
Thus, we see an example of the three components of a webpage: markup, style, and scripting.

Chrome developer tools

Now we go off the deep end. Right-mouse click on “Test” (not normal click, but right-button click, to pull up a menu). Select “Inspect”.
You should now get a window that looks something like the following. Chrome splits the screen in half, showing the web page on the left, and it’s debug tools on the right.
This looks similar to what “View Source” shows, but it isn’t. Instead, it’s showing how Chrome interpreted the source HTML. For example, our style/script tags should’ve been marked up with a head (header) tag. We forgot it, but Chrome adds it in anyway.
What Google is showing us is called the DOM, or document object model. It shows us all the objects that make up a web page, and how they fit together.
For example, it shows us how the style information for #mytitle is created. It first starts with the default style information for an h1 tag, and then how we’ve changed it with our style specifications.
We can edit the DOM manually. Just double click on things you want to change. For example, in this screen shot, I’ve changed the style spec from blue to red, and I’ve changed the header and paragraph test. The original file on disk hasn’t changed, but I’ve changed the DOM in memory.
This is a classic hacking technique. If you don’t like things like paywalls, for example, just right-click on the element blocking your view of the text, “Inspect” it, then delete it. (This works for some paywalls).
This edits the markup and style info, but changing the scripting stuff is a bit more complicated. To do that, click on the [Console] tab. This is the scripting console, and allows you to run code directly as part of the webpage. We are going to run code that resets what happens when we click on the title. In this case, we are simply going to change the message to “goodbye”.
Now when we click on the title, we indeed get the message:
Again, a common way to get around paywalls is to run some code like that that change which functions will be called.

Putting it all together

Now let’s put this all together in order to hack Twitter to allow us (the non-chosen) to tweet 280 characters. Review Dildog’s instructions above.
The first step is to get to Chrome Developer Tools. Dildog suggests F12. I suggest right-clicking on the Tweet button (or Reply button, as I use in my example) and doing “Inspect”, as I describe above.
You’ll now see your screen split in half, with the DOM toward the right, similar to how I describe above. However, Twitter’s app is really complex. Well, not really complex, it’s all basic stuff when you come right down to it. It’s just so much stuff — it’s a large web app with lots of parts. So we have to dive in without understanding everything that’s going on.
The Tweet/Reply button we are inspecting is going to look like this in the DOM:
The Tweet/Reply button is currently greyed out because it has the “disabled” attribute. You need to double click on it and remove that attribute. Also, in the class attribute, there is also a “disabled” part. Double-click, then click on that and removed just that disabled as well, without impacting the stuff around it. This should change the button from disabled to enabled. It won’t be greyed out, and it’ll respond when you click on it.
Now click on it. You’ll get an error message, as shown below:
What we’ve done here is bypass what’s known as client-side validation. The script in the web page prevented sending Tweets longer than 140 characters. Our editing of the DOM changed that, allowing us to send a bad request to the server. Bypassing client-side validation this way is the source of a lot of hacking.
But Twitter still does server-side validation as well. They know any client-side validation can be bypassed, and are in on the joke. They tell us hackers “You’ll have to be more clever”. So let’s be more clever.
In order to make longer 280 characters tweets work for select customers, they had to change something on the server-side. The thing they added was adding a “weighted_character_count=true” to the HTTP request. We just need to repeat the request we generated above, adding this parameter.
In theory, we can do this by fiddling with the scripting. The way Dildog describes does it a different way. He copies the request out of the browser, edits it, then send it via the command-line using curl.
We’ve used the [Elements] and [Console] tabs in Chrome’s DevTools. Now we are going to use the [Network] tab. This lists all the requests the web page has made to the server. The twitter app is constantly making requests to refresh the content of the web page. The request we made trying to do a long tweet is called “create”, and is red, because it failed.
Google Chrome gives us a number of ways to duplicate the request. The most useful is that it copies it as a full cURL command we can just paste onto the command-line. We don’t even need to know cURL, it takes care of everything for us. On Windows, since you have two command-lines, it gives you a choice to use the older Windows cmd.exe, or the newer bash.exe. I use the bash version, since I don’t know where to get the Windows command-line version of cURL.exe.
There’s a lot of going on here. The first thing to notice is the long xxxxxx strings. That’s actually not in the original screenshot. I edited the picture. That’s because these are session-cookies. If inserted them into your browser, you’d hijack my Twitter session, and be able to tweet as me (such as making Carlos Danger style tweets). Therefore, I have to remove them from the example.
At the top of the screen is the URL that we are accessing, which is https://twitter.com/i/tweet/create. Much of the rest of the screen uses the cURL -H option to add a header. These are all the HTTP headers that I describe above. Finally, at the bottom, is the –data section, which contains the data bits related to the tweet, especially the tweet itself.
We need to edit either the URL above to read https://twitter.com/i/tweet/create?weighted_character_count=true, or we need to add &weighted_character_count=true to the –data section at the bottom (either works). Remember: mouse doesn’t work on command-line, so you have to use the cursor-keys to navigate backwards in the line. Also, since the line is larger than the screen, it’s on several visual lines, even though it’s all a single line as far as the command-line is concerned.
Now just hit [return] on your keyboard, and the tweet will be sent to the server, which at the moment, works. Presto!
Twitter will either enable or disable the feature for everyone in a few weeks, at which point, this post won’t work. But the reason I’m writing this is to demonstrate the basic hacking skills. We manipulate the web pages we receive from servers, and we manipulate what’s sent back from our browser back to the server.

Easier: hack the scripting

Instead of messing with the DOM and editing the HTTP request, the better solution would be to change the scripting that does both DOM client-side validation and HTTP request generation. The only reason Dildog above didn’t do that is that it’s a lot more work trying to find where all this happens.
Others have, though. @Zemnmez did just that, though his technique works for the alternate TweetDeck client (https://tweetdeck.twitter.com) instead of the default client. Go copy his code from here, then paste it into the DevTools scripting [Console]. It’ll go in an replace some scripting functions, such like my simpler example above.
The console is showing a stream of error messages, because TweetDeck has bugs, ignore those.
Now you can effortlessly do long tweets as normal, without all the messing around I’ve spent so much text in this blog post describing.
Now, as I’ve mentioned this before, you are only editing what’s going on in the current web page. If you refresh this page, or close it, everything will be lost. You’ll have to re-open the DevTools scripting console and repaste the code. The easier way of doing this is to use the [Sources] tab instead of [Console] and use the “Snippets” feature to save this bit of code in your browser, to make it easier next time.
The even easier way is to use Chrome extensions like TamperMonkey and GreaseMonkey that’ll take care of this for you. They’ll save the script, and automatically run it when they see you open the TweetDeck webpage again.
An even easier way is to use one of the several Chrome extensions written in the past day specifically designed to bypass the 140 character limit. Since the purpose of this blog post is to show you how to tamper with your browser yourself, rather than help you with Twitter, I won’t list them.

Conclusion

Tampering with the web-page the server gives you, and the data you send back, is a basic hacker skill. In truth, there is a lot to this. You have to get comfortable with the command-line, using tools like cURL. You have to learn how HTTP requests work. You have to understand how web pages are built from markup, style, and scripting. You have to be comfortable using Chrome’s DevTools for messing around with web page elements, network requests, scripting console, and scripting sources.
So it’s rather a lot, actually.
My hope with this page is to show you a practical application of all this, without getting too bogged down in fully explaining how every bit works.

Security updates for Monday

Post Syndicated from ris original https://lwn.net/Articles/733389/rss

Security updates have been issued by Debian (freerdp, mbedtls, tiff, and tiff3), Fedora (chromium, krb5, libstaroffice, mbedtls, mingw-libidn2, mingw-openjpeg2, openjpeg2, and rubygems), Mageia (bzr, libarchive, libgcrypt, and tcpdump), openSUSE (gdk-pixbuf, libidn2, mpg123, postgresql94, postgresql96, and xen), Slackware (bash, mariadb, and tcpdump), and SUSE (evince and kernel).

Security updates for Tuesday

Post Syndicated from corbet original https://lwn.net/Articles/731678/rss

Security updates have been issued by Debian (extplorer and libraw), Fedora (mingw-libsoup, python-tablib, ruby, and subversion), Mageia (avidemux, clamav, nasm, php-pear-CAS, and shutter), Oracle (xmlsec1), Red Hat (openssl tomcat), Scientific Linux (authconfig, bash, curl, evince, firefox, freeradius, gdm gnome-session, ghostscript, git, glibc, gnutls, groovy, GStreamer, gtk-vnc, httpd, java-1.7.0-openjdk, kernel, libreoffice, libsoup, libtasn1, log4j, mariadb, mercurial, NetworkManager, openldap, openssh, pidgin, pki-core, postgresql, python, qemu-kvm, samba, spice, subversion, tcpdump, tigervnc fltk, tomcat, X.org, and xmlsec1), SUSE (git), and Ubuntu (augeas, cvs, and texlive-base).

Security updates for Wednesday

Post Syndicated from ris original https://lwn.net/Articles/730338/rss

Security updates have been issued by Mageia (atril, mpg123, perl-SOAP-Lite, and virtualbox), openSUSE (kernel and libzypp, zypper), Oracle (authconfig, bash, curl, gdm and gnome-session, ghostscript, git, glibc, gnutls, gtk-vnc, kernel, libreoffice, libtasn1, mariadb, openldap, openssh, pidgin, postgresql, python, qemu-kvm, samba, tcpdump, tigervnc and fltk, and tomcat), Red Hat (kernel, kernel-rt, openstack-neutron, and qemu-kvm), and SUSE (puppet and tcmu-runner).