Tag Archives: HoC

How to retain system tables’ data spanning multiple Amazon Redshift clusters and run cross-cluster diagnostic queries

Post Syndicated from Karthik Sonti original https://aws.amazon.com/blogs/big-data/how-to-retain-system-tables-data-spanning-multiple-amazon-redshift-clusters-and-run-cross-cluster-diagnostic-queries/

Amazon Redshift is a data warehouse service that logs the history of the system in STL log tables. The STL log tables manage disk space by retaining only two to five days of log history, depending on log usage and available disk space.

To retain STL tables’ data for an extended period, you usually have to create a replica table for every system table. Then, for each you load the data from the system table into the replica at regular intervals. By maintaining replica tables for STL tables, you can run diagnostic queries on historical data from the STL tables. You then can derive insights from query execution times, query plans, and disk-spill patterns, and make better cluster-sizing decisions. However, refreshing replica tables with live data from STL tables at regular intervals requires schedulers such as Cron or AWS Data Pipeline. Also, these tables are specific to one cluster and they are not accessible after the cluster is terminated. This is especially true for transient Amazon Redshift clusters that last for only a finite period of ad hoc query execution.

In this blog post, I present a solution that exports system tables from multiple Amazon Redshift clusters into an Amazon S3 bucket. This solution is serverless, and you can schedule it as frequently as every five minutes. The AWS CloudFormation deployment template that I provide automates the solution setup in your environment. The system tables’ data in the Amazon S3 bucket is partitioned by cluster name and query execution date to enable efficient joins in cross-cluster diagnostic queries.

I also provide another CloudFormation template later in this post. This second template helps to automate the creation of tables in the AWS Glue Data Catalog for the system tables’ data stored in Amazon S3. After the system tables are exported to Amazon S3, you can run cross-cluster diagnostic queries on the system tables’ data and derive insights about query executions in each Amazon Redshift cluster. You can do this using Amazon QuickSight, Amazon Athena, Amazon EMR, or Amazon Redshift Spectrum.

You can find all the code examples in this post, including the CloudFormation templates, AWS Glue extract, transform, and load (ETL) scripts, and the resolution steps for common errors you might encounter in this GitHub repository.

Solution overview

The solution in this post uses AWS Glue to export system tables’ log data from Amazon Redshift clusters into Amazon S3. The AWS Glue ETL jobs are invoked at a scheduled interval by AWS Lambda. AWS Systems Manager, which provides secure, hierarchical storage for configuration data management and secrets management, maintains the details of Amazon Redshift clusters for which the solution is enabled. The last-fetched time stamp values for the respective cluster-table combination are maintained in an Amazon DynamoDB table.

The following diagram covers the key steps involved in this solution.

The solution as illustrated in the preceding diagram flows like this:

  1. The Lambda function, invoke_rs_stl_export_etl, is triggered at regular intervals, as controlled by Amazon CloudWatch. It’s triggered to look up the AWS Systems Manager parameter store to get the details of the Amazon Redshift clusters for which the system table export is enabled.
  2. The same Lambda function, based on the Amazon Redshift cluster details obtained in step 1, invokes the AWS Glue ETL job designated for the Amazon Redshift cluster. If an ETL job for the cluster is not found, the Lambda function creates one.
  3. The ETL job invoked for the Amazon Redshift cluster gets the cluster credentials from the parameter store. It gets from the DynamoDB table the last exported time stamp of when each of the system tables was exported from the respective Amazon Redshift cluster.
  4. The ETL job unloads the system tables’ data from the Amazon Redshift cluster into an Amazon S3 bucket.
  5. The ETL job updates the DynamoDB table with the last exported time stamp value for each system table exported from the Amazon Redshift cluster.
  6. The Amazon Redshift cluster system tables’ data is available in Amazon S3 and is partitioned by cluster name and date for running cross-cluster diagnostic queries.

Understanding the configuration data

This solution uses AWS Systems Manager parameter store to store the Amazon Redshift cluster credentials securely. The parameter store also securely stores other configuration information that the AWS Glue ETL job needs for extracting and storing system tables’ data in Amazon S3. Systems Manager comes with a default AWS Key Management Service (AWS KMS) key that it uses to encrypt the password component of the Amazon Redshift cluster credentials.

The following table explains the global parameters and cluster-specific parameters required in this solution. The global parameters are defined once and applicable at the overall solution level. The cluster-specific parameters are specific to an Amazon Redshift cluster and repeat for each cluster for which you enable this post’s solution. The CloudFormation template explained later in this post creates these parameters as part of the deployment process.

Parameter name Type Description
Global parametersdefined once and applied to all jobs
redshift_query_logs.global.s3_prefix String The Amazon S3 path where the query logs are exported. Under this path, each exported table is partitioned by cluster name and date.
redshift_query_logs.global.tempdir String The Amazon S3 path that AWS Glue ETL jobs use for temporarily staging the data.
redshift_query_logs.global.role> String The name of the role that the AWS Glue ETL jobs assume. Just the role name is sufficient. The complete Amazon Resource Name (ARN) is not required.
redshift_query_logs.global.enabled_cluster_list StringList A comma-separated list of cluster names for which system tables’ data export is enabled. This gives flexibility for a user to exclude certain clusters.
Cluster-specific parametersfor each cluster specified in the enabled_cluster_list parameter
redshift_query_logs.<<cluster_name>>.connection String The name of the AWS Glue Data Catalog connection to the Amazon Redshift cluster. For example, if the cluster name is product_warehouse, the entry is redshift_query_logs.product_warehouse.connection.
redshift_query_logs.<<cluster_name>>.user String The user name that AWS Glue uses to connect to the Amazon Redshift cluster.
redshift_query_logs.<<cluster_name>>.password Secure String The password that AWS Glue uses to connect the Amazon Redshift cluster’s encrypted-by key that is managed in AWS KMS.

For example, suppose that you have two Amazon Redshift clusters, product-warehouse and category-management, for which the solution described in this post is enabled. In this case, the parameters shown in the following screenshot are created by the solution deployment CloudFormation template in the AWS Systems Manager parameter store.

Solution deployment

To make it easier for you to get started, I created a CloudFormation template that automatically configures and deploys the solution—only one step is required after deployment.

Prerequisites

To deploy the solution, you must have one or more Amazon Redshift clusters in a private subnet. This subnet must have a network address translation (NAT) gateway or a NAT instance configured, and also a security group with a self-referencing inbound rule for all TCP ports. For more information about why AWS Glue ETL needs the configuration it does, described previously, see Connecting to a JDBC Data Store in a VPC in the AWS Glue documentation.

To start the deployment, launch the CloudFormation template:

CloudFormation stack parameters

The following table lists and describes the parameters for deploying the solution to export query logs from multiple Amazon Redshift clusters.

Property Default Description
S3Bucket mybucket The bucket this solution uses to store the exported query logs, stage code artifacts, and perform unloads from Amazon Redshift. For example, the mybucket/extract_rs_logs/data bucket is used for storing all the exported query logs for each system table partitioned by the cluster. The mybucket/extract_rs_logs/temp/ bucket is used for temporarily staging the unloaded data from Amazon Redshift. The mybucket/extract_rs_logs/code bucket is used for storing all the code artifacts required for Lambda and the AWS Glue ETL jobs.
ExportEnabledRedshiftClusters Requires Input A comma-separated list of cluster names from which the system table logs need to be exported.
DataStoreSecurityGroups Requires Input A list of security groups with an inbound rule to the Amazon Redshift clusters provided in the parameter, ExportEnabledClusters. These security groups should also have a self-referencing inbound rule on all TCP ports, as explained on Connecting to a JDBC Data Store in a VPC.

After you launch the template and create the stack, you see that the following resources have been created:

  1. AWS Glue connections for each Amazon Redshift cluster you provided in the CloudFormation stack parameter, ExportEnabledRedshiftClusters.
  2. All parameters required for this solution created in the parameter store.
  3. The Lambda function that invokes the AWS Glue ETL jobs for each configured Amazon Redshift cluster at a regular interval of five minutes.
  4. The DynamoDB table that captures the last exported time stamps for each exported cluster-table combination.
  5. The AWS Glue ETL jobs to export query logs from each Amazon Redshift cluster provided in the CloudFormation stack parameter, ExportEnabledRedshiftClusters.
  6. The IAM roles and policies required for the Lambda function and AWS Glue ETL jobs.

After the deployment

For each Amazon Redshift cluster for which you enabled the solution through the CloudFormation stack parameter, ExportEnabledRedshiftClusters, the automated deployment includes temporary credentials that you must update after the deployment:

  1. Go to the parameter store.
  2. Note the parameters <<cluster_name>>.user and redshift_query_logs.<<cluster_name>>.password that correspond to each Amazon Redshift cluster for which you enabled this solution. Edit these parameters to replace the placeholder values with the right credentials.

For example, if product-warehouse is one of the clusters for which you enabled system table export, you edit these two parameters with the right user name and password and choose Save parameter.

Querying the exported system tables

Within a few minutes after the solution deployment, you should see Amazon Redshift query logs being exported to the Amazon S3 location, <<S3Bucket_you_provided>>/extract_redshift_query_logs/data/. In that bucket, you should see the eight system tables partitioned by customer name and date: stl_alert_event_log, stl_dlltext, stl_explain, stl_query, stl_querytext, stl_scan, stl_utilitytext, and stl_wlm_query.

To run cross-cluster diagnostic queries on the exported system tables, create external tables in the AWS Glue Data Catalog. To make it easier for you to get started, I provide a CloudFormation template that creates an AWS Glue crawler, which crawls the exported system tables stored in Amazon S3 and builds the external tables in the AWS Glue Data Catalog.

Launch this CloudFormation template to create external tables that correspond to the Amazon Redshift system tables. S3Bucket is the only input parameter required for this stack deployment. Provide the same Amazon S3 bucket name where the system tables’ data is being exported. After you successfully create the stack, you can see the eight tables in the database, redshift_query_logs_db, as shown in the following screenshot.

Now, navigate to the Athena console to run cross-cluster diagnostic queries. The following screenshot shows a diagnostic query executed in Athena that retrieves query alerts logged across multiple Amazon Redshift clusters.

You can build the following example Amazon QuickSight dashboard by running cross-cluster diagnostic queries on Athena to identify the hourly query count and the key query alert events across multiple Amazon Redshift clusters.

How to extend the solution

You can extend this post’s solution in two ways:

  • Add any new Amazon Redshift clusters that you spin up after you deploy the solution.
  • Add other system tables or custom query results to the list of exports from an Amazon Redshift cluster.

Extend the solution to other Amazon Redshift clusters

To extend the solution to more Amazon Redshift clusters, add the three cluster-specific parameters in the AWS Systems Manager parameter store following the guidelines earlier in this post. Modify the redshift_query_logs.global.enabled_cluster_list parameter to append the new cluster to the comma-separated string.

Extend the solution to add other tables or custom queries to an Amazon Redshift cluster

The current solution ships with the export functionality for the following Amazon Redshift system tables:

  • stl_alert_event_log
  • stl_dlltext
  • stl_explain
  • stl_query
  • stl_querytext
  • stl_scan
  • stl_utilitytext
  • stl_wlm_query

You can easily add another system table or custom query by adding a few lines of code to the AWS Glue ETL job, <<cluster-name>_extract_rs_query_logs. For example, suppose that from the product-warehouse Amazon Redshift cluster you want to export orders greater than $2,000. To do so, add the following five lines of code to the AWS Glue ETL job product-warehouse_extract_rs_query_logs, where product-warehouse is your cluster name:

  1. Get the last-processed time-stamp value. The function creates a value if it doesn’t already exist.

salesLastProcessTSValue = functions.getLastProcessedTSValue(trackingEntry=”mydb.sales_2000",job_configs=job_configs)

  1. Run the custom query with the time stamp.

returnDF=functions.runQuery(query="select * from sales s join order o where o.order_amnt > 2000 and sale_timestamp > '{}'".format (salesLastProcessTSValue) ,tableName="mydb.sales_2000",job_configs=job_configs)

  1. Save the results to Amazon S3.

functions.saveToS3(dataframe=returnDF,s3Prefix=s3Prefix,tableName="mydb.sales_2000",partitionColumns=["sale_date"],job_configs=job_configs)

  1. Get the latest time-stamp value from the returned data frame in Step 2.

latestTimestampVal=functions.getMaxValue(returnDF,"sale_timestamp",job_configs)

  1. Update the last-processed time-stamp value in the DynamoDB table.

functions.updateLastProcessedTSValue(“mydb.sales_2000",latestTimestampVal[0],job_configs)

Conclusion

In this post, I demonstrate a serverless solution to retain the system tables’ log data across multiple Amazon Redshift clusters. By using this solution, you can incrementally export the data from system tables into Amazon S3. By performing this export, you can build cross-cluster diagnostic queries, build audit dashboards, and derive insights into capacity planning by using services such as Athena. I also demonstrate how you can extend this solution to other ad hoc query use cases or tables other than system tables by adding a few lines of code.


Additional Reading

If you found this post useful, be sure to check out Using Amazon Redshift Spectrum, Amazon Athena, and AWS Glue with Node.js in Production and Amazon Redshift – 2017 Recap.


About the Author

Karthik Sonti is a senior big data architect at Amazon Web Services. He helps AWS customers build big data and analytical solutions and provides guidance on architecture and best practices.

 

 

 

 

Here, have some videos!

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/easter-monday-2018/

Today is Easter Monday and as such, the drawbridge is up at Pi Towers. So while we spend time with familytoo much chocolate…family and chocolate, here are some great Pi-themed videos from members of our community. Enjoy!

Eggies live stream!

Bluebird Birdhouse

Raspberry Pi and NoIR camera installed in roof of Bluebird house with IR LEDs. Currently 5 eggs being incubated.

Doctor Who TARDIS doorbell

Raspberry pi Tardis

Raspberry pi Tardis doorbell

Google AIY with Tech-nic-Allie

Ok Google! AIY Voice Kit MagPi

Allie assembles this Google Home kit, that runs on a Raspberry Pi, then uses the Google Home to test her space knowledge with a little trivia game. Stay tuned at the end to see a few printed cases you can use instead of the cardboard.

Buying a Coke with a Raspberry Pi rover

Buy a coke with raspberry pi rover

Mission date : March 26 2018 My raspberry pi project. I use LTE modem to connect internet. python programming. raspberry pi controls pi cam, 2servo motor, 2dc motor. (This video recoded with gopro to upload youtube. Actually I controll this rover by pi cam.

Raspberry Pi security camera

🔴How to Make a Smart Security Camera With Movement Notification – Under 60$

I built my first security camera with motion-control connected to my raspberry pi with MotionEyeOS. What you need: *Raspberry pi 3 (I prefer pi 3) *Any Webcam or raspberry pi cam *Mirco SD card (min 8gb) Useful links : Download the motioneyeOS software here ➜ https://github.com/ccrisan/motioneyeos/releases How to do it: – Download motioneyeOS to your empty SD card (I mounted it via Etcher ) – I always do a sudo apt-upgrade & sudo apt-update on my projects, in the Pi.

Happy Easter!

The post Here, have some videos! appeared first on Raspberry Pi.

UK IPTV Provider ACE Calls it Quits, Cites Mounting Legal Pressure

Post Syndicated from Andy original https://torrentfreak.com/uk-iptv-provider-ace-calls-it-quits-cites-mounting-legal-pressure-180402/

Terms including “Kodi box” are now in common usage in the UK and thanks to continuing coverage in the tabloid media, more and more people are learning that free content is just a few clicks away.

In parallel, premium IPTV services are also on the up. In basic terms, these provide live TV and sports through an Internet connection in a consumer-friendly way. When bundled with beautiful interfaces and fully functional Electronic Program Guides (EPG), they’re almost indistinguishable from services offered by Sky and BTSport, for example.

These come at a price, typically up to £10 per month or £20 for a three-month package, but for the customer this represents good value for money. Many providers offer several thousand channels in decent quality and reliability is much better than free streams. This kind of service was offered by prominent UK provider ACE TV but an announcement last December set alarm bells ringing.

“It saddens me to announce this, but due to pressure from the authorities in the UK, we are no longer selling new subscriptions. This obviously includes trials,” ACE said in a statement.

ACE insisted that it would continue as a going concern, servicing existing customers. However, it did keep its order books open for a while longer, giving people one last chance to subscribe to the service for anything up to a year. And with that ACE continued more quietly in the background, albeit with a disabled Facebook page.

But things were not well in ACE land. Like all major IPTV providers delivering services to the UK, ACE was subjected to blocking action by the English Premier League and UEFA. High Court injunctions allow ISPs in the UK to block their pirate streams in real-time, meaning that matches were often rendered inaccessible to ACE’s customers.

While this blocking can be mitigated when the customer uses a VPN, most don’t want to go to the trouble. Some IPTV providers have engaged in a game of cat-and-mouse with the blocking efforts, some with an impressive level of success. However, it appears that the nuisance eventually took its toll on ACE.

“The ISPs in the UK and across Europe have recently become much more aggressive in blocking our service while football games are in progress,” ACE said in a statement last month.

“In order to get ourselves off of the ISP blacklist we are going to black out the EPL games for all users (including VPN users) starting on Monday. We believe that this will enable us to rebuild the bypass process and successfully provide you with all EPL games.”

People familiar with the blocking process inform TF that this is unlikely to have worked.

Although nobody outside the EPL’s partners knows exactly how the system works, it appears that anti-piracy companies simply subscribe to IPTV services themselves and extract the IP addresses serving the content. ISPs then block them. No pause would’ve helped the situation.

Then, on March 24, another announcement indicated that ACE probably wouldn’t make it very far into 2019.

“It is with sorrow that we announce that we are no longer accepting renewals, upgrades to existing subscriptions or the purchase of new credits. We plan to support existing subscriptions until they expire,” the team wrote.

“EPL games including highlights continue to be blocked and are not expected to be reinstated before the end of the season.”

The suggestion was that ACE would keep going, at least for a while, but chat transcripts with the company obtained by TF last month indicated that ACE would probably shut down, sooner rather than later. Less than a week on, that proved to be the case.

On or around March 29, ACE began sending emails out to customers, announcing the end of the company.

“We recently announced that Ace was no longer accepting renewals or offering new reseller credits but planned to support existing subscription. Due to mounting legal pressure in the UK we have been forced to change our plans and we are now announcing that Ace will close down at the end of March,” the email read.

“This means that from April 1st onwards the Ace service will no longer work.”

April 1 was yesterday and it turns out it wasn’t a joke. Customers who paid in advance no longer have a service and those who paid a year up front are particularly annoyed. So-called ‘re-sellers’ of ACE are fuming more than most.

Re-sellers effectively act as sales agents for IPTV providers, buying access to the service at a reduced rate and making a small profit on each subscriber they sign up. They get a nice web interface to carry out the transactions and it’s something that anyone can do.

However, this generally requires investment from the re-seller in order to buy ‘credits’ up front, which are used to sell services to new customers. Those who invested money in this way with ACE are now in trouble.

“If anyone from ACE is reading here, yer a bunch of fuckin arseholes. I hope your next shite is a hedgehog!!” one shouted on Reddit. “Being a reseller for them and losing hundreds a pounds is bad enough!!”

While the loss of a service is probably a shock to more recent converts to the world of IPTV, those with experience of any kind of pirate TV product should already be well aware that this is nothing out of the ordinary.

For those who bought hacked or cloned satellite cards in the 1990s, to those who used ‘chipped’ cable boxes a little later on, the free rides all come to an end at some point. It’s just a question of riding the wave when it arrives and paying attention to the next big thing, without investing too much money at the wrong time.

For ACE’s former customers, it’s simply a case of looking for a new provider. There are plenty of them, some with zero intent of shutting down. There are rumors that ACE might ‘phoenix’ themselves under another name but that’s also par for the course when people feel they’re owed money and suspicions are riding high.

“Please do not ask if we are rebranding/setting up a new service, the answer is no,” ACE said in a statement.

And so the rollercoaster continues…

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN reviews, discounts, offers and coupons.

A geometric Rust adventure

Post Syndicated from Eevee original https://eev.ee/blog/2018/03/30/a-geometric-rust-adventure/

Hi. Yes. Sorry. I’ve been trying to write this post for ages, but I’ve also been working on a huge writing project, and apparently I have a very limited amount of writing mana at my disposal. I think this is supposed to be a Patreon reward from January. My bad. I hope it’s super great to make up for the wait!

I recently ported some math code from C++ to Rust in an attempt to do a cool thing with Doom. Here is my story.

The problem

I presented it recently as a conundrum (spoilers: I solved it!), but most of those details are unimportant.

The short version is: I have some shapes. I want to find their intersection.

Really, I want more than that: I want to drop them all on a canvas, intersect everything with everything, and pluck out all the resulting polygons. The input is a set of cookie cutters, and I want to press them all down on the same sheet of dough and figure out what all the resulting contiguous pieces are. And I want to know which cookie cutter(s) each piece came from.

But intersection is a good start.

Example of the goal.  Given two squares that overlap at their corners, I want to find the small overlap piece, plus the two L-shaped pieces left over from each square

I’m carefully referring to the input as shapes rather than polygons, because each one could be a completely arbitrary collection of lines. Obviously there’s not much you can do with shapes that aren’t even closed, but at the very least, I need to handle concavity and multiple disconnected polygons that together are considered a single input.

This is a non-trivial problem with a lot of edge cases, and offhand I don’t know how to solve it robustly. I’m not too eager to go figure it out from scratch, so I went hunting for something I could build from.

(Infuriatingly enough, I can just dump all the shapes out in an SVG file and any SVG viewer can immediately solve the problem, but that doesn’t quite help me. Though I have had a few people suggest I just rasterize the whole damn problem, and after all this, I’m starting to think they may have a point.)

Alas, I couldn’t find a Rust library for doing this. I had a hard time finding any library for doing this that wasn’t a massive fully-featured geometry engine. (I could’ve used that, but I wanted to avoid non-Rust dependencies if possible, since distributing software is already enough of a nightmare.)

A Twitter follower directed me towards a paper that described how to do very nearly what I wanted and nothing else: “A simple algorithm for Boolean operations on polygons” by F. Martínez (2013). Being an academic paper, it’s trapped in paywall hell; sorry about that. (And as I understand it, none of the money you’d pay to get the paper would even go to the authors? Is that right? What a horrible and predatory system for discovering and disseminating knowledge.)

The paper isn’t especially long, but it does describe an awful lot of subtle details and is mostly written in terms of its own reference implementation. Rather than write my own implementation based solely on the paper, I decided to try porting the reference implementation from C++ to Rust.

And so I fell down the rabbit hole.

The basic algorithm

Thankfully, the author has published the sample code on his own website, if you want to follow along. (It’s the bottom link; the same author has, confusingly, published two papers on the same topic with similar titles, four years apart.)

If not, let me describe the algorithm and how the code is generally laid out. The algorithm itself is based on a sweep line, where a vertical line passes across the plane and ✨ does stuff ✨ as it encounters various objects. This implementation has no physical line; instead, it keeps track of which segments from the original polygon would be intersecting the sweep line, which is all we really care about.

A vertical line is passing rightwards over a couple intersecting shapes.  The line current intersects two of the shapes' sides, and these two sides are the "sweep list"

The code is all bundled inside a class with only a single public method, run, because… that’s… more object-oriented, I guess. There are several helper methods, and state is stored in some attributes. A rough outline of run is:

  1. Run through all the line segments in both input polygons. For each one, generate two SweepEvents (one for each endpoint) and add them to a std::deque for storage.

    Add pointers to the two SweepEvents to a std::priority_queue, the event queue. This queue uses a custom comparator to order the events from left to right, so the top element is always the leftmost endpoint.

  2. Loop over the event queue (where an “event” means the sweep line passed over the left or right end of a segment). Encountering a left endpoint means the sweep line is newly touching that segment, so add it to a std::set called the sweep list. An important point is that std::set is ordered, and the sweep list uses a comparator that keeps segments in order vertically.

    Encountering a right endpoint means the sweep line is leaving a segment, so that segment is removed from the sweep list.

  3. When a segment is added to the sweep list, it may have up to two neighbors: the segment above it and the segment below it. Call possibleIntersection to check whether it intersects either of those neighbors. (This is nearly sufficient to find all intersections, which is neat.)

  4. If possibleIntersection detects an intersection, it will split each segment into two pieces then and there. The old segment is shortened in-place to become the left part, and a new segment is created for the right part. The new endpoints at the point of intersection are added to the event queue.

  5. Some bookkeeping is done along the way to track which original polygons each segment is inside, and eventually the segments are reconstructed into new polygons.

Hopefully that’s enough to follow along. It took me an inordinately long time to tease this out. The comments aren’t especially helpful.

1
    std::deque<SweepEvent> eventHolder;    // It holds the events generated during the computation of the boolean operation

Syntax and basic semantics

The first step was to get something that rustc could at least parse, which meant translating C++ syntax to Rust syntax.

This was surprisingly straightforward! C++ classes become Rust structs. (There was no inheritance here, thankfully.) All the method declarations go away. Method implementations only need to be indented and wrapped in impl.

I did encounter some unnecessarily obtuse uses of the ternary operator:

1
(prevprev != sl.begin()) ? --prevprev : prevprev = sl.end();

Rust doesn’t have a ternary — you can use a regular if block as an expression — so I expanded these out.

C++ switch blocks become Rust match blocks, but otherwise function basically the same. Rust’s enums are scoped (hallelujah), so I had to explicitly spell out where enum values came from.

The only really annoying part was changing function signatures; C++ types don’t look much at all like Rust types, save for the use of angle brackets. Rust also doesn’t pass by implicit reference, so I needed to sprinkle a few &s around.

I would’ve had a much harder time here if this code had relied on any remotely esoteric C++ functionality, but thankfully it stuck to pretty vanilla features.

Language conventions

This is a geometry problem, so the sample code unsurprisingly has its own home-grown point type. Rather than port that type to Rust, I opted to use the popular euclid crate. Not only is it code I didn’t have to write, but it already does several things that the C++ code was doing by hand inline, like dot products and cross products. And all I had to do was add one line to Cargo.toml to use it! I have no idea how anyone writes C or C++ without a package manager.

The C++ code used getters, i.e. point.x (). I’m not a huge fan of getters, though I do still appreciate the need for them in lowish-level systems languages where you want to future-proof your API and the language wants to keep a clear distinction between attribute access and method calls. But this is a point, which is nothing more than two of the same numeric type glued together; what possible future logic might you add to an accessor? The euclid authors appear to side with me and leave the coordinates as public fields, so I took great joy in removing all the superfluous parentheses.

Polygons are represented with a Polygon class, which has some number of Contours. A contour is a single contiguous loop. Something you’d usually think of as a polygon would only have one, but a shape with a hole would have two: one for the outside, one for the inside. The weird part of this arrangement was that Polygon implemented nearly the entire STL container interface, then waffled between using it and not using it throughout the rest of the code. Rust lets anything in the same module access non-public fields, so I just skipped all that and used polygon.contours directly. Hell, I think I made contours public.

Finally, the SweepEvent type has a pol field that’s declared as an enum PolygonType (either SUBJECT or CLIPPING, to indicate which of the two inputs it is), but then some other code uses the same field as a numeric index into a polygon’s contours. Boy I sure do love static typing where everything’s a goddamn integer. I wanted to extend the algorithm to work on arbitrarily many input polygons anyway, so I scrapped the enum and this became a usize.


Then I got to all the uses of STL. I have only a passing familiarity with the C++ standard library, and this code actually made modest use of it, which caused some fun days-long misunderstandings.

As mentioned, the SweepEvents are stored in a std::deque, which is never read from. It took me a little thinking to realize that the deque was being used as an arena: it’s the canonical home for the structs so pointers to them can be tossed around freely. (It can’t be a std::vector, because that could reallocate and invalidate all the pointers; std::deque is probably a doubly-linked list, and guarantees no reallocation.)

Rust’s standard library does have a doubly-linked list type, but I knew I’d run into ownership hell here later anyway, so I think I replaced it with a Rust Vec to start with. It won’t compile either way, so whatever. We’ll get back to this in a moment.

The list of segments currently intersecting the sweep line is stored in a std::set. That type is explicitly ordered, which I’m very glad I knew already. Rust has two set types, HashSet and BTreeSet; unsurprisingly, the former is unordered and the latter is ordered. Dropping in BTreeSet and fixing some method names got me 90% of the way there.

Which brought me to the other 90%. See, the C++ code also relies on finding nodes adjacent to the node that was just inserted, via STL iterators.

1
2
3
next = prev = se->posSL = it = sl.insert(se).first;
(prev != sl.begin()) ? --prev : prev = sl.end();
++next;

I freely admit I’m bad at C++, but this seems like something that could’ve used… I don’t know, 1 comment. Or variable names more than two letters long. What it actually does is:

  1. Add the current sweep event (se) to the sweep list (sl), which returns a pair whose first element is an iterator pointing at the just-inserted event.

  2. Copies that iterator to several other variables, including prev and next.

  3. If the event was inserted at the beginning of the sweep list, set prev to the sweep list’s end iterator, which in C++ is a legal-but-invalid iterator meaning “the space after the end” or something. This is checked for in later code, to see if there is a previous event to look at. Otherwise, decrement prev, so it’s now pointing at the event immediately before the inserted one.

  4. Increment next normally. If the inserted event is last, then this will bump next to the end iterator anyway.

In other words, I need to get the previous and next elements from a BTreeSet. Rust does have bidirectional iterators, which BTreeSet supports… but BTreeSet::insert only returns a bool telling me whether or not anything was inserted, not the position. I came up with this:

1
2
3
let mut maybe_below = active_segments.range(..segment).last().map(|v| *v);
let mut maybe_above = active_segments.range(segment..).next().map(|v| *v);
active_segments.insert(segment);

The range method returns an iterator over a subset of the tree. The .. syntax makes a range (where the right endpoint is exclusive), so ..segment finds the part of the tree before the new segment, and segment.. finds the part of the tree after it. (The latter would start with the segment itself, except I haven’t inserted it yet, so it’s not actually there.)

Then the standard next() and last() methods on bidirectional iterators find me the element I actually want. But the iterator might be empty, so they both return an Option. Also, iterators tend to return references to their contents, but in this case the contents are already references, and I don’t want a double reference, so the map call dereferences one layer — but only if the Option contains a value. Phew!

This is slightly less efficient than the C++ code, since it has to look up where segment goes three times rather than just one. I might be able to get it down to two with some more clever finagling of the iterator, but microsopic performance considerations were a low priority here.

Finally, the event queue uses a std::priority_queue to keep events in a desired order and efficiently pop the next one off the top.

Except priority queues act like heaps, where the greatest (i.e., last) item is made accessible.

Sorting out sorting

C++ comparison functions return true to indicate that the first argument is less than the second argument. Sweep events occur from left to right. You generally implement sorts so that the first thing comes, erm, first.

But sweep events go in a priority queue, and priority queues surface the last item, not the first. This C++ code handled this minor wrinkle by implementing its comparison backwards.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
struct SweepEventComp : public std::binary_function<SweepEvent, SweepEvent, bool> { // for sorting sweep events
// Compare two sweep events
// Return true means that e1 is placed at the event queue after e2, i.e,, e1 is processed by the algorithm after e2
bool operator() (const SweepEvent* e1, const SweepEvent* e2)
{
    if (e1->point.x () > e2->point.x ()) // Different x-coordinate
        return true;
    if (e2->point.x () > e1->point.x ()) // Different x-coordinate
        return false;
    if (e1->point.y () != e2->point.y ()) // Different points, but same x-coordinate. The event with lower y-coordinate is processed first
        return e1->point.y () > e2->point.y ();
    if (e1->left != e2->left) // Same point, but one is a left endpoint and the other a right endpoint. The right endpoint is processed first
        return e1->left;
    // Same point, both events are left endpoints or both are right endpoints.
    if (signedArea (e1->point, e1->otherEvent->point, e2->otherEvent->point) != 0) // not collinear
        return e1->above (e2->otherEvent->point); // the event associate to the bottom segment is processed first
    return e1->pol > e2->pol;
}
};

Maybe it’s just me, but I had a hell of a time just figuring out what problem this was even trying to solve. I still have to reread it several times whenever I look at it, to make sure I’m getting the right things backwards.

Making this even more ridiculous is that there’s a second implementation of this same sort, with the same name, in another file — and that one’s implemented forwards. And doesn’t use a tiebreaker. I don’t entirely understand how this even compiles, but it does!

I painstakingly translated this forwards to Rust. Unlike the STL, Rust doesn’t take custom comparators for its containers, so I had to implement ordering on the types themselves (which makes sense, anyway). I wrapped everything in the priority queue in a Reverse, which does what it sounds like.

I’m fairly pleased with Rust’s ordering model. Most of the work is done in Ord, a trait with a cmp() method returning an Ordering (one of Less, Equal, and Greater). No magic numbers, no need to implement all six ordering methods! It’s incredible. Ordering even has some handy methods on it, so the usual case of “order by this, then by this” can be written as:

1
2
return self.point().x.cmp(&other.point().x)
    .then(self.point().y.cmp(&other.point().y));

Well. Just kidding! It’s not quite that easy. You see, the points here are composed of floats, and floats have the fun property that not all of them are comparable. Specifically, NaN is not less than, greater than, or equal to anything else, including itself. So IEEE 754 float ordering cannot be expressed with Ord. Unless you want to just make up an answer for NaN, but Rust doesn’t tend to do that.

Rust’s float types thus implement the weaker PartialOrd, whose method returns an Option<Ordering> instead. That makes the above example slightly uglier:

1
2
return self.point().x.partial_cmp(&other.point().x).unwrap()
    .then(self.point().y.partial_cmp(&other.point().y).unwrap())

Also, since I use unwrap() here, this code will panic and take the whole program down if the points are infinite or NaN. Don’t do that.

This caused some minor inconveniences in other places; for example, the general-purpose cmp::min() doesn’t work on floats, because it requires an Ord-erable type. Thankfully there’s a f64::min(), which handles a NaN by returning the other argument.

(Cool story: for the longest time I had this code using f32s. I’m used to translating int to “32 bits”, and apparently that instinct kicked in for floats as well, even floats spelled double.)

The only other sorting adventure was this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
// Due to overlapping edges the resultEvents array can be not wholly sorted
bool sorted = false;
while (!sorted) {
    sorted = true;
    for (unsigned int i = 0; i < resultEvents.size (); ++i) {
        if (i + 1 < resultEvents.size () && sec (resultEvents[i], resultEvents[i+1])) {
            std::swap (resultEvents[i], resultEvents[i+1]);
            sorted = false;
        }
    }
}

(I originally misread this comment as saying “the array cannot be wholly sorted” and had no idea why that would be the case, or why the author would then immediately attempt to bubble sort it.)

I’m still not sure why this uses an ad-hoc sort instead of std::sort. But I’m used to taking for granted that general-purpose sorting implementations are tuned to work well for almost-sorted data, like Python’s. Maybe C++ is untrustworthy here, for some reason. I replaced it with a call to .sort() and all seemed fine.

Phew! We’re getting there. Finally, my code appears to type-check.

But now I see storm clouds gathering on the horizon.

Ownership hell

I have a problem. I somehow run into this problem every single time I use Rust. The solutions are never especially satisfying, and all the hacks I might use if forced to write C++ turn out to be unsound, which is even more annoying because rustc is just sitting there with this smug “I told you so expression” and—

The problem is ownership, which Rust is fundamentally built on. Any given value must have exactly one owner, and Rust must be able to statically convince itself that:

  1. No reference to a value outlives that value.
  2. If a mutable reference to a value exists, no other references to that value exist at the same time.

This is the core of Rust. It guarantees at compile time that you cannot lose pointers to allocated memory, you cannot double-free, you cannot have dangling pointers.

It also completely thwarts a lot of approaches you might be inclined to take if you come from managed languages (where who cares, the GC will take care of it) or C++ (where you just throw pointers everywhere and hope for the best apparently).

For example, pointer loops are impossible. Rust’s understanding of ownership and lifetimes is hierarchical, and it simply cannot express loops. (Rust’s own doubly-linked list type uses raw pointers and unsafe code under the hood, where “unsafe” is an escape hatch for the usual ownership rules. Since I only recently realized that pointers to the inside of a mutable Vec are a bad idea, I figure I should probably not be writing unsafe code myself.)

This throws a few wrenches in the works.

Problem the first: pointer loops

I immediately ran into trouble with the SweepEvent struct itself. A SweepEvent pulls double duty: it represents one endpoint of a segment, but each left endpoint also handles bookkeeping for the segment itself — which means that most of the fields on a right endpoint are unused. Also, and more importantly, each SweepEvent has a pointer to the corresponding SweepEvent at the other end of the same segment. So a pair of SweepEvents point to each other.

Rust frowns upon this. In retrospect, I think I could’ve kept it working, but I also think I’m wrong about that.

My first step was to wrench SweepEvent apart. I moved all of the segment-stuff (which is virtually all of it) into a single SweepSegment type, and then populated the event queue with a SweepEndpoint tuple struct, similar to:

1
2
3
4
5
6
enum SegmentEnd {
    Left,
    Right,
}

struct SweepEndpoint<'a>(&'a SweepSegment, SegmentEnd);

This makes SweepEndpoint essentially a tuple with a name. The 'a is a lifetime and says, more or less, that a SweepEndpoint cannot outlive the SweepSegment it references. Makes sense.

Problem solved! I no longer have mutually referential pointers. But I do still have pointers (well, references), and they have to point to something.

Problem the second: where’s all the data

Which brings me to the problem I always run into with Rust. I have a bucket of things, and I need to refer to some of them multiple times.

I tried half a dozen different approaches here and don’t clearly remember all of them, but I think my core problem went as follows. I translated the C++ class to a Rust struct with some methods hanging off of it. A simplified version might look like this.

1
2
3
4
struct Algorithm {
    arena: LinkedList<SweepSegment>,
    event_queue: BinaryHeap<SweepEndpoint>,
}

Ah, hang on — SweepEndpoint needs to be annotated with a lifetime, so Rust can enforce that those endpoints don’t live longer than the segments they refer to. No problem?

1
2
3
4
struct Algorithm<'a> {
    arena: LinkedList<SweepSegment>,
    event_queue: BinaryHeap<SweepEndpoint<'a>>,
}

Okay! Now for some methods.

1
2
3
4
5
6
7
8
fn run(&mut self) {
    self.arena.push_back(SweepSegment{ data: 5 });
    self.event_queue.push(SweepEndpoint(self.arena.back().unwrap(), SegmentEnd::Left));
    self.event_queue.push(SweepEndpoint(self.arena.back().unwrap(), SegmentEnd::Right));
    for event in &self.event_queue {
        println!("{:?}", event)
    }
}

Aaand… this doesn’t work. Rust “cannot infer an appropriate lifetime for autoref due to conflicting requirements”. The trouble is that self.arena.back() takes a reference to self.arena, and then I put that reference in the event queue. But I promised that everything in the event queue has lifetime 'a, and I don’t actually know how long self lives here; I only know that it can’t outlive 'a, because that would invalidate the references it holds.

A little random guessing let me to change &mut self to &'a mut self — which is fine because the entire impl block this lives in is already parameterized by 'a — and that makes this compile! Hooray! I think that’s because I’m saying self itself has exactly the same lifetime as the references it holds onto, which is true, since it’s referring to itself.

Let’s get a little more ambitious and try having two segments.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
fn run(&'a mut self) {
    self.arena.push_back(SweepSegment{ data: 5 });
    self.event_queue.push(SweepEndpoint(self.arena.back().unwrap(), SegmentEnd::Left));
    self.event_queue.push(SweepEndpoint(self.arena.back().unwrap(), SegmentEnd::Right));
    self.arena.push_back(SweepSegment{ data: 17 });
    self.event_queue.push(SweepEndpoint(self.arena.back().unwrap(), SegmentEnd::Left));
    self.event_queue.push(SweepEndpoint(self.arena.back().unwrap(), SegmentEnd::Right));
    for event in &self.event_queue {
        println!("{:?}", event)
    }
}

Whoops! Rust complains that I’m trying to mutate self.arena while other stuff is referring to it. And, yes, that’s true — I have references to it in the event queue, and Rust is preventing me from potentially deleting everything from the queue when references to it still exist. I’m not actually deleting anything here, of course (though I could be if this were a Vec!), but Rust’s type system can’t encode that (and I dread the thought of a type system that can).

I struggled with this for a while, and rapidly encountered another complete showstopper:

1
2
3
4
5
6
fn run(&'a mut self) {
    self.mutate_something();
    self.mutate_something();
}

fn mutate_something(&'a mut self) {}

Rust objects that I’m trying to borrow self mutably, twice — once for the first call, once for the second.

But why? A borrow is supposed to end automatically once it’s no longer used, right? Maybe if I throw some braces around it for scope… nope, that doesn’t help either.

It’s true that borrows usually end automatically, but here I have explicitly told Rust that mutate_something() should borrow with the lifetime 'a, which is the same as the lifetime in run(). So the first call explicitly borrows self for at least the rest of the method. Removing the lifetime from mutate_something() does fix this error, but if that method tries to add new segments, I’m back to the original problem.

Oh no. The mutation in the C++ code is several calls deep. Porting it directly seems nearly impossible.

The typical solution here — at least, the first thing people suggest to me on Twitter — is to wrap basically everything everywhere in Rc<RefCell<T>>, which gives you something that’s reference-counted (avoiding questions of ownership) and defers borrow checks until runtime (avoiding questions of mutable borrows). But that seems pretty heavy-handed here — not only does RefCell add .borrow() noise anywhere you actually want to interact with the underlying value, but do I really need to refcount these tiny structs that only hold a handful of floats each?

I set out to find a middle ground.

Solution, kind of

I really, really didn’t want to perform serious surgery on this code just to get it to build. I still didn’t know if it worked at all, and now I had to rearrange it without being able to check if I was breaking it further. (This isn’t Rust’s fault; it’s a natural problem with porting between fairly different paradigms.)

So I kind of hacked it into working with minimal changes, producing a grotesque abomination which I’m ashamed to link to. Here’s how!

First, I got rid of the class. It turns out this makes lifetime juggling much easier right off the bat. I’m pretty sure Rust considers everything in a struct to be destroyed simultaneously (though in practice it guarantees it’ll destroy fields in order), which doesn’t leave much wiggle room. Locals within a function, on the other hand, can each have their own distinct lifetimes, which solves the problem of expressing that the borrows won’t outlive the arena.

Speaking of the arena, I solved the mutability problem there by switching to… an arena! The typed-arena crate (a port of a type used within Rust itself, I think) is an allocator — you give it a value, and it gives you back a reference, and the reference is guaranteed to be valid for as long as the arena exists. The method that does this is sneaky and takes &self rather than &mut self, so Rust doesn’t know you’re mutating the arena and won’t complain. (One drawback is that the arena will never free anything you give to it, but that’s not a big problem here.)


My next problem was with mutation. The main loop repeatedly calls possibleIntersection with pairs of segments, which can split either or both segment. Rust definitely doesn’t like that — I’d have to pass in two &muts, both of which are mutable references into the same arena, and I’d have a bunch of immutable references into that arena in the sweep list and elsewhere. This isn’t going to fly.

This is kind of a shame, and is one place where Rust seems a little overzealous. Something like this seems like it ought to be perfectly valid:

1
2
3
4
let mut v = vec![1u32, 2u32];
let a = &mut v[0];
let b = &mut v[1];
// do stuff with a, b

The trouble is, Rust only knows the type signature, which here is something like index_mut(&'a mut self, index: usize) -> &'a T. Nothing about that says that you’re borrowing distinct elements rather than some core part of the type — and, in fact, the above code is only safe because you’re borrowing distinct elements. In the general case, Rust can’t possibly know that. It seems obvious enough from the different indexes, but nothing about the type system even says that different indexes have to return different values. And what if one were borrowed as &mut v[1] and the other were borrowed with v.iter_mut().next().unwrap()?

Anyway, this is exactly where people start to turn to RefCell — if you’re very sure you know better than Rust, then a RefCell will skirt the borrow checker while still enforcing at runtime that you don’t have more than one mutable borrow at a time.

But half the lines in this algorithm examine the endpoints of a segment! I don’t want to wrap the whole thing in a RefCell, or I’ll have to say this everywhere:

1
if segment1.borrow().point.x < segment2.borrow().point.x { ... }

Gross.

But wait — this code only mutates the points themselves in one place. When a segment is split, the original segment becomes the left half, and a new segment is created to be the right half. There’s no compelling need for this; it saves an allocation for the left half, but it’s not critical to the algorithm.

Thus, I settled on a compromise. My segment type now looks like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
struct SegmentPacket {
    // a bunch of flags and whatnot used in the algorithm
}
struct SweepSegment {
    left_point: MapPoint,
    right_point: MapPoint,
    faces_outwards: bool,
    index: usize,
    order: usize,
    packet: RefCell<SegmentPacket>,
}

I do still need to call .borrow() or .borrow_mut() to get at the stuff in the “packet”, but that’s far less common, so there’s less noise overall. And I don’t need to wrap it in Rc because it’s part of a type that’s allocated in the arena and passed around only via references.


This still leaves me with the problem of how to actually perform the splits.

I’m not especially happy with what I came up with, I don’t know if I can defend it, and I suspect I could do much better. I changed possibleIntersection so that rather than performing splits, it returns the points at which each segment needs splitting, in the form (usize, Option<MapPoint>, Option<MapPoint>). (The usize is used as a flag for calling code and oughta be an enum, but, isn’t yet.)

Now the top-level function is responsible for all arena management, and all is well.

Except, er. possibleIntersection is called multiple times, and I don’t want to copy-paste a dozen lines of split code after each call. I tried putting just that code in its own function, which had the world’s most godawful signature, and that didn’t work because… uh… hm. I can’t remember why, exactly! Should’ve written that down.

I tried a local closure next, but closures capture their environment by reference, so now I had references to a bunch of locals for as long as the closure existed, which meant I couldn’t mutate those locals. Argh. (This seems a little silly to me, since the closure’s references cannot possibly be used for anything if the closure isn’t being called, but maybe I’m missing something. Or maybe this is just a limitation of lifetimes.)

Increasingly desperate, I tried using a macro. But… macros are hygienic, which means that any new name you use inside a macro is different from any name outside that macro. The macro thus could not see any of my locals. Usually that’s good, but here I explicitly wanted the macro to mess with my locals.

I was just about to give up and go live as a hermit in a cabin in the woods, when I discovered something quite incredible. You can define local macros! If you define a macro inside a function, then it can see any locals defined earlier in that function. Perfect!

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
macro_rules! _split_segment (
    ($seg:expr, $pt:expr) => (
        {
            let pt = $pt;
            let seg = $seg;
            // ... waaay too much code ...
        }
    );
);

loop {
    // ...
    // This is possibleIntersection, renamed because Rust rightfully complains about camelCase
    let cross = handle_intersections(Some(segment), maybe_above);
    if let Some(pt) = cross.1 {
        segment = _split_segment!(segment, pt);
    }
    if let Some(pt) = cross.2 {
        maybe_above = Some(_split_segment!(maybe_above.unwrap(), pt));
    }
    // ...
}

(This doesn’t actually quite match the original algorithm, which has one case where a segment can be split twice. I realized that I could just do the left-most split, and a later iteration would perform the other split. I sure hope that’s right, anyway.)

It’s a bit ugly, and I ran into a whole lot of implicit behavior from the C++ code that I had to fix — for example, the segment is sometimes mutated just before it’s split, purely as a shortcut for mutating the left part of the split. But it finally compiles! And runs! And kinda worked, a bit!

Aftermath

I still had a lot of work to do.

For one, this code was designed for intersecting two shapes, not mass-intersecting a big pile of shapes. The basic algorithm doesn’t care about how many polygons you start with — all it sees is segments — but the code for constructing the return value needed some heavy modification.

The biggest change by far? The original code traced each segment once, expecting the result to be only a single shape. I had to change that to trace each side of each segment once, since the vast bulk of the output consists of shapes which share a side. This violated a few assumptions, which I had to hack around.

I also ran into a couple very bad edge cases, spent ages debugging them, then found out that the original algorithm had a subtle workaround that I’d commented out because it was awkward to port but didn’t seem to do anything. Whoops!

The worst was a precision error, where a vertical line could be split on a point not quite actually on the line, which wreaked all kinds of havoc. I worked around that with some tasteful rounding, which is highly dubious but makes the output more appealing to my squishy human brain. (I might switch to the original workaround, but I really dislike that even simple cases can spit out points at 1500.0000000000003. The whole thing is parameterized over the coordinate type, so maybe I could throw a rational type in there and cross my fingers?)

All that done, I finally, finally, after a couple months of intermittent progress, got what I wanted!

This is Doom 2’s MAP01. The black area to the left of center is where the player starts. Gray areas indicate where the player can walk from there, with lighter shades indicating more distant areas, where “distance” is measured by the minimum number of line crossings. Red areas can’t be reached at all.

(Note: large playable chunks of the map, including the exit room, are red. That’s because those areas are behind doors, and this code doesn’t understand doors yet.)

(Also note: The big crescent in the lower-right is also black because I was lazy and looked for the player’s starting sector by checking the bbox, and that sector’s bbox happens to match.)

The code that generated this had to go out of its way to delete all the unreachable zones around solid walls. I think I could modify the algorithm to do that on the fly pretty easily, which would probably speed it up a bit too. Downside is that the algorithm would then be pretty specifically tied to this problem, and not usable for any other kind of polygon intersection, which I would think could come up elsewhere? The modifications would be pretty minor, though, so maybe I could confine them to a closure or something.

Some final observations

It runs surprisingly slowly. Like, multiple seconds. Unless I add --release, which speeds it up by a factor of… some number with multiple digits. Wahoo. Debug mode has a high price, especially with a lot of calls in play.

The current state of this code is on GitHub. Please don’t look at it. I’m very sorry.

Honestly, most of my anguish came not from Rust, but from the original code relying on lots of fairly subtle behavior without bothering to explain what it was doing or even hint that anything unusual was going on. God, I hate C++.

I don’t know if the Rust community can learn from this. I don’t know if I even learned from this. Let’s all just quietly forget about it.

Now I just need to figure this one out…

An elephant being eaten by a snake: Easter eggs on your Pi

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/raspberry-pi-easter-eggs/

Grab your Raspberry Pi, everyone — we’re going on an Easter egg hunt, and all of you are invited!

Voilà, a terminal window!

When they’re not chocolate, Easter eggs are hidden content in movies, games, DVD menus, and computers. So open a terminal window and try the following:

1. A little attitude

Type aptitude moo into the terminal window and press Enter. Now type aptitude -v moo. Keep adding v’s, like this: aptitude -vv moo

2. Party

Addicted to memes? Type curl parrot.live into your window!

3. In a galaxy far, far away…

You’ll need to install telnet for this one: start by typing sudo apt-get install telnet into the terminal. Once it’s installed, enter telnet towel.blinkenlights.nl

4. Pinout

Type pinout into the window to see a handy GPIO pinout diagram for your Pi. Ideal for physical digital making projects!

5. Demo programs

Easter egg-ish: you can try out various demo programs on your Raspberry Pi, such as 1080p video playback and spinning teapots.

Any more?

There’s lots of fun to be had in the terminal of a Raspberry Pi. Do you know any other fun Easter eggs? Share them in the comments!

The post An elephant being eaten by a snake: Easter eggs on your Pi appeared first on Raspberry Pi.

Alex’s quick and easy digital making Easter egg hunt

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/alexs-easter-egg-hunt/

Looking to incorporate some digital making into your Easter weekend? You’ve come to the right place! With a Raspberry Pi, a few wires, and some simple code, you can take your festivities to the next level — here’s how!

Easter Egg Hunt using Raspberry Pi

If you logged in to watch our Instagram live-stream yesterday, you’ll have seen me put together a simple egg carton and some wires to create circuits. These circuits, when closed by way of a foil-wrapped chocolate egg, instruct a Raspberry Pi to reveal the whereabouts of a larger chocolate egg!

Make it

You’ll need an egg carton, two male-to-female jumper wire, and two crocodile leads for each egg you use.

Easter Egg Hunt using Raspberry Pi

Connect your leads together in pairs: one end of a crocodile lead to the male end of one jumper wire. Attach the free crocodile clips of two leads to each corner of the egg carton (as shown up top). Then hook up the female ends to GPIO pins: one numbered pin and one ground pin per egg. I recommend pins 3, 4, 18 and 24, as they all have adjacent GND pins.

Easter Egg Hunt using Raspberry Pi

Your foil-wrapped Easter egg will complete the circuit — make sure it’s touching both the GPIO- and GND-connected clips when resting in the carton.

Easter Egg Hunt using Raspberry Pi

Wrap it

For your convenience (and our sweet tooth), we tested several foil-wrapped eggs (Easter and otherwise) to see which are conductive.

Raspberry Pi on Twitter

We’re egg-sperimenting with Easter deliciousness to find which treat is the most conductive. Why? All will be revealed in our Instagram Easter live-stream tomorrow.

The result? None of them are! But if you unwrap an egg and rewrap it with the non-decorative foil side outward, this tends to work. You could also use aluminium foil or copper tape to create a conductive layer.

Code it

Next, you’ll need to create the code for your hunt. The script below contains the bare bones needed to make the project work — you can embellish it however you wish using GUIs, flashing LEDs, music, etc.

Open Thonny or IDLE on Raspbian and create a new file called egghunt.py. Then enter the following code:

We’re using ButtonBoard from the gpiozero library. This allows us to link several buttons together as an object and set an action for when any number of the buttons are pressed. Here, the script waits for all four circuits to be completed before printing the location of the prize in the Python shell.

Your turn

And that’s it! Now you just need to hide your small foil eggs around the house and challenge your kids/friends/neighbours to find them. Then, once every circuit is completed with an egg, the great prize will be revealed.

Give it a go this weekend! And if you do, be sure to let us know on social media.

(Thank you to Lauren Hyams for suggesting we “do something for Easter” and Ben ‘gpiozero’ Nuttall for introducing me to ButtonBoard.)

The post Alex’s quick and easy digital making Easter egg hunt appeared first on Raspberry Pi.

Key Internet Players Excoriate Canadian Pirate Site Blocking Plan

Post Syndicated from Ernesto original https://torrentfreak.com/key-internet-players-excoriate-canadian-pirate-site-blocking-plan-180323/

In January, a coalition of Canadian companies called on the country’s telecom regulator CRTC to establish a local pirate site blocking program, which would be the first of its kind in North America.

The Canadian deal is supported by FairPlay Canada, a coalition of both copyright holders and major players in the telco industry, such as Bell and Rogers, which also have media companies of their own.

Before making any decisions, the CRTC has asked the public for comments. Last week we highlighted a few from both sides, but in recent days two Internet heavyweights have chimed in.

The first submission comes from the Internet Infrastructure Coalition (i2Coalition), which counts Google, Amazon, Cogeco PEER1, and Tucows among its members. These are all key players in the Internet ecosystem, with a rather strong opinion.

In a strongly-worded letter, the coalition urges the CRTC to reject the proposed “government-backed internet censorship committee” which they say will hurt the public as well as various companies that rely on a free and open Internet.

“The not-for-profit organization envisioned by the FairPlay Canada proposal lacks accountability and oversight, and is certain to cause tremendous collateral damage to innocent Internet business owners,” they write.

“There is shockingly little judicial review or due process in establishing and approving the list of websites being blocked — and no specifics of how this blocking is actually to be implemented.”

According to the coalition, the proposal would stifle innovation, shutter legitimate businesses through overblocking, and harm Canada’s Internet economy.

In addition, they fear that it may lead to broad blockades of specific technologies. This includes VPNs, which Bell condemned in the past, as well as BitTorrent traffic.

“VPN usage itself could be targeted by this proposal, as could the use of torrents, another technology with wide legitimate usage, including digital security on public wifi, along with myriad other business requirements,” the coalition writes.

“We caution that this proposal could be used to attempt to restrict technology innovation. There are no provisions within the FairPlay proposal to avoid vilification of specific technologies. Technologies themselves cannot be bad actors.”

According to the i2Coalition, Canada’s Copyright Modernization Act is already one of the toughest anti-piracy laws in the world and they see no need to go any further. As such, they urge the authorities to reject the plan.

“The government and the CRTC should not hesitate to firmly reject the website blocking plan as a disproportionate, unconstitutional proposal sorely lacking in due process that is inconsistent with the current communications law framework,” the letter concludes.

The second submission we want to highlight comes from the Internet Society. In addition to many individual members, it is supported by dozens of major companies. This includes Google and Facebook, but also ISPs such as Verizon and Comcast, and even copyright holders such as 21st Century Fox and Disney.

While the Internet Society’s Hollywood members have argued in favor of pirate site blockades in the past, even in court, the organization’s submission argues fiercely against this measure.

Pointing to an extensive report Internet Society published last Spring, they inform the CRTC that website blocking techniques “do not solve the problem” and “inflict collateral damage.”

The Internet Society calls on the CRTC to carefully examine the proposal’s potential negative effects on the security of the Internet, the privacy of Canadians, and how it may inadvertently block legitimate websites.

“In our opinion, the negative impacts of disabling access greatly outweigh any benefits,” the Internet Society writes.

Thus far, nearly 10,000 responses have been submitted to the CRTC. The official deadline passes on March 29, after which it is up to the telecoms regulator to factor the different opinions into its final decision.

The i2Coalition submission is available here (pdf) and the Internet Society’s comments can be found here (pdf).

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN reviews, discounts, offers and coupons.

E-Mailing Private HTTPS Keys

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2018/03/e-mailing_priva.html

I don’t know what to make of this story:

The email was sent on Tuesday by the CEO of Trustico, a UK-based reseller of TLS certificates issued by the browser-trusted certificate authorities Comodo and, until recently, Symantec. It was sent to Jeremy Rowley, an executive vice president at DigiCert, a certificate authority that acquired Symantec’s certificate issuance business after Symantec was caught flouting binding industry rules, prompting Google to distrust Symantec certificates in its Chrome browser. In communications earlier this month, Trustico notified DigiCert that 50,000 Symantec-issued certificates Trustico had resold should be mass revoked because of security concerns.

When Rowley asked for proof the certificates were compromised, the Trustico CEO emailed the private keys of 23,000 certificates, according to an account posted to a Mozilla security policy forum. The report produced a collective gasp among many security practitioners who said it demonstrated a shockingly cavalier treatment of the digital certificates that form one of the most basic foundations of website security.

Generally speaking, private keys for TLS certificates should never be archived by resellers, and, even in the rare cases where such storage is permissible, they should be tightly safeguarded. A CEO being able to attach the keys for 23,000 certificates to an email raises troubling concerns that those types of best practices weren’t followed.

I am croggled by the multiple layers of insecurity here.

BoingBoing post.

Serverless Dynamic Web Pages in AWS: Provisioned with CloudFormation

Post Syndicated from AWS Admin original https://aws.amazon.com/blogs/architecture/serverless-dynamic-web-pages-in-aws-provisioned-with-cloudformation/

***This blog is authored by Mike Okner of Monsanto, an AWS customer. It originally appeared on the Monsanto company blog. Minor edits were made to the original post.***

Recently, I was looking to create a status page app to monitor a few important internal services. I wanted this app to be as lightweight, reliable, and hassle-free as possible, so using a “serverless” architecture that doesn’t require any patching or other maintenance was quite appealing.

I also don’t deploy anything in a production AWS environment outside of some sort of template (usually CloudFormation) as a rule. I don’t want to have to come back to something I created ad hoc in the console after 6 months and try to recall exactly how I architected all of the resources. I’ll inevitably forget something and create more problems before solving the original one. So building the status page in a template was a requirement.

The Design
I settled on a design using two Lambda functions, both written in Python 3.6.

The first Lambda function makes requests out to a list of important services and writes their current status to a DynamoDB table. This function is executed once per minute via CloudWatch Event Rule.

The second Lambda function reads each service’s status & uptime information from DynamoDB and renders a Jinja template. This function is behind an API Gateway that has been configured to return text/html instead of its default application/json Content-Type.

The CloudFormation Template
AWS provides a Serverless Application Model template transformer to streamline the templating of Lambda + API Gateway designs, but it assumes (like everything else about the API Gateway) that you’re actually serving an API that returns JSON content. So, unfortunately, it won’t work for this use-case because we want to return HTML content. Instead, we’ll have to enumerate every resource like usual.

The Skeleton
We’ll be using YAML for the template in this example. I find it easier to read than JSON, but you can easily convert between the two with a converter if you disagree.

---
AWSTemplateFormatVersion: '2010-09-09'
Description: Serverless status page app
Resources:
  # [...Resources]

The Status-Checker Lambda Resource
This one is triggered on a schedule by CloudWatch, and looks like:

# Status Checker Lambda
CheckerLambda:
  Type: AWS::Lambda::Function
  Properties:
    Code: ./lambda.zip
    Environment:
      Variables:
        TABLE_NAME: !Ref DynamoTable
    Handler: checker.handler
    Role:
      Fn::GetAtt:
      - CheckerLambdaRole
      - Arn
    Runtime: python3.6
    Timeout: 45
CheckerLambdaRole:
  Type: AWS::IAM::Role
  Properties:
    ManagedPolicyArns:
    - arn:aws:iam::aws:policy/AmazonDynamoDBFullAccess
    - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
    AssumeRolePolicyDocument:
      Version: '2012-10-17'
      Statement:
      - Action:
        - sts:AssumeRole
        Effect: Allow
        Principal:
          Service:
          - lambda.amazonaws.com
CheckerLambdaTimer:
  Type: AWS::Events::Rule
  Properties:
    ScheduleExpression: rate(1 minute)
    Targets:
    - Id: CheckerLambdaTimerLambdaTarget
      Arn:
        Fn::GetAtt:
        - CheckerLambda
        - Arn
CheckerLambdaTimerPermission:
  Type: AWS::Lambda::Permission
  Properties:
    Action: lambda:invokeFunction
    FunctionName: !Ref CheckerLambda
    SourceArn:
      Fn::GetAtt:
      - CheckerLambdaTimer
      - Arn
    Principal: events.amazonaws.com

Let’s break that down a bit.

The CheckerLambda is the actual Lambda function. The Code section is a local path to a ZIP file containing the code and its dependencies. I’m using CloudFormation’s packaging feature to automatically push the deployable to S3.

The CheckerLambdaRole is the IAM role the Lambda will assume which grants it access to DynamoDB in addition to the usual Lambda logging permissions.

The CheckerLambdaTimer is the CloudWatch Events Rule that triggers the checker to run once per minute.

The CheckerLambdaTimerPermission grants CloudWatch the ability to invoke the checker Lambda function on its interval.

The Web Page Gateway
The API Gateway handles incoming requests for the web page, invokes the Lambda, and then returns the Lambda’s results as HTML content. Its template looks like:

# API Gateway for Web Page Lambda
PageGateway:
  Type: AWS::ApiGateway::RestApi
  Properties:
    Name: Service Checker Gateway
PageResource:
  Type: AWS::ApiGateway::Resource
  Properties:
    RestApiId: !Ref PageGateway
    ParentId:
      Fn::GetAtt:
      - PageGateway
      - RootResourceId
    PathPart: page
PageGatewayMethod:
  Type: AWS::ApiGateway::Method
  Properties:
    AuthorizationType: NONE
    HttpMethod: GET
    Integration:
      Type: AWS
      IntegrationHttpMethod: POST
      Uri:
        Fn::Sub: arn:aws:apigateway:${AWS::Region}:lambda:path/2015-03-31/functions/${WebRenderLambda.Arn}/invocations
      RequestTemplates:
        application/json: |
          {
              "method": "$context.httpMethod",
              "body" : $input.json('$'),
              "headers": {
                  #foreach($param in $input.params().header.keySet())
                  "$param": "$util.escapeJavaScript($input.params().header.get($param))"
                  #if($foreach.hasNext),#end
                  #end
              }
          }
      IntegrationResponses:
      - StatusCode: 200
        ResponseParameters:
          method.response.header.Content-Type: "'text/html'"
        ResponseTemplates:
          text/html: "$input.path('$')"
    ResourceId: !Ref PageResource
    RestApiId: !Ref PageGateway
    MethodResponses:
    - StatusCode: 200
      ResponseParameters:
        method.response.header.Content-Type: true
PageGatewayProdStage:
  Type: AWS::ApiGateway::Stage
  Properties:
    DeploymentId: !Ref PageGatewayDeployment
    RestApiId: !Ref PageGateway
    StageName: Prod
PageGatewayDeployment:
  Type: AWS::ApiGateway::Deployment
  DependsOn: PageGatewayMethod
  Properties:
    RestApiId: !Ref PageGateway
    Description: PageGateway deployment
    StageName: Stage

There’s a lot going on here, but the real meat is in the PageGatewayMethod section. There are a couple properties that deviate from the default which is why we couldn’t use the SAM transformer.

First, we’re passing request headers through to the Lambda in theRequestTemplates section. I’m doing this so I can validate incoming auth headers. The API Gateway can do some types of auth, but I found it easier to check auth myself in the Lambda function since the Gateway is designed to handle API calls and not browser requests.

Next, note that in the IntegrationResponses section we’re defining the Content-Type header to be ‘text/html’ (with single-quotes) and defining the ResponseTemplate to be $input.path(‘$’). This is what makes the request render as a HTML page in your browser instead of just raw text.

Due to the StageName and PathPart values in the other sections, your actual page will be accessible at https://someId.execute-api.region.amazonaws.com/Prod/page. I have the page behind an existing reverse-proxy and give it a saner URL for end-users. The reverse proxy also attaches the auth header I mentioned above. If that header isn’t present, the Lambda will render an error page instead so the proxy can’t be bypassed.

The Web Page Rendering Lambda
This Lambda is invoked by calls to the API Gateway and looks like:

# Web Page Lambda
WebRenderLambda:
  Type: AWS::Lambda::Function
  Properties:
    Code: ./lambda.zip
    Environment:
      Variables:
        TABLE_NAME: !Ref DynamoTable
    Handler: web.handler
    Role:
      Fn::GetAtt:
      - WebRenderLambdaRole
      - Arn
    Runtime: python3.6
    Timeout: 30
WebRenderLambdaRole:
  Type: AWS::IAM::Role
  Properties:
    ManagedPolicyArns:
    - arn:aws:iam::aws:policy/AmazonDynamoDBReadOnlyAccess
    - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
    AssumeRolePolicyDocument:
      Version: '2012-10-17'
      Statement:
      - Action:
        - sts:AssumeRole
        Effect: Allow
        Principal:
          Service:
          - lambda.amazonaws.com
WebRenderLambdaGatewayPermission:
  Type: AWS::Lambda::Permission
  Properties:
    FunctionName: !Ref WebRenderLambda
    Action: lambda:invokeFunction
    Principal: apigateway.amazonaws.com
    SourceArn:
      Fn::Sub:
      - arn:aws:execute-api:${AWS::Region}:${AWS::AccountId}:${__ApiId__}/*/*/*
      - __ApiId__: !Ref PageGateway

The WebRenderLambda and WebRenderLambdaRole should look familiar.

The WebRenderLambdaGatewayPermission is similar to the Status Checker’s CloudWatch permission, only this time it allows the API Gateway to invoke this Lambda.

The DynamoDB Table
This one is straightforward.

# DynamoDB table
DynamoTable:
  Type: AWS::DynamoDB::Table
  Properties:
    AttributeDefinitions:
    - AttributeName: name
      AttributeType: S
    ProvisionedThroughput:
      WriteCapacityUnits: 1
      ReadCapacityUnits: 1
    TableName: status-page-checker-results
    KeySchema:
    - KeyType: HASH
      AttributeName: name

The Deployment
We’ve made it this far defining every resource in a template that we can check in to version control, so we might as well script the deployment as well rather than manually manage the CloudFormation Stack via the AWS web console.

Since I’m using the packaging feature, I first run:

$ aws cloudformation package \
    --template-file template.yaml \
    --s3-bucket <some-bucket-name> \
    --output-template-file template-packaged.yaml
Uploading to 34cd6e82c5e8205f9b35e71afd9e1548 1922559 / 1922559.0 (100.00%) Successfully packaged artifacts and wrote output template to file template-packaged.yaml.

Then to deploy the template (whether new or modified), I run:

$ aws cloudformation deploy \
    --region '<aws-region>' \
    --template-file template-packaged.yaml \
    --stack-name '<some-name>' \
    --capabilities CAPABILITY_IAM
Waiting for changeset to be created.. Waiting for stack create/update to complete Successfully created/updated stack - <some-name>

And that’s it! You’ve just created a dynamic web page that will never require you to SSH anywhere, patch a server, recover from a disaster after Amazon terminates your unhealthy EC2, or any other number of pitfalls that are now the problem of some ops person at AWS. And you can reproduce deployments and make changes with confidence because everything is defined in the template and can be tracked in version control.

HDD vs SSD: What Does the Future for Storage Hold?

Post Syndicated from Roderick Bauer original https://www.backblaze.com/blog/ssd-vs-hdd-future-of-storage/

SSD 60 TB drive

This is part one of a series. Use the Join button above to receive notification of future posts on this and other topics.

Customers frequently ask us whether and when we plan to move our cloud backup and data storage to SSDs (Solid-State Drives). That’s not a surprising question considering the many advantages SSDs have over magnetic platter type drives, also known as HDDs (Hard-Disk Drives).

We’re a large user of HDDs in our data centers (currently 100,000 hard drives holding over 500 petabytes of data). We want to provide the best performance, reliability, and economy for our cloud backup and cloud storage services, so we continually evaluate which drives to use for operations and in our data centers. While we use SSDs for some applications, which we’ll describe below, there are reasons why HDDs will continue to be the primary drives of choice for us and other cloud providers for the foreseeable future.

HDDs vs SSDs

HDD vs SSD

The laptop computer I am writing this on has a single 512GB SSD, which has become a common feature in higher end laptops. The SSD’s advantages for a laptop are easy to understand: they are smaller than an HDD, faster, quieter, last longer, and are not susceptible to vibration and magnetic fields. They also have much lower latency and access times.

Today’s typical online price for a 2.5” 512GB SSD is $140 to $170. The typical online price for a 3.5” 512 GB HDD is $44 to $65. That’s a pretty significant difference in price, but since the SSD helps make the laptop lighter, enables it to be more resistant to the inevitable shocks and jolts it will experience in daily use, and adds of benefits of faster booting, faster waking from sleep, and faster launching of applications and handling of big files, the extra cost for the SSD in this case is worth it.

Some of these SSD advantages, chiefly speed, also will apply to a desktop computer, so desktops are increasingly outfitted with SSDs, particularly to hold the operating system, applications, and data that is accessed frequently. Replacing a boot drive with an SSD has become a popular upgrade option to breathe new life into a computer, especially one that seems to take forever to boot or is used for notoriously slow-loading applications such as Photoshop.

We covered upgrading your computer with an SSD in our blog post SSD 101: How to Upgrade Your Computer With An SSD.

Data centers are an entirely different kettle of fish. The primary concerns for data center storage are reliability, storage density, and cost. While SSDs are strong in the first two areas, it’s the third where they are not yet competitive. At Backblaze we adopt higher density HDDs as they become available — we’re currently using both 10TB and 12TB drives (among other capacities) in our data centers. Higher density drives provide greater storage density per Storage Pod and Vault and reduce our overhead cost through less required maintenance and lower total power requirements. Comparable SSDs in those sizes would cost roughly $1,000 per terabyte, considerably higher than the corresponding HDD. Simply put, SSDs are not yet in the price range to make their use economical for the benefits they provide, which is the reason why we expect to be using HDDs as our primary storage media for the foreseeable future.

What Are HDDs?

HDDs have been around over 60 years since IBM introduced them in 1956. The first disk drive was the size of a car, stored a mere 3.75 megabytes, and cost $300,000 in today’s dollars.

IBM 350 Disk Storage System — 3.75MB in 1956

The 350 Disk Storage System was a major component of the IBM 305 RAMAC (Random Access Method of Accounting and Control) system, which was introduced in September 1956. It consisted of 40 platters and a dual read/write head on a single arm that moved up and down the stack of magnetic disk platters.

The basic mechanism of an HDD remains unchanged since then, though it has undergone continual refinement. An HDD uses magnetism to store data on a rotating platter. A read/write head is affixed to an arm that floats above the spinning platter reading and writing data. The faster the platter spins, the faster an HDD can perform. Typical laptop drives today spin at either 5400 RPM (revolutions per minute) or 7200 RPM, though some server-based platters spin at even higher speeds.

Exploded drawing of a hard drive

Exploded drawing of a hard drive

The platters inside the drives are coated with a magnetically sensitive film consisting of tiny magnetic grains. Data is recorded when a magnetic write-head flies just above the spinning disk; the write head rapidly flips the magnetization of one magnetic region of grains so that its magnetic pole points up or down, to encode a 1 or a 0 in binary code. If all this sounds like an HDD is vulnerable to shocks and vibration, you’d be right. They also are vulnerable to magnets, which is one way to destroy the data on an HDD if you’re getting rid of it.

The major advantage of an HDD is that it can store lots of data cheaply. One and two terabyte (1,024 and 2,048 gigabytes) hard drives are not unusual for a laptop these days, and 10TB and 12TB drives are now available for desktops and servers. Densities and rotation speeds continue to grow. However, if you compare the cost of common HDDs vs SSDs for sale online, the SSDs are roughly 3-5x the cost per gigabyte. So if you want cheap storage and lots of it, using a standard hard drive is definitely the more economical way to go.

What are the best uses for HDDs?

  • Disk arrays (NAS, RAID, etc.) where high capacity is needed
  • Desktops when low cost is priority
  • Media storage (photos, videos, audio not currently being worked on)
  • Drives with extreme number of reads and writes

What Are SSDs?

SSDs go back almost as far as HDDs, with the first semiconductor storage device compatible with a hard drive interface introduced in 1978, the StorageTek 4305.

Storage Technology 4305 SSD

The StorageTek was an SSD aimed at the IBM mainframe compatible market. The STC 4305 was seven times faster than IBM’s popular 2305 HDD system (and also about half the price). It consisted of a cabinet full of charge-coupled devices and cost $400,000 for 45MB capacity with throughput speeds up to 1.5 MB/sec.

SSDs are based on a type of non-volatile memory called NAND (named for the Boolean operator “NOT AND,” and one of two main types of flash memory). Flash memory stores data in individual memory cells, which are made of floating-gate transistors. Though they are semiconductor-based memory, they retain their information when no power is applied to them — a feature that’s obviously a necessity for permanent data storage.

Samsung SSD

Samsung SSD 850 Pro

Compared to an HDD, SSDs have higher data-transfer rates, higher areal storage density, better reliability, and much lower latency and access times. For most users, it’s the speed of an SSD that primarily attracts them. When discussing the speed of drives, what we are referring to is the speed at which they can read and write data.

For HDDs, the speed at which the platters spin strongly determines the read/write times. When data on an HDD is accessed, the read/write head must physically move to the location where the data was encoded on a magnetic section on the platter. If the file being read was written sequentially to the disk, it will be read quickly. As more data is written to the disk, however, it’s likely that the file will be written across multiple sections, resulting in fragmentation of the data. Fragmented data takes longer to read with an HDD as the read head has to move to different areas of the platter(s) to completely read all the data requested.

Because SSDs have no moving parts, they can operate at speeds far above those of a typical HDD. Fragmentation is not an issue for SSDs. Files can be written anywhere with little impact on read/write times, resulting in read times far faster than any HDD, regardless of fragmentation.

Samsung SSD 850 Pro (back)

Due to the way data is written and read to the drive, however, SSD cells can wear out over time. SSD cells push electrons through a gate to set its state. This process wears on the cell and over time reduces its performance until the SSD wears out. This effect takes a long time and SSDs have mechanisms to minimize this effect, such as the TRIM command. Flash memory writes an entire block of storage no matter how few pages within the block are updated. This requires reading and caching the existing data, erasing the block and rewriting the block. If an empty block is available, a write operation is much faster. The TRIM command, which must be supported in both the OS and the SSD, enables the OS to inform the drive which blocks are no longer needed. It allows the drive to erase the blocks ahead of time in order to make empty blocks available for subsequent writes.

The effect of repeated reading and erasing on an SSD is cumulative and an SSD can slow down and even display errors with age. It’s more likely, however, that the system using the SSD will be discarded for obsolescence before the SSD begins to display read/write errors. Hard drives eventually wear out from constant use as well, since they use physical recording methods, so most users won’t base their selection of an HDD or SSD drive based on expected longevity.

SSD internals

SSD circuit board

Overall, SSDs are considered far more durable than HDDs due to a lack of mechanical parts. The moving mechanisms within an HDD are susceptible to not only wear and tear over time, but to damage due to movement or forceful contact. If one were to drop a laptop with an HDD, there is a high likelihood that all those moving parts will collide, resulting in potential data loss and even destructive physical damage that could kill the HDD outright. SSDs have no moving parts so, while they hold the risk of a potentially shorter life span due to high use, they can survive the rigors we impose upon our portable devices and laptops.

What are the best uses for SSDs?

  • Notebooks, laptops, where performance, lightweight, areal storage density, resistance to shock and general ruggedness are desirable
  • Boot drives holding operating system and applications, which will speed up booting and application launching
  • Working files (media that is being edited: photos, video, audio, etc.)
  • Swap drives where SSD will speed up disk paging
  • Cache drives
  • Database servers
  • Revitalizing an older computer. If you’ve got a computer that seems slow to start up and slow to load applications and files, updating the boot drive with an SSD could make it seem, if not new, at least as if it just came back refreshed from spending some time on the beach.

Stay Tuned for Part 2 of HDD vs SSD

That’s it for part 1. In our second part we’ll take a deeper look at the differences between HDDs and SSDs, how both HDD and SSD technologies are evolving, and how Backblaze takes advantage of SSDs in our operations and data centers.

Here's a tip!Here’s a tip on finding all the posts tagged with SSD on our blog. Just follow https://www.backblaze.com/blog/tag/ssd/.

Don’t miss future posts on HDDs, SSDs, and other topics, including hard drive stats, cloud storage, and tips and tricks for backing up to the cloud. Use the Join button above to receive notification of future posts on our blog.

The post HDD vs SSD: What Does the Future for Storage Hold? appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Welcome Alex!

Post Syndicated from Yev original https://www.backblaze.com/blog/welcome-alex/

As we sail past 500 Petabytes of data stored, our Operations Department continues to grow. To that end we’ve added a brand new member to our Operations and Engineering teams, Alex! He straddles the line between Ops and Engineering, working on both internal and external systems – making sure they run smoothly. Lets learn a bit more about Alex, shall we?

What is your Backblaze Title?
Operations Engineer.

Where are you originally from?
Chicago, IL.

What attracted you to Backblaze?
The company mission and overall transparency really appealed to me. It was a great opportunity to work in an environment that aligned with my core values.

What do you expect to learn while being at Backblaze?
I expect to learn more about modern cloud technologies and data center deployments.

Where else have you worked?
I worked at a startup called Cleversafe out of college which was later acquired by IBM.

Where did you go to school?
University of Illinois at Urbana-Champaign

What’s your dream job?
NHL general manager.

Favorite place you’ve traveled?
The Cinque Terre. It’s a set of small towns along the Italian Riviera that has great hiking.

Favorite hobby?
Playing hockey.

Of what achievement are you most proud?
Graduating college with two engineering degrees.

Star Trek or Star Wars?
Neither.

Favorite food?
Homemade pizza.

Why do you like certain things?
Life’s too short not to like certain things.

Cinque Terre is definitely one of the most beautiful places on earth, as long as you don’t visit on a foggy day! If you happen to find the perfect NHL job, we’ll understand! Oh, and thanks for bringing Cosmo to the office on occasion! Welcome aboard!

The post Welcome Alex! appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Happy birthday to us!

Post Syndicated from Eben Upton original https://www.raspberrypi.org/blog/happy-birthday-2018/

The eagle-eyed among you may have noticed that today is 28 February, which is as close as you’re going to get to our sixth birthday, given that we launched on a leap day. For the last three years, we’ve launched products on or around our birthday: Raspberry Pi 2 in 2015; Raspberry Pi 3 in 2016; and Raspberry Pi Zero W in 2017. But today is a snow day here at Pi Towers, so rather than launching something, we’re taking a photo tour of the last six years of Raspberry Pi products before we don our party hats for the Raspberry Jam Big Birthday Weekend this Saturday and Sunday.

Prehistory

Before there was Raspberry Pi, there was the Broadcom BCM2763 ‘micro DB’, designed, as it happens, by our very own Roger Thornton. This was the first thing we demoed as a Raspberry Pi in May 2011, shown here running an ARMv6 build of Ubuntu 9.04.

BCM2763 micro DB

Ubuntu on Raspberry Pi, 2011-style

A few months later, along came the first batch of 50 “alpha boards”, designed for us by Broadcom. I used to have a spreadsheet that told me where in the world each one of these lived. These are the first “real” Raspberry Pis, built around the BCM2835 application processor and LAN9512 USB hub and Ethernet adapter; remarkably, a software image taken from the download page today will still run on them.

Raspberry Pi alpha board, top view

Raspberry Pi alpha board

We shot some great demos with this board, including this video of Quake III:

Raspberry Pi – Quake 3 demo

A little something for the weekend: here’s Eben showing the Raspberry Pi running Quake 3, and chatting a bit about the performance of the board. Thanks to Rob Bishop and Dave Emett for getting the demo running.

Pete spent the second half of 2011 turning the alpha board into a shippable product, and just before Christmas we produced the first 20 “beta boards”, 10 of which were sold at auction, raising over £10000 for the Foundation.

The beginnings of a Bramble

Beta boards on parade

Here’s Dom, demoing both the board and his excellent taste in movie trailers:

Raspberry Pi Beta Board Bring up

See http://www.raspberrypi.org/ for more details, FAQ and forum.

Launch

Rather to Pete’s surprise, I took his beta board design (with a manually-added polygon in the Gerbers taking the place of Paul Grant’s infamous red wire), and ordered 2000 units from Egoman in China. After a few hiccups, units started to arrive in Cambridge, and on 29 February 2012, Raspberry Pi went on sale for the first time via our partners element14 and RS Components.

Pallet of pis

The first 2000 Raspberry Pis

Unboxing continues

The first Raspberry Pi from the first box from the first pallet

We took over 100000 orders on the first day: something of a shock for an organisation that had imagined in its wildest dreams that it might see lifetime sales of 10000 units. Some people who ordered that day had to wait until the summer to finally receive their units.

Evolution

Even as we struggled to catch up with demand, we were working on ways to improve the design. We quickly replaced the USB polyfuses in the top right-hand corner of the board with zero-ohm links to reduce IR drop. If you have a board with polyfuses, it’s a real limited edition; even more so if it also has Hynix memory. Pete’s “rev 2” design made this change permanent, tweaked the GPIO pin-out, and added one much-requested feature: mounting holes.

Revision 1 versus revision 2

If you look carefully, you’ll notice something else about the revision 2 board: it’s made in the UK. 2012 marked the start of our relationship with the Sony UK Technology Centre in Pencoed, South Wales. In the five years since, they’ve built every product we offer, including more than 12 million “big” Raspberry Pis and more than one million Zeros.

Celebrating 500,000 Welsh units, back when that seemed like a lot

Economies of scale, and the decline in the price of SDRAM, allowed us to double the memory capacity of the Model B to 512MB in the autumn of 2012. And as supply of Model B finally caught up with demand, we were able to launch the Model A, delivering on our original promise of a $25 computer.

A UK-built Raspberry Pi Model A

In 2014, James took all the lessons we’d learned from two-and-a-bit years in the market, and designed the Model B+, and its baby brother the Model A+. The Model B+ established the form factor for all our future products, with a 40-pin extended GPIO connector, four USB ports, and four mounting holes.

The Raspberry Pi 1 Model B+ — entering the era of proper product photography with a bang.

New toys

While James was working on the Model B+, Broadcom was busy behind the scenes developing a follow-on to the BCM2835 application processor. BCM2836 samples arrived in Cambridge at 18:00 one evening in April 2014 (chips never arrive at 09:00 — it’s always early evening, usually just before a public holiday), and within a few hours Dom had Raspbian, and the usual set of VideoCore multimedia demos, up and running.

We launched Raspberry Pi 2 at the start of 2015, pairing BCM2836 with 1GB of memory. With a quad-core Arm Cortex-A7 clocked at 900MHz, we’d increased performance sixfold, and memory fourfold, in just three years.

Nobody mention the xenon death flash.

And of course, while James was working on Raspberry Pi 2, Broadcom was developing BCM2837, with a quad-core 64-bit Arm Cortex-A53 clocked at 1.2GHz. Raspberry Pi 3 launched barely a year after Raspberry Pi 2, providing a further doubling of performance and, for the first time, wireless LAN and Bluetooth.

All our recent products are just the same board shot from different angles

Zero to hero

Where the PC industry has historically used Moore’s Law to “fill up” a given price point with more performance each year, the original Raspberry Pi used Moore’s law to deliver early-2000s PC performance at a lower price. But with Raspberry Pi 2 and 3, we’d gone back to filling up our original $35 price point. After the launch of Raspberry Pi 2, we started to wonder whether we could pull the same trick again, taking the original Raspberry Pi platform to a radically lower price point.

The result was Raspberry Pi Zero. Priced at just $5, with a 1GHz BCM2835 and 512MB of RAM, it was cheap enough to bundle on the front of The MagPi, making us the first computer magazine to give away a computer as a cover gift.

Cheap thrills

MagPi issue 40 in all its glory

We followed up with the $10 Raspberry Pi Zero W, launched exactly a year ago. This adds the wireless LAN and Bluetooth functionality from Raspberry Pi 3, using a rather improbable-looking PCB antenna designed by our buddies at Proant in Sweden.

Up to our old tricks again

Other things

Of course, this isn’t all. There has been a veritable blizzard of point releases; RAM changes; Chinese red units; promotional blue units; Brazilian blue-ish units; not to mention two Camera Modules, in two flavours each; a touchscreen; the Sense HAT (now aboard the ISS); three compute modules; and cases for the Raspberry Pi 3 and the Zero (the former just won a Design Effectiveness Award from the DBA). And on top of that, we publish three magazines (The MagPi, Hello World, and HackSpace magazine) and a whole host of Project Books and Essentials Guides.

Chinese Raspberry Pi 1 Model B

RS Components limited-edition blue Raspberry Pi 1 Model B

Brazilian-market Raspberry Pi 3 Model B

Visible-light Camera Module v2

Learning about injection moulding the hard way

250 pages of content each month, every month

Essential reading

Forward the Foundation

Why does all this matter? Because we’re providing everyone, everywhere, with the chance to own a general-purpose programmable computer for the price of a cup of coffee; because we’re giving people access to tools to let them learn new skills, build businesses, and bring their ideas to life; and because when you buy a Raspberry Pi product, every penny of profit goes to support the Raspberry Pi Foundation in its mission to change the face of computing education.

We’ve had an amazing six years, and they’ve been amazing in large part because of the community that’s grown up alongside us. This weekend, more than 150 Raspberry Jams will take place around the world, comprising the Raspberry Jam Big Birthday Weekend.

Raspberry Pi Big Birthday Weekend 2018. GIF with confetti and bopping JAM balloons

If you want to know more about the Raspberry Pi community, go ahead and find your nearest Jam on our interactive map — maybe we’ll see you there.

The post Happy birthday to us! appeared first on Raspberry Pi.

Ode to ‘Locate My Computer’

Post Syndicated from Yev original https://www.backblaze.com/blog/laptop-locator-can-save-you/

Laptop locator signal

Some things don’t get the credit they deserve. For one of our engineers, Billy, the Locate My Computer feature is near and dear to his heart. It took him a while to build it, and it requires some regular updates, even after all these years. Billy loves the Locate My Computer feature, but really loves knowing how it’s helped customers over the years. One recent story made us decide to write a bit of a greatest hits post as an ode to one of our favorite features — Locate My Computer.

What is it?

Locate My Computer, as you’ll read in the stories below, came about because some of our users had their computers stolen and were trying to find a way to retrieve their devices. They realized that while some of their programs and services like Find My Mac were wiped, in some cases, Backblaze was still running in the background. That created the ability to use our software to figure out where the computer was contacting us from. After manually helping some of the individuals that wrote in, we decided to build it in as a feature. Little did we know the incredible stories it would lead to. We’ll get into that, but first, a little background on why the whole thing came about.

Identifying the Customer Need

“My friend’s laptop was stolen. He tracked the thief via @Backblaze for weeks & finally identified him on Facebook & Twitter. Digital 007.”

Mat —
In December 2010, we saw a tweet from @DigitalRoyalty which read: “My friend’s laptop was stolen. He tracked the thief via @Backblaze for weeks & finally identified him on Facebook & Twitter. Digital 007.” Our CEO was manning Twitter at the time and reached out for the whole story. It turns out that Mat Miller had his laptop stolen, and while he was creating some restores a few days later, he noticed a new user was created on his computer and was backing up data. He restored some of those files, saw some information that could help identify the thief, and filed a police report. Read the whole story: Digital 007 — Outwitting The Thief.

Mark —
Following Mat Miller’s story we heard from Mark Bao, an 18-year old entrepreneur and student at Bentley University who had his laptop stolen. The laptop was stolen out of Mark’s dorm room and the thief started using it in a variety of ways, including audition practice for Dancing with the Stars. Once Mark logged in to Backblaze and saw that there were new files being uploaded, including a dance practice video, he was able to reach out to campus police and got his laptop back. You can read more about the story on: 18 Year Old Catches Thief Using Backblaze.

After Mat and Mark’s story we thought we were onto something. In addition to those stories that had garnered some media attention, we would occasionally get requests from users that said something along the lines of, “Hey, my laptop was stolen, but I had Backblaze installed. Could you please let me know if it’s still running, and if so, what the IP address is so that I can go to the authorities?” We would help them where we could, but knew that there was probably a much more efficient method of helping individuals and businesses keep track of their computers.

Some of the Greatest Hits, and the Mafia Story

In May of 2011, we launched “Locate My Computer.” This was our way of adding a feature to our already-popular backup client that would allow users to see a rough representation of where their computer was located, and the IP address associated with its last known transmission. After speaking to law enforcement, we learned that those two things were usually enough for the authorities to subpoena an ISP and get the physical address of the last known place the computer phoned home from. From there, they could investigate and, if the device was still there, return it to its rightful owner.

Bridgette —
Once the feature went live the stories got even more interesting. Almost immediately after we launched Locate My Computer, we were contacted by Bridgette, who told us of a break-in at her house. Luckily no one was home at the time, but the thief was able to get away with her iMac, DSLR, and a few other prized possessions. As soon as she reported the robbery to the police, they were able to use the Locate My Computer feature to find the thief’s location and recover her missing items. We even made a case study out of Bridgette’s experience. You can read it at: Backblaze And The Stolen iMac.

“Joe” —
The crazy recovery stories didn’t end there. Shortly after Bridgette’s story, we received an email from a user (“Joe” — to protect the innocent) who was traveling to Argentina from the United States and had his laptop stolen. After he contacted the police department in Buenos Aires, and explained to them that he was using Backblaze (which the authorities thought was a computer tracking service, and in this case, we were), they were able to get the location of the computer from an ISP in Argentina. When they went to investigate, they realized that the perpetrators were foreign nationals connected to the mafia, and that in addition to a handful of stolen laptops, they were also in the possession of over $1,000,000 in counterfeit currency! Read the whole story about “Joe” and how: Backblaze Found $1 Million in Counterfeit Cash!

The Maker —
After “Joe,” we thought that our part in high-profile “busts was over, but we were wrong. About a year later we received word from a “maker” who told us that he was able to act as an “internet super-sleuth” and worked hard to find his stolen computer. After a Maker Faire in Detroit, the maker’s car was broken into while they were getting BBQ following a successful show. While some of the computers were locked and encrypted, others were in hibernation mode and wide open to prying eyes. After the police report was filed, the maker went to Backblaze to retrieve his lost files and remembered seeing the little Locate My Computer button. That’s when the story gets really interesting. The victim used a combination of ingenuity, Craigslist, Backblaze, and the local police department to get his computer back, and make a drug bust along the way. Head over to Makezine.com to read about how:How Tracking Down My Stolen Computer Triggered a Drug Bust.

Una —
While we kept hearing praise and thanks from our customers who were able to recover their data and find their computers, a little while passed before we would hear a story that was as incredible as the ones above. In July of 2016, we received an email from Una who told us one of the most amazing stories of perseverance that we’d ever heard. With the help of Backblaze and a sympathetic constable in Australia, Una tracked her stolen computer’s journey across 6 countries. She got her computer back and we wrote up the whole story: How Una Found Her Stolen Laptop.

And the Hits Keep on Coming

The most recent story came from “J,” and we’ll share the whole thing with you because it has a really nice conclusion:

Back in September of 2017, I brought my laptop to work to finish up some administrative work before I took off for a vacation. I work in a mall where traffic [is] plenty and more specifically I work at a kiosk in the middle of the mall. This allows for a high amount of traffic passing by every few seconds. I turned my back for about a minute to put away some paperwork. At the time I didn’t notice my laptop missing. About an hour later when I was gathering my belongings for the day I noticed it was gone. I was devastated. This was a high end MacBook Pro that I just purchased. So we are not talking about a little bit of money here. This was a major investment.

Time [went] on. When I got back from my vacation I reached out to my LP (Loss Prevention) team to get images from our security to submit to the police with some thread of hope that they would find whomever stole it. December approached and I did not hear anything. I gave up hope and assumed that the laptop was scrapped. I put an iCloud lock on it and my Find My Mac feature was saying that laptop was “offline.” I just assumed that they opened it, saw it was locked, and tried to scrap it for parts.

Towards the end of January I got an email from Backblaze saying that the computer was successfully backed up. This came as a shock to me as I thought it was wiped. But I guess however they wiped it didn’t remove Backblaze from the SSD. None the less, I was very happy. I sifted through the backup and found the person’s name via the search history. Then, using the Locate my Computer feature I saw where it came online. I reached out on social media to the person in question and updated the police. I finally got ahold of the person who stated she bought it online a few weeks backs. We made arrangements and I’m happy to say that I am typing this email on my computer right now.

J finished by writing: “Not only did I want to share this story with you but also wanted to say thanks! Apple’s find my computer system failed. The police failed to find it. But Backblaze saved the day. This has been the best $5 a month I have ever spent. Not only that but I got all my stuff back. Which made the deal even better! It was like it was never gone.”

Have a Story of Your Own?

We’re more than thrilled to have helped all of these people restore their lost data using Backblaze. Recovering the actual machine using Locate My Computer though, that’s the icing on the cake. We’re proud of what we’ve been able to build here at Backblaze, and we really enjoy hearing stories from people who have used our service to successfully get back up and running, whether that meant restoring their data or recovering their actual computer.

If you have any interesting data recovery or computer recovery stories that you’d like to share with us, please email press@backblaze.com and we’ll share it with Billy and the rest of the Backblaze team. We love hearing them!

The post Ode to ‘Locate My Computer’ appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Dutch Continue to Curb Illegal Downloading But What About Streaming?

Post Syndicated from Andy original https://torrentfreak.com/dutch-continue-to-curb-illegal-downloading-but-what-about-streaming-180222/

After many years of downloading content with impunity, 2014 brought a culture shock to the Dutch.

Citizens were previously allowed to obtain content for their own use due to a levy on blank media that compensated rightsholders. However, the European Court of Justice found that system to be illegal and the government quickly moved to ban downloading from unauthorized sources.

In the four years that have passed since the ban, the downloading landscape has undergone change. That’s according to a study published by the Consumer Insights panel at Telecompaper which found that while 41% of respondents downloaded movies, TV shows, music and games from unauthorized sources in 2013, the figure had plunged to 27% at the end of 2016. There was a further drop to 24% by the end of 2017.

Of the people who continue to download illegally, men are overrepresented, the study found. While 27% of men obtained media for free during the last year to October 2017, only 21% of women did likewise.

While as many as 150 million people still use P2P technologies such as BitTorrent worldwide, there is a general decline in usage and this is reflected in the report.

In 2013, 18% of Dutch respondents used torrent-like systems to download, a figure that had fallen to 8% in 2016 and 6% last year. Again, male participants were overrepresented, outnumbering women by two to one. However, people appear to be visiting P2P networks less.

“The study showed that people who reported using P2P to download content, have done so on average 37 times a year [to October 2017]. In January of 2017 it was significantly higher, 61 times,” the study notes. P2P usage in November 2015 was rated at 98 instances per year.

Perhaps surprisingly, one of the oldest methods of downloading content has maintained its userbase in more recent years. Usenet, otherwise known as the newsgroups, accounted for 9% of downloaders in 2013 but after falling to around 6% of downloaders in 2016, that figure remained unchanged in 2017. Almost five times more men used newsgroups than women.

At the same time as showing a steady trend in terms of users, instances of newsgroup downloading are reportedly up in the latest count. In November 2015, people used the system an average of 98 times per year but in January 2017 that had fallen to 66 times. The latest figures find an average use of 68 times per year.

Drilling down into more obscure systems, 2% of respondents told Telecompaper that they’d used an FTP server during the past year, a method that was entirely dominated by men.

While the Dutch downloading ban in 2013 may have played some part in changing perceptions, the increased availability of legal offers cannot be ignored. Films and TV shows are now widely available on services such as Netflix and Amazon, while music is strongly represented via Spotify, Apple, Deezer and similar platforms.

Indeed, 12% of respondents said they are now downloading less illegally because it’s easier to obtain paid content, that’s versus 11% at the start of 2017 and just 3% in 2013. Interestingly, 14% of respondents this time around said their illegal downloads are down because they have more restrictions on their time.

Another interesting reason given for downloading less is that pirate content is becoming harder to find. In 2013, just 4% cited this as a cause for reduction yet in 2017, this had jumped to 8% of respondents, with blocked sites proving a stumbling block for some users.

On the other hand, 3% of respondents said that since content had become easier to find, they are now downloading more. However, that figure is down from 13% in November 2013 and 6% in January 2017.

But with legal streaming certainly making its mark in the Netherlands, the illegal streaming phenomenon isn’t directly addressed in the report. It is likely that a considerable number of citizens are now using this method to obtain their content fix in a way that’s not as easily trackable as torrent-like systems.

Furthermore, given the plans of local film distribution Dutch FilmWorks to chase and demand cash settlements from BitTorrent users, it’s likely that traffic to streaming sites will only increase in the months to come, at least for those looking to consume TV shows and movies.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN discounts, offers and coupons

Pirate Site Admin Sentenced to Two Years Prison & €83.6 Million Damages

Post Syndicated from Andy original https://torrentfreak.com/pirate-site-admin-sentenced-to-two-years-prison-e83-6-million-damages-180221/

Way back in 2011, Streamiz was reported to be the second most popular pirate streaming site in France with around 250,000 visitors per day. The site didn’t host its own content but linked to movies elsewhere.

This prominent status soon attracted the attention of various entertainment companies including the National Federation of Film Distributors (FNDF) which filed a complaint against the site back in 2009.

Investigators eventually traced the presumed operator of the site to a location in the Hauts-de-Seine region of France. In October 2011 he was arrested leaving his Montrouge home in the southern Parisian suburbs. His backpack reportedly contained socks stuffed with almost 30,000 euros in cash.

The man was ordered to appear before the investigating judge but did not attend. He also failed to appear during his sentencing this Monday, which may or may not have been a good thing, depending on one’s perspective.

In his absence, the now 41-year-old was found guilty of copyright infringement offenses and handed one of the toughest sentences ever in a case of its type.

According to an AFP report, when the authorities can catch up with him the man must not only serve two years in prison but also pay a staggering 83.6 million euros in damages to Disney, 20th Century Fox, Warner Bros and SACEM, the Society of Authors, Composers and Music Publishers.

Streamiz is now closed but at its peak offered around 40,000 movies to millions of users per month. In total, the site stood accused of around 500,000,000 infringements, earning its operator an estimated 150,000 euros in advertising revenue over a two year period.

“This is a clear case of commercial counterfeiting” based on a “very structured” system, David El Sayegh, Secretary General of SACEM, told AFP. His sentence “sends a very clear message: there will be no impunity for pirates,” he added.

With an arrest warrant still outstanding, the former Streamiz admin is now on the run with very few options available to him. Certainly, the 83.6 million euro fine won’t ever be paid but the prison sentence is something he might need to get behind him.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN discounts, offers and coupons

How I built a data warehouse using Amazon Redshift and AWS services in record time

Post Syndicated from Stephen Borg original https://aws.amazon.com/blogs/big-data/how-i-built-a-data-warehouse-using-amazon-redshift-and-aws-services-in-record-time/

This is a customer post by Stephen Borg, the Head of Big Data and BI at Cerberus Technologies.

Cerberus Technologies, in their own words: Cerberus is a company founded in 2017 by a team of visionary iGaming veterans. Our mission is simple – to offer the best tech solutions through a data-driven and a customer-first approach, delivering innovative solutions that go against traditional forms of working and process. This mission is based on the solid foundations of reliability, flexibility and security, and we intend to fundamentally change the way iGaming and other industries interact with technology.

Over the years, I have developed and created a number of data warehouses from scratch. Recently, I built a data warehouse for the iGaming industry single-handedly. To do it, I used the power and flexibility of Amazon Redshift and the wider AWS data management ecosystem. In this post, I explain how I was able to build a robust and scalable data warehouse without the large team of experts typically needed.

In two of my recent projects, I ran into challenges when scaling our data warehouse using on-premises infrastructure. Data was growing at many tens of gigabytes per day, and query performance was suffering. Scaling required major capital investment for hardware and software licenses, and also significant operational costs for maintenance and technical staff to keep it running and performing well. Unfortunately, I couldn’t get the resources needed to scale the infrastructure with data growth, and these projects were abandoned. Thanks to cloud data warehousing, the bottleneck of infrastructure resources, capital expense, and operational costs have been significantly reduced or have totally gone away. There is no more excuse for allowing obstacles of the past to delay delivering timely insights to decision makers, no matter how much data you have.

With Amazon Redshift and AWS, I delivered a cloud data warehouse to the business very quickly, and with a small team: me. I didn’t have to order hardware or software, and I no longer needed to install, configure, tune, or keep up with patches and version updates. Instead, I easily set up a robust data processing pipeline and we were quickly ingesting and analyzing data. Now, my data warehouse team can be extremely lean, and focus more time on bringing in new data and delivering insights. In this post, I show you the AWS services and the architecture that I used.

Handling data feeds

I have several different data sources that provide everything needed to run the business. The data includes activity from our iGaming platform, social media posts, clickstream data, marketing and campaign performance, and customer support engagements.

To handle the diversity of data feeds, I developed abstract integration applications using Docker that run on Amazon EC2 Container Service (Amazon ECS) and feed data to Amazon Kinesis Data Streams. These data streams can be used for real time analytics. In my system, each record in Kinesis is preprocessed by an AWS Lambda function to cleanse and aggregate information. My system then routes it to be stored where I need on Amazon S3 by Amazon Kinesis Data Firehose. Suppose that you used an on-premises architecture to accomplish the same task. A team of data engineers would be required to maintain and monitor a Kafka cluster, develop applications to stream data, and maintain a Hadoop cluster and the infrastructure underneath it for data storage. With my stream processing architecture, there are no servers to manage, no disk drives to replace, and no service monitoring to write.

Setting up a Kinesis stream can be done with a few clicks, and the same for Kinesis Firehose. Firehose can be configured to automatically consume data from a Kinesis Data Stream, and then write compressed data every N minutes to Amazon S3. When I want to process a Kinesis data stream, it’s very easy to set up a Lambda function to be executed on each message received. I can just set a trigger from the AWS Lambda Management Console, as shown following.

I also monitor the duration of function execution using Amazon CloudWatch and AWS X-Ray.

Regardless of the format I receive the data from our partners, I can send it to Kinesis as JSON data using my own formatters. After Firehose writes this to Amazon S3, I have everything in nearly the same structure I received but compressed, encrypted, and optimized for reading.

This data is automatically crawled by AWS Glue and placed into the AWS Glue Data Catalog. This means that I can immediately query the data directly on S3 using Amazon Athena or through Amazon Redshift Spectrum. Previously, I used Amazon EMR and an Amazon RDS–based metastore in Apache Hive for catalog management. Now I can avoid the complexity of maintaining Hive Metastore catalogs. Glue takes care of high availability and the operations side so that I know that end users can always be productive.

Working with Amazon Athena and Amazon Redshift for analysis

I found Amazon Athena extremely useful out of the box for ad hoc analysis. Our engineers (me) use Athena to understand new datasets that we receive and to understand what transformations will be needed for long-term query efficiency.

For our data analysts and data scientists, we’ve selected Amazon Redshift. Amazon Redshift has proven to be the right tool for us over and over again. It easily processes 20+ million transactions per day, regardless of the footprint of the tables and the type of analytics required by the business. Latency is low and query performance expectations have been more than met. We use Redshift Spectrum for long-term data retention, which enables me to extend the analytic power of Amazon Redshift beyond local data to anything stored in S3, and without requiring me to load any data. Redshift Spectrum gives me the freedom to store data where I want, in the format I want, and have it available for processing when I need it.

To load data directly into Amazon Redshift, I use AWS Data Pipeline to orchestrate data workflows. I create Amazon EMR clusters on an intra-day basis, which I can easily adjust to run more or less frequently as needed throughout the day. EMR clusters are used together with Amazon RDS, Apache Spark 2.0, and S3 storage. The data pipeline application loads ETL configurations from Spring RESTful services hosted on AWS Elastic Beanstalk. The application then loads data from S3 into memory, aggregates and cleans the data, and then writes the final version of the data to Amazon Redshift. This data is then ready to use for analysis. Spark on EMR also helps with recommendations and personalization use cases for various business users, and I find this easy to set up and deliver what users want. Finally, business users use Amazon QuickSight for self-service BI to slice, dice, and visualize the data depending on their requirements.

Each AWS service in this architecture plays its part in saving precious time that’s crucial for delivery and getting different departments in the business on board. I found the services easy to set up and use, and all have proven to be highly reliable for our use as our production environments. When the architecture was in place, scaling out was either completely handled by the service, or a matter of a simple API call, and crucially doesn’t require me to change one line of code. Increasing shards for Kinesis can be done in a minute by editing a stream. Increasing capacity for Lambda functions can be accomplished by editing the megabytes allocated for processing, and concurrency is handled automatically. EMR cluster capacity can easily be increased by changing the master and slave node types in Data Pipeline, or by using Auto Scaling. Lastly, RDS and Amazon Redshift can be easily upgraded without any major tasks to be performed by our team (again, me).

In the end, using AWS services including Kinesis, Lambda, Data Pipeline, and Amazon Redshift allows me to keep my team lean and highly productive. I eliminated the cost and delays of capital infrastructure, as well as the late night and weekend calls for support. I can now give maximum value to the business while keeping operational costs down. My team pushed out an agile and highly responsive data warehouse solution in record time and we can handle changing business requirements rapidly, and quickly adapt to new data and new user requests.


Additional Reading

If you found this post useful, be sure to check out Deploy a Data Warehouse Quickly with Amazon Redshift, Amazon RDS for PostgreSQL and Tableau Server and Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift.


About the Author

Stephen Borg is the Head of Big Data and BI at Cerberus Technologies. He has a background in platform software engineering, and first became involved in data warehousing using the typical RDBMS, SQL, ETL, and BI tools. He quickly became passionate about providing insight to help others optimize the business and add personalization to products. He is now the Head of Big Data and BI at Cerberus Technologies.

 

 

 

Internet Security Threats at the Olympics

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2018/02/internet_securi.html

There are a lot:

The cybersecurity company McAfee recently uncovered a cyber operation, dubbed Operation GoldDragon, attacking South Korean organizations related to the Winter Olympics. McAfee believes the attack came from a nation state that speaks Korean, although it has no definitive proof that this is a North Korean operation. The victim organizations include ice hockey teams, ski suppliers, ski resorts, tourist organizations in Pyeongchang, and departments organizing the Pyeongchang Olympics.

Meanwhile, a Russia-linked cyber attack has already stolen and leaked documents from other Olympic organizations. The so-called Fancy Bear group, or APT28, began its operations in late 2017 –­ according to Trend Micro and Threat Connect, two private cybersecurity firms­ — eventually publishing documents in 2018 outlining the political tensions between IOC officials and World Anti-Doping Agency (WADA) officials who are policing Olympic athletes. It also released documents specifying exceptions to anti-doping regulations granted to specific athletes (for instance, one athlete was given an exception because of his asthma medication). The most recent Fancy Bear leak exposed details about a Canadian pole vaulter’s positive results for cocaine. This group has targeted WADA in the past, specifically during the 2016 Rio de Janeiro Olympics. Assuming the attribution is right, the action appears to be Russian retaliation for the punitive steps against Russia.

A senior analyst at McAfee warned that the Olympics may experience more cyber attacks before closing ceremonies. A researcher at ThreatConnect asserted that organizations like Fancy Bear have no reason to stop operations just because they’ve already stolen and released documents. Even the United States Department of Homeland Security has issued a notice to those traveling to South Korea to remind them to protect themselves against cyber risks.

One presumes the Olympics network is sufficiently protected against the more pedestrian DDoS attacks and the like, but who knows?

EDITED TO ADD: There was already one attack.

Udemy Targets ‘Pirate’ Site Giving Away its Paid Courses For Free

Post Syndicated from Andy original https://torrentfreak.com/udemy-targets-pirate-site-giving-away-its-paid-courses-for-free-180129/

While there’s no shortage of people who advocate free sharing of movies and music, passions are often raised when it comes to the availability of educational information.

Significant numbers of people believe that learning should be open to all and that texts and associated materials shouldn’t be locked away by copyright holders trying to monetize knowledge. Of course, people who make a living creating learning materials see the position rather differently.

A clash of these ideals is brewing in the United States where online learning platform Udemy has been trying to have some of its courses taken down from FreeTutorials.us, a site that makes available premium tutorials and other learning materials for free.

Early December 2017, counsel acting for Udemy and a number of its individual and corporate instructors (Maximilian Schwarzmüller, Academind GmbH, Peter Dalmaris, Futureshock Enterprises, Jose Marcial Portilla, and Pierian Data) wrote to FreeTutorials.us with DMCA takedown notice.

“Pursuant to 17 U.S.C. § 512(c)(3)(A) of the Digital Millennium Copyright Act (‘DMCA’), this communication serves as a notice of infringement and request for removal of certain web content available on freetutorials.us,” the letter reads.

“I hereby request that you remove or disable access to the material listed in Exhibit A in as expedient a fashion as possible. This communication does not constitute a waiver of any right to recover damages incurred by virtue of any such unauthorized activities, and such rights as well as claims for other relief are expressly retained.”

A small sample of Exhibit A

On January 10, 2018, the same law firm wrote to Cloudflare, which provides services to FreeTutorials. The DMCA notice asked Cloudflare to disable access to the same set of infringing content listed above.

It seems likely that whatever happened next wasn’t to Udemy’s satisfaction. On January 16, an attorney from the same law firm filed a DMCA subpoena at a district court in California. A DMCA subpoena can enable a copyright holder to obtain the identity of an alleged infringer without having to file a lawsuit and without needing a signature from a judge.

The subpoena was directed at Cloudflare, which provides services to FreeTutorials. The company was ordered to hand over “all identifying information identifying the owner, operator and/or contact person(s) associated with the domain www.freetutorials.us, including but not limited to name(s), address(es), telephone number(s), email address(es), Internet protocol connection records, administrative records and billing records from the time the account was established to the present.”

On January 26, the date by which Cloudflare was ordered to hand over the information, Cloudflare wrote to FreeTutorials with a somewhat late-in-the-day notification.

“We received the attached subpoena regarding freetutorials.us, a domain managed through your Cloudflare account. The subpoena requires us to provide information in our systems related to this website,” the company wrote.

“We have determined that this is a valid subpoena, and we are required to provide the requested information. In accordance with our Privacy Policy, we are informing you before we provide any of the requested subscriber information. We plan to turn over documents in response to the subpoena on January 26th, 2018, unless you intervene in the case.”

With that deadline passing last Friday, it’s safe to say that Cloudflare has complied with the subpoena as the law requires. However, TorrentFreak spoke with FreeTutorials who told us that the company doesn’t hold anything useful on them.

“No, they have nothing,” the team explained.

Noting that they’ll soon dispense with the services of Cloudflare, the team confirmed that they had received emails from Udemy and its instructors but hadn’t done a lot in response.

“How about a ‘NO’? was our answer to all the DMCA takedown requests from Udemy and its Instructors,” they added.

FreeTutorials (FTU) are affiliated with FreeCoursesOnline (FCO) and seem passionate about what they do. In common with others who distribute learning materials online, they express a belief in free education for all, irrespective of financial resources.

“We, FTU and FCO, are a group of seven members assorted as a team from different countries and cities. We are JN, SRZ aka SunRiseZone, Letap, Lihua Google Drive, Kaya, Zinnia, Faiz MeemBazooka,” a spokesperson revealed.

“We’re all members and colleagues and we also have our own daily work and business stuff to do. We have been through that phase of life when we didn’t have enough money to buy books and get tuition or even apply for a good course that we always wanted to have, so FTU & FCO are just our vision to provide Free Education For Everyone.

“We would love to change our priorities towards our current and future projects, only if we manage to get some faithful FTU’ers to join in and help us to grow together and make FTU a place it should be.”

TorrentFreak requested comment from Udemy but at the time of publication, we were yet to hear back. However, we did manage to get in touch with Jonathan Levi, an Udemy instructor who sent this takedown notice to the site in October 2017:

“I’m writing to you on behalf of SuperHuman Enterprises, LLC. You are in violation of our copyright, using our images, and linking to pirated copies of our courses. Remove them IMMEDIATELY or face severe legal action….You have 48 hours to comply,” he wrote, adding:

“And in case you’re going to say I don’t have evidence that I own the files, it’s my fucking face in the videos.”

Levi says that the site had been non-responsive so now things are being taken to the next level.

“They don’t reply to takedowns, so we’ve joined a class action lawsuit against FTU lead by Udemy and a law firm specializing in this type of thing,” Levi concludes.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN discounts, offers and coupons

Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift

Post Syndicated from Thiyagarajan Arumugam original https://aws.amazon.com/blogs/big-data/top-8-best-practices-for-high-performance-etl-processing-using-amazon-redshift/

An ETL (Extract, Transform, Load) process enables you to load data from source systems into your data warehouse. This is typically executed as a batch or near-real-time ingest process to keep the data warehouse current and provide up-to-date analytical data to end users.

Amazon Redshift is a fast, petabyte-scale data warehouse that enables you easily to make data-driven decisions. With Amazon Redshift, you can get insights into your big data in a cost-effective fashion using standard SQL. You can set up any type of data model, from star and snowflake schemas, to simple de-normalized tables for running any analytical queries.

To operate a robust ETL platform and deliver data to Amazon Redshift in a timely manner, design your ETL processes to take account of Amazon Redshift’s architecture. When migrating from a legacy data warehouse to Amazon Redshift, it is tempting to adopt a lift-and-shift approach, but this can result in performance and scale issues long term. This post guides you through the following best practices for ensuring optimal, consistent runtimes for your ETL processes:

  • COPY data from multiple, evenly sized files.
  • Use workload management to improve ETL runtimes.
  • Perform table maintenance regularly.
  • Perform multiple steps in a single transaction.
  • Loading data in bulk.
  • Use UNLOAD to extract large result sets.
  • Use Amazon Redshift Spectrum for ad hoc ETL processing.
  • Monitor daily ETL health using diagnostic queries.

1. COPY data from multiple, evenly sized files

Amazon Redshift is an MPP (massively parallel processing) database, where all the compute nodes divide and parallelize the work of ingesting data. Each node is further subdivided into slices, with each slice having one or more dedicated cores, equally dividing the processing capacity. The number of slices per node depends on the node type of the cluster. For example, each DS2.XLARGE compute node has two slices, whereas each DS2.8XLARGE compute node has 16 slices.

When you load data into Amazon Redshift, you should aim to have each slice do an equal amount of work. When you load the data from a single large file or from files split into uneven sizes, some slices do more work than others. As a result, the process runs only as fast as the slowest, or most heavily loaded, slice. In the example shown below, a single large file is loaded into a two-node cluster, resulting in only one of the nodes, “Compute-0”, performing all the data ingestion:

When splitting your data files, ensure that they are of approximately equal size – between 1 MB and 1 GB after compression. The number of files should be a multiple of the number of slices in your cluster. Also, I strongly recommend that you individually compress the load files using gzip, lzop, or bzip2 to efficiently load large datasets.

When loading multiple files into a single table, use a single COPY command for the table, rather than multiple COPY commands. Amazon Redshift automatically parallelizes the data ingestion. Using a single COPY command to bulk load data into a table ensures optimal use of cluster resources, and quickest possible throughput.

2. Use workload management to improve ETL runtimes

Use Amazon Redshift’s workload management (WLM) to define multiple queues dedicated to different workloads (for example, ETL versus reporting) and to manage the runtimes of queries. As you migrate more workloads into Amazon Redshift, your ETL runtimes can become inconsistent if WLM is not appropriately set up.

I recommend limiting the overall concurrency of WLM across all queues to around 15 or less. This WLM guide helps you organize and monitor the different queues for your Amazon Redshift cluster.

When managing different workloads on your Amazon Redshift cluster, consider the following for the queue setup:

  • Create a queue dedicated to your ETL processes. Configure this queue with a small number of slots (5 or fewer). Amazon Redshift is designed for analytics queries, rather than transaction processing. The cost of COMMIT is relatively high, and excessive use of COMMIT can result in queries waiting for access to the commit queue. Because ETL is a commit-intensive process, having a separate queue with a small number of slots helps mitigate this issue.
  • Claim extra memory available in a queue. When executing an ETL query, you can take advantage of the wlm_query_slot_count to claim the extra memory available in a particular queue. For example, a typical ETL process might involve COPYing raw data into a staging table so that downstream ETL jobs can run transformations that calculate daily, weekly, and monthly aggregates. To speed up the COPY process (so that the downstream tasks can start in parallel sooner), the wlm_query_slot_count can be increased for this step.
  • Create a separate queue for reporting queries. Configure query monitoring rules on this queue to further manage long-running and expensive queries.
  • Take advantage of the dynamic memory parameters. They swap the memory from your ETL to your reporting queue after the ETL job has completed.

3. Perform table maintenance regularly

Amazon Redshift is a columnar database, which enables fast transformations for aggregating data. Performing regular table maintenance ensures that transformation ETLs are predictable and performant. To get the best performance from your Amazon Redshift database, you must ensure that database tables regularly are VACUUMed and ANALYZEd. The Analyze & Vacuum schema utility helps you automate the table maintenance task and have VACUUM & ANALYZE executed in a regular fashion.

  • Use VACUUM to sort tables and remove deleted blocks

During a typical ETL refresh process, tables receive new incoming records using COPY, and unneeded data (cold data) is removed using DELETE. New rows are added to the unsorted region in a table. Deleted rows are simply marked for deletion.

DELETE does not automatically reclaim the space occupied by the deleted rows. Adding and removing large numbers of rows can therefore cause the unsorted region and the number of deleted blocks to grow. This can degrade the performance of queries executed against these tables.

After an ETL process completes, perform VACUUM to ensure that user queries execute in a consistent manner. The complete list of tables that need VACUUMing can be found using the Amazon Redshift Util’s table_info script.

Use the following approaches to ensure that VACCUM is completed in a timely manner:

  • Use wlm_query_slot_count to claim all the memory allocated in the ETL WLM queue during the VACUUM process.
  • DROP or TRUNCATE intermediate or staging tables, thereby eliminating the need to VACUUM them.
  • If your table has a compound sort key with only one sort column, try to load your data in sort key order. This helps reduce or eliminate the need to VACUUM the table.
  • Consider using time series This helps reduce the amount of data you need to VACUUM.
  • Use ANALYZE to update database statistics

Amazon Redshift uses a cost-based query planner and optimizer using statistics about tables to make good decisions about the query plan for the SQL statements. Regular statistics collection after the ETL completion ensures that user queries run fast, and that daily ETL processes are performant. The Amazon Redshift utility table_info script provides insights into the freshness of the statistics. Keeping the statistics off (pct_stats_off) less than 20% ensures effective query plans for the SQL queries.

4. Perform multiple steps in a single transaction

ETL transformation logic often spans multiple steps. Because commits in Amazon Redshift are expensive, if each ETL step performs a commit, multiple concurrent ETL processes can take a long time to execute.

To minimize the number of commits in a process, the steps in an ETL script should be surrounded by a BEGIN…END statement so that a single commit is performed only after all the transformation logic has been executed. For example, here is an example multi-step ETL script that performs one commit at the end:

Begin
CREATE temporary staging_table;
INSERT INTO staging_table SELECT .. FROM source (transformation logic);
DELETE FROM daily_table WHERE dataset_date =?;
INSERT INTO daily_table SELECT .. FROM staging_table (daily aggregate);
DELETE FROM weekly_table WHERE weekending_date=?;
INSERT INTO weekly_table SELECT .. FROM staging_table(weekly aggregate);
Commit

5. Loading data in bulk

Amazon Redshift is designed to store and query petabyte-scale datasets. Using Amazon S3 you can stage and accumulate data from multiple source systems before executing a bulk COPY operation. The following methods allow efficient and fast transfer of these bulk datasets into Amazon Redshift:

  • Use a manifest file to ingest large datasets that span multiple files. The manifest file is a JSON file that lists all the files to be loaded into Amazon Redshift. Using a manifest file ensures that Amazon Redshift has a consistent view of the data to be loaded from S3, while also ensuring that duplicate files do not result in the same data being loaded more than one time.
  • Use temporary staging tables to hold the data for transformation. These tables are automatically dropped after the ETL session is complete. Temporary tables can be created using the CREATE TEMPORARY TABLE syntax, or by issuing a SELECT … INTO #TEMP_TABLE query. Explicitly specifying the CREATE TEMPORARY TABLE statement allows you to control the DISTRIBUTION KEY, SORT KEY, and compression settings to further improve performance.
  • User ALTER table APPEND to swap data from the staging tables to the target table. Data in the source table is moved to matching columns in the target table. Column order doesn’t matter. After data is successfully appended to the target table, the source table is empty. ALTER TABLE APPEND is much faster than a similar CREATE TABLE AS or INSERT INTO operation because it doesn’t involve copying or moving data.

6. Use UNLOAD to extract large result sets

Fetching a large number of rows using SELECT is expensive and takes a long time. When a large amount of data is fetched from the Amazon Redshift cluster, the leader node has to hold the data temporarily until the fetches are complete. Further, data is streamed out sequentially, which results in longer elapsed time. As a result, the leader node can become hot, which not only affects the SELECT that is being executed, but also throttles resources for creating execution plans and managing the overall cluster resources. Here is an example of a large SELECT statement. Notice that the leader node is doing most of the work to stream out the rows:

Use UNLOAD to extract large results sets directly to S3. After it’s in S3, the data can be shared with multiple downstream systems. By default, UNLOAD writes data in parallel to multiple files according to the number of slices in the cluster. All the compute nodes participate to quickly offload the data into S3.

If you are extracting data for use with Amazon Redshift Spectrum, you should make use of the MAXFILESIZE parameter to and keep files are 150 MB. Similar to item 1 above, having many evenly sized files ensures that Redshift Spectrum can do the maximum amount of work in parallel.

7. Use Redshift Spectrum for ad hoc ETL processing

Events such as data backfill, promotional activity, and special calendar days can trigger additional data volumes that affect the data refresh times in your Amazon Redshift cluster. To help address these spikes in data volumes and throughput, I recommend staging data in S3. After data is organized in S3, Redshift Spectrum enables you to query it directly using standard SQL. In this way, you gain the benefits of additional capacity without having to resize your cluster.

For tips on getting started with and optimizing the use of Redshift Spectrum, see the previous post, 10 Best Practices for Amazon Redshift Spectrum.

8. Monitor daily ETL health using diagnostic queries

Monitoring the health of your ETL processes on a regular basis helps identify the early onset of performance issues before they have a significant impact on your cluster. The following monitoring scripts can be used to provide insights into the health of your ETL processes:

Script Use when… Solution
commit_stats.sql – Commit queue statistics from past days, showing largest queue length and queue time first DML statements such as INSERT/UPDATE/COPY/DELETE operations take several times longer to execute when multiple of these operations are in progress Set up separate WLM queues for the ETL process and limit the concurrency to < 5.
copy_performance.sql –  Copy command statistics for the past days Daily COPY operations take longer to execute • Follow the best practices for the COPY command.
• Analyze data growth with the incoming datasets and consider cluster resize to meet the expected SLA.
table_info.sql – Table skew and unsorted statistics along with storage and key information Transformation steps take longer to execute • Set up regular VACCUM jobs to address unsorted rows and claim the deleted blocks so that transformation SQL execute optimally.
• Consider a table redesign to avoid data skewness.
v_check_transaction_locks.sql – Monitor transaction locks INSERT/UPDATE/COPY/DELETE operations on particular tables do not respond back in timely manner, compared to when run after the ETL Multiple DML statements are operating on the same target table at the same moment from different transactions. Set up ETL job dependency so that they execute serially for the same target table.
v_get_schema_priv_by_user.sql – Get the schema that the user has access to Reporting users can view intermediate tables Set up separate database groups for reporting and ETL users, and grants access to objects using GRANT.
v_generate_tbl_ddl.sql – Get the table DDL You need to create an empty table with same structure as target table for data backfill Generate DDL using this script for data backfill.
v_space_used_per_tbl.sql – monitor space used by individual tables Amazon Redshift data warehouse space growth is trending upwards more than normal

Analyze the individual tables that are growing at higher rate than normal. Consider data archival using UNLOAD to S3 and Redshift Spectrum for later analysis.

Use unscanned_table_summary.sql to find unused table and archive or drop them.

top_queries.sql – Return the top 50 time consuming statements aggregated by its text ETL transformations are taking longer to execute Analyze the top transformation SQL and use EXPLAIN to find opportunities for tuning the query plan.

There are several other useful scripts available in the amazon-redshift-utils repository. The AWS Lambda Utility Runner runs a subset of these scripts on a scheduled basis, allowing you to automate much of monitoring of your ETL processes.

Example ETL process

The following ETL process reinforces some of the best practices discussed in this post. Consider the following four-step daily ETL workflow where data from an RDBMS source system is staged in S3 and then loaded into Amazon Redshift. Amazon Redshift is used to calculate daily, weekly, and monthly aggregations, which are then unloaded to S3, where they can be further processed and made available for end-user reporting using a number of different tools, including Redshift Spectrum and Amazon Athena.

Step 1:  Extract from the RDBMS source to a S3 bucket

In this ETL process, the data extract job fetches change data every 1 hour and it is staged into multiple hourly files. For example, the staged S3 folder looks like the following:

 [[email protected] ~]$ aws s3 ls s3://<<S3 Bucket>>/batch/2017/07/02/
2017-07-02 01:59:58   81900220 20170702T01.export.gz
2017-07-02 02:59:56   84926844 20170702T02.export.gz
2017-07-02 03:59:54   78990356 20170702T03.export.gz
…
2017-07-02 22:00:03   75966745 20170702T21.export.gz
2017-07-02 23:00:02   89199874 20170702T22.export.gz
2017-07-02 00:59:59   71161715 20170702T23.export.gz

Organizing the data into multiple, evenly sized files enables the COPY command to ingest this data using all available resources in the Amazon Redshift cluster. Further, the files are compressed (gzipped) to further reduce COPY times.

Step 2: Stage data to the Amazon Redshift table for cleansing

Ingesting the data can be accomplished using a JSON-based manifest file. Using the manifest file ensures that S3 eventual consistency issues can be eliminated and also provides an opportunity to dedupe any files if needed. A sample manifest20170702.json file looks like the following:

{
  "entries": [
    {"url":" s3://<<S3 Bucket>>/batch/2017/07/02/20170702T01.export.gz", "mandatory":true},
    {"url":" s3://<<S3 Bucket>>/batch/2017/07/02/20170702T02.export.gz", "mandatory":true},
    …
    {"url":" s3://<<S3 Bucket>>/batch/2017/07/02/20170702T23.export.gz", "mandatory":true}
  ]
}

The data can be ingested using the following command:

SET wlm_query_slot_count TO <<max available concurrency in the ETL queue>>;
COPY stage_tbl FROM 's3:// <<S3 Bucket>>/batch/manifest20170702.json' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' manifest;

Because the downstream ETL processes depend on this COPY command to complete, the wlm_query_slot_count is used to claim all the memory available to the queue. This helps the COPY command complete as quickly as possible.

Step 3: Transform data to create daily, weekly, and monthly datasets and load into target tables

Data is staged in the “stage_tbl” from where it can be transformed into the daily, weekly, and monthly aggregates and loaded into target tables. The following job illustrates a typical weekly process:

Begin
INSERT into ETL_LOG (..) values (..);
DELETE from weekly_tbl where dataset_week = <<current week>>;
INSERT into weekly_tbl (..)
  SELECT date_trunc('week', dataset_day) AS week_begin_dataset_date, SUM(C1) AS C1, SUM(C2) AS C2
	FROM   stage_tbl
GROUP BY date_trunc('week', dataset_day);
INSERT into AUDIT_LOG values (..);
COMMIT;
End;

As shown above, multiple steps are combined into one transaction to perform a single commit, reducing contention on the commit queue.

Step 4: Unload the daily dataset to populate the S3 data lake bucket

The transformed results are now unloaded into another S3 bucket, where they can be further processed and made available for end-user reporting using a number of different tools, including Redshift Spectrum and Amazon Athena.

unload ('SELECT * FROM weekly_tbl WHERE dataset_week = <<current week>>’) TO 's3:// <<S3 Bucket>>/datalake/weekly/20170526/' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';

Summary

Amazon Redshift lets you easily operate petabyte-scale data warehouses on the cloud. This post summarized the best practices for operating scalable ETL natively within Amazon Redshift. I demonstrated efficient ways to ingest and transform data, along with close monitoring. I also demonstrated the best practices being used in a typical sample ETL workload to transform the data into Amazon Redshift.

If you have questions or suggestions, please comment below.

 


About the Author

Thiyagarajan Arumugam is a Big Data Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Prior to AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.

 

Pirate Bay Founder’s Domain Service “Mocks” NY Times Legal Threats

Post Syndicated from Ernesto original https://torrentfreak.com/pirate-bay-founders-domain-service-mocks-ny-times-legal-threats-180125/

Back in the day, The Pirate Bay was famous for its amusing responses to legal threats. Instead of complying with takedown notices, it sent witty responses to embarrass the senders.

Today the notorious torrent site gives copyright holders the silent treatment, but the good-old Pirate Bay spirit still lives on elsewhere.

Earlier today the anonymous domain registration service Njalla, which happens to be a venture of TPB co-founder Peter Sunde, posted a series of noteworthy responses it sent to The New York Times’ (NYT) legal department.

The newspaper warned the registration service about one of its customers, paywallnews.com, which offers the news service’s content without permission. Since this is a violation of The Times’ copyrights, according to the paper, Njalla should take action or face legal consequences.

NYT: Accordingly, we hereby demand that you immediately provide us with contact information — including email addresses — for both the actual owner of the paywallnew.com website, and for the hosting provider on which the paywallnew.com website is located.

If we have not heard from you within three (3) business days of receipt of this letter, we will have no choice but to pursue all available legal remedies.

Njalla is no stranger to threats of this kind but were somewhat offended by the harsh language, it seems. The company, therefore, decided to inform the NYT that there are more friendly ways to reach out.

Njalla: Thanks for that lovely e-mail. It’s always good to communicate with people that in their first e-mail use words as “we demand”, “pursue all available legal remedies” and so forth. I’d like to start out with some free (as in no cost) advice: please update your boiler threat letters to actually try what most people try first: being nice. It’s not expensive (actually the opposite) and actually it works much better than your method (source: a few tens of thousands years of human development that would not have been as efficient with threats as it would have been with cooperation).

In addition, Njalla also included a request of its own. They kindly asked (no demand) the newspaper’s legal department for proof that they are who they say they are. You can never be too cautious, after all.

Njalla: Now, back to the questions you sent us. We’re not sure who you are, so in order to move further we’d like to see a copy of your ID card, as well as a notarised power of attorney showing that you are actually representing the people you’re claiming to do.

This had the desired effect, for Njalla at least. The NYT replied with an apology for the tough language that was used, noting that they usually deal with companies that employ people who are used to reading legal documents.

The newspaper did, however, submit a notarized letter signed by the company’s Executive Vice President, General Counsel and Secretary, and once again asked for details on the Njalla customer.

NYT: Once again, as I mention above, the referenced website is stealing large amounts of New York Times content. If you click on this link: http://www.paywallnews.com/sites/nytimes

As this abuse — aside from being an egregious infringement of The Times’s copyright — breaches your own Terms of Service, I hope you will be able to see your way to helping me to put a stop to this practice by providing me with the name and contact information for the owner of paywallnews.com and for the ISP on which it is hosted.

This is when things started to get really interesting. Founded by someone with an extensive background in “sharing,” Njalla clearly has a different definition of stealing than the NYT’s legal department.

The reply, which is worth reading in full along with the rest of the communication, makes this quite clear.

Njalla: Stealing content seem quite harsh of this website though, didn’t know that they did that! Is there anyway you can get the stolen items back though? You should either go to the police and request them to help you get the stolen items back. Or maybe talk to your insurance company, they might help to compensate you for the loss. But a helpful idea; if they’ve stolen something and then put copies of that on a website that you can freely access, I would suggest just copying it, so that both of you have the same things. That’s a great thing with the digital world, everyone can have copies of things. I am surprised they stole something when they could just have copied it. I’m guessing it’s some older individuals that don’t know the possibilities of modern day technology to make copies.

It’s obvious that the domain registration service makes a clear distinction between copying and stealing.

Piracy vs. Theft

In addition, Njalla contests that the site is problematic at all, noting that this might be a “cultural difference.”

Njalla spotted something even more worrying though. The NYT claims that the site in question violates its terms of service. Specifically, they reference the section that prohibits sites from spreading content that is illegal according to local law.

Is the NYT perhaps spreading illegal content itself, Njalla questions?

Njalla: Deborah, I was quite shocked and appalled that you referred to this part of our ToS. It made me actually not visit the website in question even though you’ve linked it now a few times. You’re admitting to spreading illegal content at your newspaper, for profit, is that correct?

We’re quite big proponents of freedom of speech, let me assure you of that, but we also have limits. If you spread illegal content, and our customers stole that illegal content and are now handing out free copies of that, that’s a huge issue for us. Since it would be illegal for us to get those copies if they’re illegal, I’m asking you what type of content it is?

As an attachment to the reply, Njalla also sent back a “notarized” letter of their own, by simply copying the NYT letter and sticking their own logo on it, to show how easily these can be fabricated.

TorrentFreak reached out to Sunde who informed us that they never heard from The New York Times after the last reply. As a domain registrant, Njalla is not obliged to comply with takedown requests, he explains.

“If they need help from us on copyright issues, they’re totally missing what we’re doing, and that they should look somewhere else anyhow. But I think most domain services gets tons of these threat emails, and a lot of them think they’re responsible because they don’t have access to legal help and just shut customers down.

“That’s what a lot of our customers say at least, since they migrated from a shitty service which doesn’t know their own business,” Sunde adds.

The NYT is not completely without options though. If they take the case to court in Sweden and win an injunction against paywallnews.com, Njalla will comply. The same is true if a customer really violates the terms of service.

Meanwhile, paywallnews.com remains online.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN discounts, offers and coupons