Tag Archives: santa

[$] Finding driver bugs with DR. CHECKER

Post Syndicated from jake original https://lwn.net/Articles/733056/rss

Drivers are a consistent source of kernel bugs, at least partly due to less
review, but also because drivers are typically harder for tools to
analyze. A team from the University of California, Santa Barbara has set
out to change that with a static-analysis tool called DR. CHECKER. In a paper [PDF] presented at the recent 26th USENIX Security Symposium, the team introduced the tool and the results of running it on nine production Linux kernels. Those results were rather encouraging: it “correctly identified 158 critical zero-day bugs with an overall precision of 78%”.

Hard Drive Stats for Q2 2017

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/hard-drive-failure-stats-q2-2017/

Backblaze Drive Stats Q2 2017

In this update, we’ll review the Q2 2017 and lifetime hard drive failure rates for all our current drive models. We’ll also look at how our drive migration strategy is changing the drives we use, and we’ll check in on our enterprise-class drives to see how they are doing. Along the way we’ll share our observations and insights, and as always we welcome your comments and critiques.

Since our last report for Q1 2017, we have added 635 additional hard drives to bring us to the 83,151 drives we’ll focus on. In Q1 we added over 10,000 new drives to the mix, so adding just 635 in Q2 seems “odd.” In fact, we added 4,921 new drives and retired 4,286 old drives as we migrated from lower density drives to higher density drives. We cover more about migrations later on, but first let’s look at the Q2 quarterly stats.

Hard Drive Stats for Q2 2017

We’ll begin our review by looking at the statistics for the period of April 1, 2017 through June 30, 2017 (Q2 2017). This table includes 17 different 3 ½” drive models that were operational during the indicated period, ranging in size from 3 to 8 TB.

Quarterly Hard Drive Failure Rates for Q2 2017

When looking at the quarterly numbers, remember to look for those drives with at least 50,000 drive days for the quarter. That works out to about 550 drives running the entire quarter (550 drives × ~91 days ≈ 50,000 drive days), which is a good sample size. If the sample size is below that, the failure rates can be skewed by a small change in the number of drive failures.

As noted previously, we use the quarterly numbers to look for trends. So this time we’ve included a trend indicator in the table. The “Q2Q Trend” column is short for quarter-to-quarter trend, i.e. last quarter to this quarter. We can add, change, or delete trend columns depending on community interest. Let us know what you think in the comments.

Good Migrations

In Q2 we continued with our data migration program. For us, a drive migration means we intentionally remove a good drive from service and replace it with another drive. Drives that are removed via migrations are not counted as failed. Once they are removed they stop accumulating drive hours and other stats in our system.

There are three primary drivers for our migration program.

  1. Increase Storage Density – For example, in Q3 we replaced 3 TB drives with 8 TB drives, more than doubling the amount of storage in a given Storage Pod for the same footprint. The cost of electricity was nominally more with the 8 TB drives, but the increase in density more than offset the additional cost. For those interested you can read more about the cost of cloud storage here.
  2. Backblaze Vaults – Our Vault architecture has proven to be more cost effective over the past two years than using stand-alone Storage Pods. A major goal of the migration program is to have the entire Backblaze cloud deployed on the highly efficient and resilient Backblaze Vault architecture.
  3. Balancing the Load – With our Phoenix data center online and accepting data, we have migrated some systems to the Phoenix DC. Don’t worry, we didn’t put your data on a truck and drive it to Phoenix. We simply built new systems there and transferred the data from our Northern California DC. In the process, we are gaining valuable insights as we move towards being able to replicate data between the two data centers.
During Q2 we migrated nearly 30 Petabytes of data.

During Q2 we migrated the data on 155 systems, giving nearly 30 petabytes of data a new, more durable, place to call home. There are still 644 individual Storage Pods (Storage Pod Classics, as we call them) left to migrate to the Backblaze Vault architecture.

Just in case you don’t know, a Backblaze Vault is a logical collection of 20 beefy Storage Pods (not Classics). Using our own Reed-Solomon erasure coding library, data is spread out across the 20 Pods into 17 data shards and 3 parity shards. The data and parity shards of each arriving data blob can be stored on different Storage Pods in a given Backblaze Vault.
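To put rough numbers on that 17+3 layout (a back-of-the-envelope sketch based only on the shard counts above; real durability also depends on drive failure and rebuild rates, which aren’t covered here):

% Raw-storage overhead and fault tolerance of a 17-data / 3-parity shard layout
\text{overhead} = \frac{17 + 3}{17} \approx 1.18, \qquad \text{tolerated shard losses} = 20 - 17 = 3

In other words, about 18% of extra raw storage buys the ability to lose any 3 of a blob’s 20 shards without losing data.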

Lifetime Hard Drive Failure Rates for Current Drives

The table below shows the failure rates for the hard drive models we had in service as of June 30, 2017. This is over the period beginning in April 2013 and ending June 30, 2017. If you are interested in the hard drive failure rates for all the hard drives we’ve used over the years, please refer to our 2016 hard drive review.

Cumulative Hard Drive Failure Rates

Enterprise vs Consumer Drives

We added 3,595 enterprise class 8 TB drives in Q2 bringing our total to 6,054 drives. You may be tempted to compare the failure rates of the 8 TB enterprise drive (model: ST8000NM005) to the consumer 8 TB drive (model: ST8000DM002), and conclude the enterprise drives fail at a higher rate. Let’s not jump to that conclusion yet, as the average operational age of the enterprise drives is only 2.11 months.

There are some insights we can gain from the current data. The enterprise drives have 363,282 drive hours and an annualized failure rate of 1.61%. If we look back at our data, we find that as of Q3 2016, the 8 TB consumer drives had 422,263 drive hours with an annualized failure rate of 1.60%. That means that when both drive models had a similar number of drive hours, they had nearly the same annualized failure rate. There are no conclusions to be made here, but the observation is worth considering as we gather data for our comparison.

Next quarter, we should have enough data to compare the 8 TB drives, but by then the 8TB drives could be “antiques.” In the next week or so, we’ll be installing 12 TB hard drives in a Backblaze Vault. Each 60-drive Storage Pod in the Vault would have 720 TB of storage available and a 20-pod Backblaze Vault would have 14.4 petabytes of raw storage.

Better Late Than Never

Sorry for being a bit late with the hard drive stats report this quarter. We were ready to go last week, then this happened. Some folks here thought that was more important than our Q2 Hard Drive Stats. Go figure.

Drive Stats at the Storage Developers Conference

We will be presenting at the Storage Developers Conference in Santa Clara on Monday September 11th at 8:30am. We’ll be reviewing our drive stats along with some interesting observations from the SMART stats we also collect. The conference is the leading event for technical discussions and education on the latest storage technologies and standards. Come join us.

The Data For This Review

If you are interested in the data from the two tables in this review, you can download an Excel spreadsheet containing the two tables. Note: the domain for this download will be f001.backblazeb2.com.

You also can download the entire data set we use for these reports from our Hard Drive Test Data page. You can download and use this data for free for your own purposes. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone. It is free.
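If you want to recompute annualized failure rates yourself from that data set, here is a minimal SQL sketch. It assumes the daily CSV files have been loaded into a table (called drive_stats here purely for illustration) with the date, model, and failure columns the files contain, and that the annualized rate is failures divided by drive-years (drive days / 365); treat it as an approximation rather than our exact production method.

-- Each row of the daily data is one drive-day; failure = 1 on the day a drive fails.
SELECT
    model,
    COUNT(*)                                  AS drive_days,
    SUM(failure)                              AS failures,
    100.0 * SUM(failure) / (COUNT(*) / 365.0) AS annualized_failure_rate_pct
FROM drive_stats
WHERE date BETWEEN '2017-04-01' AND '2017-06-30'  -- Q2 2017
GROUP BY model
HAVING COUNT(*) > 50000                           -- skip models with too few drive days
ORDER BY annualized_failure_rate_pct DESC;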

Good luck, and let us know if you find anything interesting.

The post Hard Drive Stats for Q2 2017 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Affordable Raspberry Pi 3D Body Scanner

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/affordable-raspberry-pi-3d-body-scanner/

With a £1000 grant from Santander, Poppy Mosbacher set out to build a full-body 3D body scanner with the intention of creating an affordable setup for makespaces and similar community groups.

First Scan from DIY Raspberry Pi Scanner

Head and Shoulders Scan with 29 Raspberry Pi Cameras

Uses for full-body 3D scanning

Poppy herself wanted to use the scanner in her work as a fashion designer. With the help of 3D scans of her models, she would be able to create custom cardboard dressmaker’s dummies to ensure her designs fit perfectly. This is a brilliant way of incorporating digital tech into another industry – and it’s not the only application for this sort of build. Growing numbers of businesses use 3D body scanning, for example, the stores around the world where customers can 3D scan and print themselves as action-figure-sized replicas.

Print your own family right on the high street!
image c/o Tom’s Guide and Shapify

We’ve also seen the same technology used in video games for more immersive virtual reality. Moreover, there are various uses for it in healthcare and fitness, such as monitoring the effect of exercise regimes or physiotherapy on body shape or posture.

Within a makespace environment, a 3D body scanner opens the door to including new groups of people in community make projects: imagine 3D printing miniatures of a theatrical cast to allow more realistic blocking of stage productions and better set design, or annually sending grandparents a print of their grandchild so they can compare the child’s year-on-year growth in a hands-on way.

Raspberry Pi 3d Body Scan

The Germany-based clothing business Outfittery uses full-body scanners to take the stress out of finding clothes that fit well.
image c/o Outfittery

As cheesy as it sounds, the only limit for the use of 3D scanning is your imagination…and maybe storage space for miniature prints.

Poppy’s Raspberry Pi 3D Body Scanner

For her build, Poppy acquired 27 Raspberry Pi Zeros and 27 Raspberry Pi Camera Modules. With various other components, some 3D-printed or made of cardboard, Poppy got to work. She was helped by members of Build Brighton and by her friend Arthur Guy, who also wrote the code for the scanner.

Raspberry Pi 3D Body Scanner

The Pi Zeros run Raspbian Lite, and are connected to a main server running a node application. Each is fitted into its own laser-cut cardboard case, and secured to a structure of cardboard tubing and 3D-printed connectors.

Raspberry Pi 3D Body Scanner

In the finished build, the person to be scanned stands in the centre of the structure, and the press of a button sends the signal for all Pis to take a photo. The images are sent back to the server and processed through Autodesk ReMake, freemium software available for PC (Poppy discovered partway through the project that the Mac version had recently lost support).

Build your own

Obviously there’s a lot more to the process of building this full-body 3D scanner than what I’ve reported in these few paragraphs. And since it was Poppy’s goal to make a readily available and affordable scanner that anyone can recreate, she’s provided all the instructions and code for it on her Instructables page.

Projects like this, in which people use the Raspberry Pi to create affordable and interesting tech for communities, are exactly the type of thing we love to see. Always make sure to share your Pi-based projects with us on social media, so we can boost their visibility!

If you’re a member of a makespace, run a workshop in a school or club, or simply love to tinker and create, this build could be the perfect addition to your workshop. And if you recreate Poppy’s scanner, or build something similar, we’d love to see the results in the comments below.

The post Affordable Raspberry Pi 3D Body Scanner appeared first on Raspberry Pi.

Lawyer Says He Was Deceived Into BitTorrent Copyright Trolling Scheme

Post Syndicated from Andy original https://torrentfreak.com/lawyer-says-he-was-deceived-into-bittorrent-copyright-trolling-scheme-170807/

For more than a decade, companies around the world have been trying to turn piracy into profit. For many this has meant the development of “copyright trolling” schemes, in which alleged pirates are monitored online and then pressured into cash settlements.

The shadowy nature of this global business means that its true scale will never be known but due to the controversial activities of some of the larger players, it’s occasionally possible to take a peek inside their operations. One such opportunity has just raised its head.

According to a lawsuit filed in California, James Davis is an attorney licensed in Oregon and California. Until two years ago, he was largely focused on immigration law. However, during March 2015, Davis says he was approached by an old classmate with an opportunity to get involved in a new line of business.

That classmate was Oregon lawyer Carl Crowell, who over the past several years has been deeply involved in copyright-trolling cases, including a deluge of Dallas Buyers Club and London Has Fallen litigation. He envisioned a place for Davis in the business.

Davis seemed to find the proposals attractive and became seriously involved in the operation, filing 58 cases on behalf of the companies involved. In common with similar cases, the lawsuits were brought in the name of the entities behind each copyrighted work, such as Dallas Buyers Club, LLC and LHF Productions, Inc.

In time, however, things started to go wrong. Davis claims that he discovered that Crowell, in connection with and on behalf of the other named defendants, “misrepresented the true nature of the Copyright Litigation Campaign, including the ownership of the works at issue and the role of the various third-parties involved in the litigation.”

Davis says that Crowell and the other defendants (which include the infamous Germany-based troll outfit Guardaley) made false representations to secure his participation, while holding back other information that might have made him think twice about becoming involved.

“Crowell and other Defendants withheld numerous material facts that were known to Crowell and the knowledge of which would have cast doubt on the value and ethical propriety of the Copyright Litigation Campaign for Mr. Davis,” the lawsuit reads.

Davis goes on to allege serious misconduct, including that representations regarding ownership of various entities were false and used to deceive him into participating in the scheme.

As time went on, Davis said he had increasing doubts about the operation. Then, in August 2016 as a result of a case underway in California, he began asking questions which resulted in him uncovering additional facts. These undermined both the representations of the people he was working for and his own belief in the “value and ethical propriety of the Copyright Litigation Campaign,” the lawsuit claims.

Davis said this spurred him on to “aggressively seek further information” from Crowell and other people involved in the scheme, including details of its structure and underlying support. He says all he received were “limited responses, excuses, and delays.”

The case was later dismissed by mutual agreement of the parties involved but of course, Davis’ concerns about the underlying case didn’t come to the forefront until the filing of his suit against Crowell and the others.

Davis says that following a meeting in Santa Monica with several of the main players behind the litigation campaign, he decided its legal and factual basis were unsound. He later told Crowell and Guardaley that he was withdrawing from their project.

As the result of the misrepresentations made to him, Davis is now suing the defendants on a number of counts, detailed below.

“Defendants’ business practices are unfair, unlawful, and fraudulent. Davis has suffered monetary damage as a direct result of the unfair, unlawful, and fraudulent business practices set forth herein,” the lawsuit reads.

Requesting a trial by jury, Davis is seeking actual damages, statutory damages, punitive or treble damages “in the amount of no less than $300,000.”

While a payment of that not insignificant amount would clearly satisfy Davis, the prospect of a trial in which the Guardaley operation is laid bare would be preferable when the interests of its thousands of previous targets are considered.

Only time will tell how things will pan out but like the vast majority of troll cases, this one too seems destined to be settled in private, to ensure the settlement machine keeps going.

Note: The case was originally filed in June, only to be voluntarily dismissed. It has now been refiled in state court.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Event: AWS Serverless Roadshow – Hands-on Workshops

Post Syndicated from Tara Walker original https://aws.amazon.com/blogs/aws/event-aws-serverless-roadshow-hands-on-workshops/

Surely, some of you have contemplated how you would survive the possible Zombie apocalypse or how you would build your exciting new startup to disrupt the transportation industry when Unicorn haven is uncovered. Well, there is no need to worry; I know just the thing to get you prepared to handle both of those scenarios: the AWS Serverless Computing Workshop Roadshow.

With the roadshow’s serverless workshops, you can get hands-on experience building serverless applications and microservices so you can rebuild what remains of our great civilization after a widespread viral infection causes human corpses to reanimate around the world in the AWS Zombie Microservices Workshop. In addition, you can give your startup a jump on the competition with the Wild Rydes workshop and revolutionize the transportation industry; just in time for a pilot’s crash landing that leads to the discovery of abundant Unicorn pastures on the outskirts of Themyscira, the island of female Amazonian warriors, also known as Paradise Island.

These free, guided hands-on workshops will introduce the basics of building serverless applications and microservices for common and uncommon scenarios using services like AWS Lambda, Amazon API Gateway, Amazon DynamoDB, Amazon S3, Amazon Kinesis, AWS Step Functions, and more. Let me share some advice before you decide to tackle Zombies and mount Unicorns – don’t forget to bring your laptop to the workshop and make sure you have an AWS account established and available for use for the event.

Check out the schedule below and get prepared today by registering for an upcoming workshop in a city near you. Remember, these workshops are completely free, so participation is on a first-come, first-served basis. Register and get there early; we need Zombie hunters and Unicorn riders across the globe. Learn more about AWS Serverless Computing Workshops here and register for your city using the links below.

Event               Location       Date
Wild Rydes          New York       Thursday, June 8
Wild Rydes          Austin         Thursday, June 22
Wild Rydes          Santa Monica   Thursday, July 20
Zombie Apocalypse   Chicago        Thursday, July 20
Wild Rydes          Atlanta        Tuesday, September 12
Zombie Apocalypse   Dallas         Tuesday, September 19

 

I look forward to fighting zombies and riding unicorns with you all.

Tara

Hard Drive Stats for Q1 2017

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/hard-drive-failure-rates-q1-2017/

2017 hard drive stats

In this update, we’ll review the Q1 2017 and lifetime hard drive failure rates for all our current drive models, and we’ll look at a relatively new class of drives for us – “enterprise”. We’ll share our observations and insights, and as always, you can download the hard drive statistics data we use to create these reports.

Our Hard Drive Data Set

Backblaze has now recorded and saved daily hard drive statistics from the drives in our data centers for over 4 years. This data includes the SMART attributes reported by each drive, along with related information such as the drive serial number and failure status. As of March 31, 2017 we had 84,469 operational hard drives. Of those, 1,800 were boot drives and 82,669 were data drives. For our review, we remove drive models of which we have fewer than 45 drives, leaving us 82,516 hard drives to analyze for this report. There are currently 17 different hard drive models, ranging in size from 3 to 8 TB. All of these models are 3½” drives.

Hard Drive Reliability Statistics for Q1 2017

Since our last report in Q4 2016, we have added 10,577 additional hard drives to bring us to the 82,516 drives we’ll focus on. We’ll start by looking at the statistics for the period of January 1, 2017 through March 31, 2017 – Q1 2017. This is for the drives that were operational during that period, ranging in size from 3 to 8 TB as listed below.

hard drive failure rates by model

Observations and Notes on the Q1 Review

You’ll notice that some of the drive models have a failure rate of “0” (zero). Here a failure rate of zero means there were no drive failures for that model during Q1 2017. Later, we will cover how these same drive models fared over their lifetime. Why is the quarterly data important? We use it to look for anything unusual. For example, in Q1 the 4 TB Seagate drive model ST4000DX000 has a high failure rate of 35.88%, while the lifetime annualized failure rate for this model is much lower, 7.50%. In this case, we only have 170 drives of this particular model, so the failure rate is not statistically significant, but such information could be useful if we were using several thousand drives of this particular model.

There were a total 375 drive failures in Q1. A drive is considered failed if one or more of the following conditions are met:

  • The drive will not spin up or connect to the OS.
  • The drive will not sync, or stay synced, in a RAID Array (see note below).
  • The Smart Stats we use show values above our thresholds.
  • Note: Our stand-alone Storage Pods use RAID-6, our Backblaze Vaults use our own open-sourced implementation of Reed-Solomon erasure coding instead. Both techniques have a concept of a drive not syncing or staying synced with the other member drives in its group.

The annualized hard drive failure rate for Q1 in our current population of drives is 2.11%. That’s a bit higher than previous quarters, but might be a function of us adding 10,577 new drives to our count in Q1. We’ve found that there is a slightly higher rate of drive failures early on, before the drives “get comfortable” in their new surroundings. This is seen in the drive failure rate “bathtub curve” we covered in a previous post.

10,577 More Drives

The additional 10,577 drives are really a combination of 11,002 added drives, less 425 drives that were removed. The removed drives were in addition to the 375 drives marked as failed, as those were replaced 1 for 1. The 425 drives were primarily removed from service due to migrations to higher density drives.

The table below shows the breakdown of the drives added in Q1 2017 by drive size.

drive counts by size

Lifetime Hard Drive Failure Rates for Current Drives

The table below shows the failure rates for the hard drive models we had in service as of March 31, 2017. This is over the period beginning in April 2013 and ending March 31, 2017. If you are interested in the hard drive failure rates for all the hard drives we’ve used over the years, please refer to our 2016 hard drive review.

lifetime hard drive reliability rates

The annualized failure rate for the drive models listed above is 2.07%. This compares to 2.05% for the same collection of drive models as of the end of Q4 2016. The increase makes sense given the increase in Q1 2017 failure rate over previous quarters noted earlier. No new models were added during the current quarter and no old models exited the collection.

Backblaze is Using Enterprise Drives – Oh My!

Some of you may have noticed we now have a significant number of enterprise drives in our data center, namely 2,459 Seagate 8 TB drives, model: ST8000NM055. The HGST 8 TB drives were the first true enterprise drives we used as data drives in our data centers, but we only have 45 of them. So, why did we suddenly decide to purchase 2,400+ of the Seagate 8 TB enterprise drives? There was a very short period of time, as Seagate was introducing new and phasing out old drive models, that the cost per terabyte of the 8 TB enterprise drives fell within our budget. Previously we had purchased 60 of these drives to test in one Storage Pod and were satisfied they could work in our environment. When the opportunity arose to acquire the enterprise drives at a price we liked, we couldn’t resist.

Here’s a comparison of the 8 TB consumer drives versus the 8 TB enterprise drives to date:

enterprise vs. consumer hard drives

What have we learned so far…

  1. It is too early to compare failure rates – The oldest enterprise drives have only been in service for about 2 months, with most being placed into service just prior to the end of Q1. The Backblaze Vaults the enterprise drives reside in have yet to fill up with data. We’ll need at least 6 months before we could start comparing failure rates as the data is still too volatile. For example, if the current enterprise drives were to experience just 2 failures in Q2, their annualized failure rate would be about 0.57% lifetime.
  2. The enterprise drives load data faster – The Backblaze Vaults containing the enterprise drives loaded data faster than the Backblaze Vaults containing consumer drives. The vaults with the enterprise drives loaded on average 140 TB per day, while the vaults with the consumer drives loaded on average 100 TB per day.
  3. The enterprise drives use more power – No surprise here, as according to the Seagate specifications the enterprise drives use 9W average at idle and 10W average in operation, while the consumer drives use 7.2W average at idle and 9W average in operation. For a single drive this may seem insignificant, but when you put 60 drives in a 4U Storage Pod chassis and then 10 chassis in a rack, the difference adds up quickly (see the quick arithmetic after this list).
  4. Enterprise drives have some nice features – The Seagate enterprise 8TB drives we used have PowerChoice™ technology that gives us the option to use less power. The data loading times noted above were recorded after we changed to a lower power mode. In short, the enterprise drive in a low power mode still stored 40% more data per day on average than the consumer drives.
  5. While it is great that the enterprise drives can load data faster, drive speed has never been a bottleneck in our system. A system that can load data faster will just “get in line” more often and fill up faster. There is always extra capacity when it comes to accepting data from customers.
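To put rough numbers on item 3, using only the wattage figures quoted above and a rack of 10 chassis × 60 drives = 600 drives:

% Extra rack-level power draw of enterprise drives vs. consumer drives
\Delta P_{\text{operating}} = 600 \times (10\,\mathrm{W} - 9\,\mathrm{W}) = 600\,\mathrm{W}, \qquad \Delta P_{\text{idle}} = 600 \times (9\,\mathrm{W} - 7.2\,\mathrm{W}) = 1080\,\mathrm{W}

So a full rack of enterprise drives draws very roughly an extra 0.6–1.1 kW compared to the same rack of consumer drives.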

Wrapping Up

We’ll continue to monitor the 8 TB enterprise drives and keep reporting our findings.

If you’d like to hear more about our Hard Drive Stats, Backblaze will be presenting at the 33rd International Conference on Massive Storage Systems and Technology (MSST 2017) being held at Santa Clara University in Santa Clara, California from May 15th – 19th. The conference will dedicate five days to computer-storage technology, including a day of tutorials, two days of invited papers, two days of peer-reviewed research papers, and a vendor exposition. Come join us.

As a reminder, the hard drive data we use is available on our Hard Drive Test Data page. You can download and use this data for free for your own purposes. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone. It is free.

Good luck and let us know if you find anything interesting.

The post Hard Drive Stats for Q1 2017 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

AWS Hot Startups – February 2017

Post Syndicated from Ana Visneski original https://aws.amazon.com/blogs/aws/aws-hot-startups-february-2017-2/

As we finish up the month of February, Tina Barr is back with some awesome startups.

-Ana


This month we are bringing you five innovative hot startups:

  • GumGum – Creating and popularizing the field of in-image advertising.
  • Jiobit – Smart tags to help parents keep track of kids.
  • Parsec – Offers flexibility in hardware and location for PC gamers.
  • Peloton – Revolutionizing indoor cycling and fitness classes at home.
  • Tendril – Reducing energy consumption for homeowners.

If you missed any of our January startups, make sure to check them out here.

GumGum (Santa Monica, CA)
GumGum is best known for inventing and popularizing the field of in-image advertising. Founded in 2008 by Ophir Tanz, the company is on a mission to unlock the value held within the vast content produced daily via social media, editorials, and broadcasts in a variety of industries. GumGum powers campaigns across more than 2,000 premium publishers, which are seen by over 400 million users.

In-image advertising was pioneered by GumGum and has given companies a platform to deliver highly visible ads to a place where the consumer’s attention is already focused. Using image recognition technology, GumGum delivers targeted placements as contextual overlays on related pictures, as banners that fit on all screen sizes, or as In-Feed placements that blend seamlessly into the surrounding content. Using Visual Intelligence, GumGum can scour social media and broadcast TV for all images and videos related to a brand, allowing companies to gain a stronger understanding of their audience and how they are relating to that brand on social media.

GumGum relies on AWS for its Image Processing and Ad Serving operations. Using AWS infrastructure, GumGum currently processes 13 million requests per minute across the globe and generates 30 TB of new data every day. The company uses a suite of services including but not limited to Amazon EC2, Amazon S3, Amazon Kinesis, Amazon EMR, AWS Data Pipeline, and Amazon SNS. AWS edge locations allow GumGum to serve its customers in the US, Europe, Australia, and Japan and the company has plans to expand its infrastructure to Australia and APAC regions in the future.

For a look inside GumGum’s startup culture, check out their first Hackathon!

Jiobit (Chicago, IL)
Jiobit Team1
Jiobit was inspired by a real event that took place in a crowded Chicago park. A couple of summers ago, John Renaldi experienced every parent’s worst nightmare – he lost track of his then 6-year-old son in a public park for almost 30 minutes. John knew he wasn’t the only parent with this problem. After months of research, he determined that over 50% of parents have had a similar experience and an even greater percentage are actively looking for a way to prevent it.

Jiobit is the world’s smallest and longest lasting smart tag that helps parents keep track of their kids in every location – indoors and outdoors. The small device is kid-proof: lightweight, durable, and waterproof. It acts as a virtual “safety harness” as it uses a combination of Bluetooth, Wi-Fi, Multiple Cellular Networks, GPS, and sensors to provide accurate locations in real-time. Jiobit can automatically learn routes and locations, and will send parents an alert if their child does not arrive at their destination on time. The talented team of experienced engineers, designers, marketers, and parents has over 150 patents and has shipped dozens of hardware and software products worldwide.

The Jiobit team is utilizing a number of AWS services in the development of their product. Security is critical to the overall product experience, and they are over-engineering security on both the hardware and software side with the help of AWS. Jiobit is also working towards being the first child monitoring device that will have implemented an Alexa Skill via the Amazon Echo device (see here for a demo!). The devices use AWS IoT to send and receive data from the Jio Cloud over the MQTT protocol. Once data is received, they use AWS Lambda to parse the received data and take appropriate actions, including storing relevant data using Amazon DynamoDB, and sending location data to Amazon Machine Learning processing jobs.

Visit the Jiobit blog for more information.

Parsec (New York, NY)
Parsec logo large1
Parsec operates under the notion that everyone should have access to the best computing in the world because access to technology creates endless opportunities. Founded in 2016 by Benjy Boxer and Chris Dickson, Parsec aims to eliminate the burden of hardware upgrades that users frequently experience by building the technology to make a computer in the cloud available anywhere, at any time. Today, they are using their technology to enable greater flexibility in the hardware and location that PC gamers choose to play their favorite games on. Check out this interview with Benjy and our Startups team for a look at how Parsec works.

Parsec built their first product to improve the gaming experience; gamers no longer have to purchase consoles or expensive PCs to access the entertainment they love. Their low latency video streaming and networking technologies allow gamers to remotely access their gaming rig and play on any Windows, Mac, Android, or Raspberry Pi device. With the global reach of AWS, Parsec is able to deliver cloud gaming to the median user in the US and Europe with less than 30 milliseconds of network latency.

Parsec users currently have two options available to start gaming with cloud resources. They can either set up their own machines with the Parsec AMI in their region or rely on Parsec to manage everything for a seamless experience. In either case, Parsec uses the g2.2xlarge EC2 instance type. Parsec is using Amazon Elastic Block Storage to store games, Amazon DynamoDB for scalability, and Amazon EC2 for its web servers and various APIs. They also deal with a high volume of logs and take advantage of the Amazon Elasticsearch Service to analyze the data.

Be sure to check out Parsec’s blog to keep up with the latest news.

Peloton (New York, NY)
Peloton image 3
The idea for Peloton was born in 2012 when John Foley, Founder and CEO, and his wife Jill started realizing the challenge of balancing work, raising young children, and keeping up with personal fitness. This is a common challenge people face – they want to work out, but there are a lot of obstacles that stand in their way. Peloton offers a solution that enables people to join indoor cycling and fitness classes anywhere, anytime.

Peloton has created a cutting-edge indoor bike that streams up to 14 hours of live classes daily and has over 4,000 on-demand classes. Users can access live classes from world-class instructors from the convenience of their home or gym. The bike tracks progress with in-depth ride metrics and allows people to compete in real-time with other users who have taken a specific ride. The live classes even feature top DJs that play current playlists to keep users motivated.

With an aggressive marketing campaign, which has included high-visibility TV advertising, Peloton made the decision to run its entire platform in the cloud. Most recently, they ran an ad during an NFL playoff game and their rate of requests per minute to their site increased from ~2k/min to ~32.2k/min within 60 seconds. As they continue to grow and diversify, they are utilizing services such as Amazon S3 for thousands of hours of archived on-demand video content, Amazon Redshift for data warehousing, and Application Load Balancer for intelligent request routing.

Learn more about Peloton’s engineering team here.

Tendril (Denver, CO)
Tendril logo1
Tendril was founded in 2004 with the goal of helping homeowners better manage and reduce their energy consumption. Today, electric and gas utilities use Tendril’s data analytics platform on more than 140 million homes to deliver a personalized energy experience for consumers around the world. Using the latest technology in decision science and analytics, Tendril can gain access to real-time, ever evolving data about energy consumers and their homes so they can improve customer acquisition, increase engagement, and orchestrate home energy experiences. In turn, Tendril helps its customers unlock the true value of energy interactions.

AWS helps Tendril run its services globally, while scaling capacity up and down as needed, and in real-time. This has been especially important in support of Tendril’s newest solution, Orchestrated Energy, a continuous demand management platform that calculates a home’s thermal mass, predicts consumer behavior, and integrates with smart thermostats and other connected home devices. This solution allows millions of consumers to create a personalized energy plan for their home based on their individual needs.

Tendril builds and maintains most of its infrastructure services with open sources tools running on Amazon EC2 instances, while also making use of AWS services such as Elastic Load Balancing, Amazon API Gateway, Amazon CloudFront, Amazon Route 53, Amazon Simple Queue Service, and Amazon RDS for PostgreSQL.

Visit the Tendril Blog for more information!

— Tina Barr

U.S. Homeland Security ‘Harbors’ BitTorrent Pirates

Post Syndicated from Ernesto original https://torrentfreak.com/u-s-homeland-security-harbors-bittorrent-pirates-170108/

dhsDue to the public nature of BitTorrent transfers, it’s easy to see what people behind a certain IP-address are downloading.

Last month we reported about a new website that puts this information on public display. According to its operators, this information can help rightholders and law enforcement to track down pirates.

In response, we decided to do some field work to see if downloads are also linked to more unusual locations, and the answer is YES.

To the Department of Homeland Security, for example, which helped to bring down KickassTorrents a few months ago. While it’s not a place where you would expect people to be torrenting, the spy tool suggests otherwise.

We could easily spot several IP-addresses that list over a dozen recent ‘downloads’ of copyrighted material. This includes popular films, TV-series and music, but also porn and far more worrying content.

The screenshot below lists an overview of the recent torrents that are tied to a single Homeland Security IP-address. As you can see, it lists several files including the film ‘Gone Girl’ and ‘Bad Santa.’ But we’ve also seen a copy of the film Let’s Be Cops and a discography of the heavy metal band Dio.

A few DHS downloads

dhsmain

It’s worth mentioning that BitTorrent monitoring tools are regularly discredited for being prone to errors. They often don’t check whether a full copy has been downloaded, for example.

Mistakes also appeared in the ‘I Know What You Download‘ database, which previously listed downloads for several non-routable IP-addresses they picked up via DHT tracking.

However, the company’s Marketing director Andrey Rogov is confident that the DHS IPs are indeed sharing (parts of) these files.

“These reports are accurate,” Rogov tells us. “They contain information about the downloading or distribution activities of IP-addresses, for all torrents which we could classify for the last 30 days.”

The company also provided us with extra information showing combinations of specific ports and IP-addresses, which refutes the defense that a tracker added these IP-addresses as fake data.

Since Homeland Security employs more than 230,000 people, finding a pirating IP-address is hardly a surprise. In fact, there are many more in the ‘I Know What You Download’ database. This is also true for other United States Government branches.

Take The House of Representatives, for example, where adult material, Snoop Dogg, and several movies are listed as recent downloads. Again, that’s just the tip of the iceberg.

A few House downloads

housedl

In the end, the most sensible conclusion is that you’re going to find pirates in any large organization or institution. Even in the very place that just dismantled the largest torrent site on the Internet.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Top 10 Most Pirated Movies of The Week – 12/26/16

Post Syndicated from Ernesto original https://torrentfreak.com/top-10-pirated-movies-week-122616/

deepwaterhoThis week we have four newcomers in our chart.

Deepwater Horizon is the most downloaded movie.

The data for our weekly download chart is estimated by TorrentFreak, and is for informational and educational reference only. All the movies in the list are Web-DL/Webrip/HDRip/BDrip/DVDrip unless stated otherwise.

RSS feed for the weekly movie download chart.

This week’s most downloaded movies are:
Most downloaded movies via torrents

Rank   Rank last week   Movie name                            IMDb Rating / Trailer
1      (…)              Deepwater Horizon                     7.4 / trailer
2      (3)              Rogue One: A Star Wars Story (HDTS)   8.3 / trailer
3      (1)              The Magnificent Seven                 7.1 / trailer
4      (2)              The Accountant (subbed HDrip)         7.6 / trailer
5      (…)              Bad Santa 2                           5.6 / trailer
6      (…)              Max Steel                             4.6 / trailer
7      (6)              Doctor Strange (HDTS)                 8.0 / trailer
8      (4)              Inferno (subbed HDrip)                6.4 / trailer
9      (…)              Trolls                                6.6 / trailer
10     (7)              Moana (HDTS)                          8.1 / trailer

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Merry Christmas!.. Torrent Downloads and Takedowns

Post Syndicated from Ernesto original https://torrentfreak.com/merry-christmas-torrent-downloads-and-takedowns-161225/

christmasAround this time of the year, Christmas music is hard to escape, including on torrent sites.

While torrents may not be as popular as a few years ago, there are still plenty of people that use torrent sites to get the latest entertainment.

Christmas is a time for sharing, a mantra that suits BitTorrent pirates very well, and that shows. Over the past few days several Christmas compilations have populated Pirate Bay’s list of 100 most-shared music torrents.

At the time of writing ‘The Ultimate Christmas Music Compilation,’ a homemade collection of 300 songs, is the most shared Christmas torrent. The torrent was first uploaded five years ago and is still very much alive.

As can be seen below, more than six hundred people are sharing it at the time of writing. For comparison, late October the same torrent only had a measly thirteen active sharers, barely enough to survive.

It’s that time of the year…

christmastpb23dec

And there’s more Christmas music that moved up the ranks in recent days.

The Pirate Bay’s music top 100 also includes the torrents ‘100 Hits Christmas Legends (2010),’ ‘Boney M – The 20 Greatest Christmas Songs,’ ‘NOW That’s What I Call Christmas [2014],’ and ‘Pentatonix – A Pentatonix Christmas (2016)’ to name a few.

But pirates are not the only ones who are watching torrent sites for Christmas themed music. Music industry groups do too.

Christmas takedowns

piratebaytakedown1

They are less appreciative of the sharing culture and have set out numerous ‘Christmas’ takedown requests.

This week alone, Google received thousands of takedown requests mentioning the keyword Christmas. This includes takedown requests to remove various Pirate Bay results, of course, but also many other sites.

And not just for music either. Takedown requests are also going out for classic Christmas films such as Scrooged (1988) and even the virtually unknown Santa Claws (2014).

Christmas or not, it’s clear that the piracy whack-a-mole doesn’t stop. Pirates continue to share, and rightsholder groups counter this by sharing their takedown requests in return.

Merry Christmas everyone!

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Crazy Pirates Troll TorrentFreak With Bad Santa 2 Watermark

Post Syndicated from Andy original https://torrentfreak.com/crazy-pirates-troll-torrentfreak-with-bad-santa-2-watermark-161225/

xmas-trollHo! Ho! Ho! Many happy returns and Merry Christmas to all our readers. It’s Christmas Day once again and it’s been a pretty eventful year in file-sharing and copyright.

While we wish things were different, there hasn’t been much positive news to report in 2016. There’s been the occasional ray of light here and there, but overall it’s been a cascade of negativity. Today, however, we promise not to spoil anyone’s Christmas lunch or well-deserved day off.

In fact, this morning we can confidently report that for at least the next 48 hours, no one will be fined, detained, arrested, extradited, or otherwise screwed around with by rightsholder groups and their affiliates. Instead, we have a rather crazy mystery on our hands, one that we really hope you can help us solve.

On November 23, the movie Bad Santa 2 was released in the United States to a somewhat lukewarm reception. Despite the average reviews, it’s a Christmas movie so pirates were still looking for something seasonal to watch.

Three weeks ago a copy surfaced in Russia with local dubbing but this week pirates obliged with an English language edition of the Billy Bob Thornton movie. However, something embedded in one of the sundry copies left us both surprised and scratching our heads here at TF.

Within seconds of the movie starting and for the next couple of minutes, a giant watermark appears on screen. Filling the entire width of the print from border to border, the watermark then slowly makes its way up the screen until it disappears off the top.

santa-tf2

Of course, watermarks are usually put in place to indicate some kind of ownership. Studios use visible and invisible watermarks on screener copies of movies to literally stamp their name on pre-release versions of movies. However, we have absolutely no idea why someone would put our site name on a cam copy of a movie.

TorrentFreak spoke with releasers and even a couple of site operators to find out who might be behind this little surprise but we’ve had no success getting to the bottom of the mystery. It’s certainly possible that the “Streetcams” reference at the start of the watermark could hold the secret, but we’ve had no success in identifying who or what could be behind that particular brand either.

The watermark eventually scrolls away but at the end of the movie it reappears, beginning its journey from the bottom of the screen to the top in all its glory.

santa-tf3

From there, who knows where it goes but we are aware that the “streetcams” watermark has appeared elsewhere, although not with additional TorrentFreak branding. It’s more difficult to see when compared to Bad Santa 2, but here it is on a cam copy of the movie Shut In.

shut-in

So with logs on the fire and gifts on the tree, can you help us solve this cam mystery?

Merry Christmas and other celebrations to all our readers

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Amazon Redshift Engineering’s Advanced Table Design Playbook: Distribution Styles and Distribution Keys

Post Syndicated from AWS Big Data Blog original https://aws.amazon.com/blogs/big-data/amazon-redshift-engineerings-advanced-table-design-playbook-distribution-styles-and-distribution-keys/

Zach Christopherson is a Senior Database Engineer on the Amazon Redshift team.


Part 1: Preamble, Prerequisites, and Prioritization
Part 2: Distribution Styles and Distribution Keys
Part 3: Compound and Interleaved Sort Keys (December 6, 2016)
Part 4: Compression Encodings (December 7, 2016)
Part 5: Table Data Durability (December 8, 2016)


The first table and column properties we discuss in this blog series are table distribution styles (DISTSTYLE) and distribution keys (DISTKEY). This blog installment presents a methodology to guide you through the identification of optimal DISTSTYLEs and DISTKEYs for your unique workload.

When you load data into a table, Amazon Redshift distributes the rows to each of the compute nodes according to the table’s DISTSTYLE. Within each compute node, the rows are assigned to a cluster slice. Depending on node type, each compute node contains 2, 16, or 32 slices. You can think of a slice as a virtual compute node. During query execution, all slices process the rows they’ve been assigned, in parallel. The primary goal in selecting a table’s DISTSTYLE is to evenly distribute the data throughout the cluster for parallel processing.
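If you’re curious how your own cluster is laid out, a quick way to count slices per node is a small query against the STV_SLICES system view (the same view used later in this post); a minimal sketch:

-- Slices per compute node on the current cluster
SELECT node, COUNT(*) AS slices
FROM stv_slices
GROUP BY node
ORDER BY node;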

When you execute a query, the query optimizer might redistribute or broadcast the intermediate tuples throughout the cluster to facilitate any join or aggregation operations. The secondary goal in selecting a table’s DISTSTYLE is to minimize the cost of data movement necessary for query processing. To minimize that cost, data should already be located where it needs to be before the query is executed.

A table might be defined with a DISTSTYLE of EVEN, KEY, or ALL. If you’re unfamiliar with these table properties, you can watch my presentation at the 2016 AWS Santa Clara Summit, where I discussed the basics of distribution starting at the 17-minute mark. I summarize them here, with a brief DDL sketch after the list:

  • EVEN will do a round-robin distribution of data.
  • KEY requires a single column to be defined as a DISTKEY. On ingest, Amazon Redshift hashes each DISTKEY column value and consistently routes rows with the same hash value to the same slice.
  • ALL distribution stores a full copy of the table on the first slice of each node.
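As a minimal illustration of the syntax (the table and column names below are invented for this sketch, not taken from the post):

-- EVEN: rows are distributed round-robin across slices
CREATE TABLE events_even (event_id int8, event_ts timestamp) DISTSTYLE EVEN;

-- KEY: rows are hashed on the DISTKEY column, so equal values land on the same slice
CREATE TABLE events_key (event_id int8, event_ts timestamp)
  DISTSTYLE KEY DISTKEY (event_id);

-- ALL: a full copy of the table is stored on the first slice of every node
CREATE TABLE dim_small (dim_id int4, dim_name varchar(64)) DISTSTYLE ALL;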

Which style is most appropriate for your table is determined by several criteria. This post presents a two-phase flow chart that will guide you through questions to ask of your data profile to arrive at the ideal DISTSTYLE and DISTKEY for your scenario.

Phase 1: Identifying Appropriate DISTKEY Columns

Phase 1 seeks to determine if KEY distribution is appropriate. To do so, first determine if the table contains any columns that would appropriately distribute the table data if they were specified as a DISTKEY. If we find that no columns are acceptable DISTKEY columns, then we can eliminate DISTSTYLE KEY as a potential DISTSTYLE option for this table.

o_redshift_table_design_1

 

Does the column data have a uniformly distributed data profile?

 

If the hashed column values don’t enable uniform distribution of data to the cluster slices, you’ll end up with both data skew at rest and data skew in flight (during query processing), which results in a performance hit due to an unevenly parallelized workload. A nonuniformly distributed data profile occurs in scenarios such as these:

  • Distributing on a column containing a significant percentage of NULL values
  • Distributing on a column, customer_id, where a minority of your customers are responsible for the majority of your data

You can easily identify columns that contain “heavy hitters” or introduce “hot spots” by using some simple SQL code to review the dataset. In the example following, l_orderkey stands out as a poor option that you can eliminate as a potential DISTKEY column:

dev=# SELECT l_orderkey, COUNT(*)
FROM lineitem 
GROUP BY 1 
ORDER BY 2 DESC 
LIMIT 100;
 l_orderkey |   count
------------+----------
     [NULL] | 124993010
  260642439 |        80
  240404513 |        80
   56095490 |        72
  348088964 |        72
  466727011 |        72
  438870661 |        72 
...
...

When distributing on a given column, it is desirable to have a nearly consistent number of rows/blocks on each slice. Suppose that you think that you’ve identified a column that should result in uniform distribution but want to confirm this. Here, it’s much more efficient to materialize a single-column temporary table, rather than redistributing the entire table only to find out there was nonuniform distribution:

-- Materialize a single column to check distribution
CREATE TEMP TABLE lineitem_dk_l_partkey DISTKEY (l_partkey) AS 
SELECT l_partkey FROM lineitem;

-- Identify the table OID
tpch=# SELECT 'lineitem_dk_l_partkey'::regclass::oid;
  oid
--------
 240791
(1 row) 

Now that the table exists, it’s trivial to review the distribution. From the query results below, we can assess the following characteristics for a given table with a defined DISTKEY:

  • skew_rows: A ratio of the number of table rows from the slice with the most rows compared to the slice with the fewest table rows. This value defaults to 100.00 if the table doesn’t populate every slice in the cluster. Closer to 1.00 is ideal.
  • storage_skew: A ratio of the number of blocks consumed by the slice with the most blocks compared to the slice with the fewest blocks. Closer to 1.00 is ideal.
  • pct_populated: Percentage of slices in the cluster that have at least 1 table row. Closer to 100 is ideal.
SELECT "table" tablename, skew_rows,
  ROUND(CAST(max_blocks_per_slice AS FLOAT) /
  GREATEST(NVL(min_blocks_per_slice,0)::int,1)::FLOAT,5) storage_skew,
  ROUND(CAST(100*dist_slice AS FLOAT) /
  (SELECT COUNT(DISTINCT slice) FROM stv_slices),2) pct_populated
FROM svv_table_info ti
  JOIN (SELECT tbl, MIN(c) min_blocks_per_slice,
          MAX(c) max_blocks_per_slice,
          COUNT(DISTINCT slice) dist_slice
        FROM (SELECT b.tbl, b.slice, COUNT(*) AS c
              FROM STV_BLOCKLIST b
              GROUP BY b.tbl, b.slice)
        WHERE tbl = 240791 GROUP BY tbl) iq ON iq.tbl = ti.table_id;
       tablename       | skew_rows | storage_skew | pct_populated
-----------------------+-----------+--------------+---------------
 lineitem_dk_l_partkey |      1.00 |      1.00259 |           100
(1 row)

Note: A small amount of data skew shouldn’t immediately discourage you from considering an otherwise appropriate distribution key. In many cases, the benefits of collocating large JOIN operations offset the cost of cluster slices processing a slightly uneven workload.

Does the column data have high cardinality?

Cardinality is a relative measure of how many distinct values exist within the column. It’s important to consider cardinality alongside the uniformity of data distribution. In some scenarios, a uniform distribution of data can result in low relative cardinality. Low relative cardinality leads to wasted compute capacity from lack of parallelization. For example, consider a cluster with 576 slices (36x DS2.8XLARGE) and the following table:

CREATE TABLE orders (                                            
  o_orderkey int8 NOT NULL			,
  o_custkey int8 NOT NULL			,
  o_orderstatus char(1) NOT NULL		,
  o_totalprice numeric(12,2) NOT NULL	,
  o_orderdate date NOT NULL DISTKEY ,
  o_orderpriority char(15) NOT NULL	,
  o_clerk char(15) NOT NULL			,
  o_shippriority int4 NOT NULL		,
  o_comment varchar(79) NOT NULL                  
); 

 

Within this table, I retain a billion records representing 12 months of orders. Day to day, I expect that the number of orders remains more or less consistent. This consistency creates a uniformly distributed dataset:

tpch=# SELECT o_orderdate, count(*)
FROM orders GROUP BY 1 ORDER BY 2 DESC; 
 o_orderdate |  count
-------------+---------
 1993-01-18  | 2651712
 1993-08-29  | 2646252
 1993-12-05  | 2644488
 1993-12-04  | 2642598
...
...
 1993-09-28  | 2593332
 1993-12-12  | 2593164
 1993-11-14  | 2593164
 1993-12-07  | 2592324
(365 rows)

However, the cardinality is relatively low when we compare the 365 distinct values of the o_orderdate DISTKEY column to the 576 cluster slices. If each day’s value were hashed and assigned to an empty slice, this data only populates 63% of the cluster at best. About 37% of the cluster remains idle during scans against this table. In real-life scenarios, we’ll end up assigning multiple distinct values to already populated slices before we populate each empty slice with at least one value.

-- How many values are assigned to each slice
tpch=# SELECT rows/2592324 assigned_values, COUNT(*) number_of_slices FROM stv_tbl_perm WHERE name='orders' AND slice<6400
GROUP BY 1 ORDER BY 1;
 assigned_values | number_of_slices
-----------------+------------------
               0 |              307
               1 |              192
               2 |               61
               3 |               13
               4 |                3
(5 rows)

So in this scenario, on one end of the spectrum we have 307 of 576 slices not populated with any day’s worth of data, and on the other end we have 3 slices populated with 4 days’ worth of data. Query execution is limited by the rate at which those 3 slices can process their data. At the same time, over half of the cluster remains idle.

Note: The pct_slices_populated column from the table_inspector.sql query result identifies tables that aren’t fully populating the slices within a cluster.

On the other hand, suppose the o_orderdate DISTKEY column was defined with the timestamp data type and actually stores true order timestamp data (not dates stored as timestamps). In this case, the granularity of the time dimension causes the cardinality of the column to increase from the order of hundreds to the order of millions of distinct values. This approach results in all 576 slices being much more evenly populated.

Note: A timestamp column isn’t usually an appropriate DISTKEY column, because it’s often not joined or aggregated on. However, this case illustrates how relative cardinality can be influenced by data granularity, and the significance it has in resulting in a uniform and complete distribution of table data throughout a cluster.
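To see how granularity drives relative cardinality, you can compare distinct counts at different resolutions of the time dimension. The o_ordertimestamp column below is hypothetical (the example orders table stores only dates); the sketch simply illustrates the comparison:

-- Relative cardinality at different granularities of the same time dimension
SELECT
  COUNT(DISTINCT o_orderdate) AS distinct_days,                           -- hundreds of values
  COUNT(DISTINCT DATE_TRUNC('hour', o_ordertimestamp)) AS distinct_hours,
  COUNT(DISTINCT o_ordertimestamp) AS distinct_timestamps                 -- millions of values
FROM orders;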

Do queries perform selective filters on the column?

 

Even if the DISTKEY column ensures a uniform distribution of data throughout the cluster, suboptimal parallelism can arise if that same column is also used to selectively filter records from the table. To illustrate this, the same orders table with a DISTKEY on o_orderdate is still populated with 1 billion records spanning 365 days of data:

CREATE TABLE orders (                                            
  o_orderkey int8 NOT NULL			,
  o_custkey int8 NOT NULL			,
  o_orderstatus char(1) NOT NULL		,
  o_totalprice numeric(12,2) NOT NULL	,
  o_orderdate date NOT NULL DISTKEY ,
  o_orderpriority char(15) NOT NULL	,
  o_clerk char(15) NOT NULL			,
  o_shippriority int4 NOT NULL		,
  o_comment varchar(79) NOT NULL                  
); 

This time, consider the table on a smaller cluster with 80 slices (5x DS2.8XLARGE) instead of 576 slices. With a uniform data distribution and ~4-5x more distinct values than cluster slices, it’s likely that query execution is more evenly parallelized for full table scans of the table. This effect occurs because each slice is more likely to be populated and assigned an equivalent number of records.

However, in many use cases full table scans are uncommon. For example, with time series data it’s more typical for the workload to scan the past 1, 7, or 30 days of data than it is to repeatedly scan the entire table. Let’s assume I have one of these time series data workloads that performs analytics on orders from the last 7 days with SQL patterns, such as the following:

SELECT ... FROM orders 
JOIN ... 
JOIN ... 
WHERE ...
AND o_orderdate between current_date-7 and current_date-1
GROUP BY ...;  

With a predicate such as this, we limit the relevant values to just 7 days. All of these days must reside on a maximum of 7 slices within the cluster. Due to consistent hashing, slices that contain one or more of these 7 values contain all of the records for those specific values:

[email protected]/tpch=# SELECT SLICE_NUM(), COUNT(*) FROM orders 
WHERE o_orderdate BETWEEN current_date-7 AND current_date-1 
GROUP BY 1 ORDER BY 1;
 slice_num |  count
-----------+---------
         3 | 2553840
        33 | 2553892
        40 | 2555232
        41 | 2553092
        54 | 2554296
        74 | 2552168
        76 | 2552224
(7 rows)  

With the dataset shown above, we have at best 7 slices, each fetching 2.5 million rows to perform further processing. For the scenario with EVEN distribution, we expect 80 slices to fetch ~240,000 records each (((10^9 records / 365 days) * 7 days) / 80 slices). The important comparison to consider is whether there is significant overhead in having only 7 slices fetch and process 2.5 million records each, relative to all 80 slices fetching and processing ~240,000 records each.

If the overhead of having a subset of slices perform the majority of the work is significant, then you want to separate your distribution style from your selective filtering criteria. To do so, choose a different distribution key.

Use the following query to identify how frequently your scans include predicates which filter on the table’s various columns:

SELECT 
    ti."table", ti.diststyle, RTRIM(a.attname) column_name,
    COUNT(DISTINCT s.query ||'-'|| s.segment ||'-'|| s.step) as num_scans,
    COUNT(DISTINCT CASE WHEN TRANSLATE(TRANSLATE(info,')',' '),'(',' ') LIKE ('%'|| a.attname ||'%') THEN s.query ||'-'|| s.segment ||'-'|| s.step END) AS column_filters
FROM stl_explain p
JOIN stl_plan_info i ON ( i.userid=p.userid AND i.query=p.query AND i.nodeid=p.nodeid  )
JOIN stl_scan s ON (s.userid=i.userid AND s.query=i.query AND s.segment=i.segment AND s.step=i.step)
JOIN svv_table_info ti ON ti.table_id=s.tbl
JOIN pg_attribute a ON (a.attrelid=s.tbl AND a.attnum > 0)
WHERE s.tbl IN ([table_id]) 
GROUP BY 1,2,3,a.attnum
ORDER BY attnum;  

From this query result, if the potential DISTKEY column is frequently scanned, you can perform further investigation to identify if those filters are extremely selective or not using more complex SQL:

SELECT 
    ti.schemaname||'.'||ti.tablename AS "table", 
    ti.tbl_rows,
    AVG(r.s_rows_pre_filter) avg_s_rows_pre_filter,
    100*ROUND(1::float - AVG(r.s_rows_pre_filter)::float/ti.tbl_rows::float,6) avg_prune_pct,
    AVG(r.s_rows) avg_s_rows,
    100*ROUND(1::float - AVG(r.s_rows)::float/AVG(r.s_rows_pre_filter)::float,6) avg_filter_pct,
    COUNT(DISTINCT i.query) AS num,
    AVG(r.time) AS scan_time,
    MAX(i.query) AS query, TRIM(info) as filter
FROM stl_explain p
JOIN stl_plan_info i ON ( i.userid=p.userid AND i.query=p.query AND i.nodeid=p.nodeid  )
JOIN stl_scan s ON (s.userid=i.userid AND s.query=i.query AND s.segment=i.segment AND s.step=i.step)
JOIN (SELECT table_id,"table" tablename,schema schemaname,tbl_rows,unsorted,sortkey1,sortkey_num,diststyle FROM svv_table_info) ti ON ti.table_id=s.tbl
JOIN (
SELECT query, segment, step, DATEDIFF(s,MIN(starttime),MAX(endtime)) AS time, SUM(rows) s_rows, SUM(rows_pre_filter) s_rows_pre_filter, ROUND(SUM(rows)::float/SUM(rows_pre_filter)::float,6) filter_pct
FROM stl_scan
WHERE userid>1 AND type=2
AND starttime < endtime
GROUP BY 1,2,3
HAVING sum(rows_pre_filter) > 0
) r ON (r.query = i.query and r.segment = i.segment and r.step = i.step)
LEFT JOIN (SELECT attrelid,t.typname FROM pg_attribute a JOIN pg_type t ON t.oid=a.atttypid WHERE attsortkeyord IN (1,-1)) a ON a.attrelid=s.tbl
WHERE s.tbl IN ([table_id])
AND p.info LIKE 'Filter:%' AND p.nodeid > 0
GROUP BY 1,2,10 ORDER BY 1, 9 DESC;

The above SQL describes these items:

  • tbl_rows: Current number of rows in the table at this moment in time.
  • avg_s_rows_pre_filter: Number of rows that were actually scanned after the zone maps were leveraged to prune a number of blocks from being fetched.
  • avg_prune_pct: Percentage of rows that were pruned from the table just by leveraging the zone maps.
  • avg_s_rows: Number of rows remaining after applying the filter criteria defined in the SQL.
  • avg_filter_pct: Percentage of rows remaining, relative to avg_s_rows_pre_filter, after a user defined filter has been applied.
  • num: Number of queries that include this filter criteria.
  • scan_time: Average number of seconds it takes for the segment which includes that scan to complete.
  • query: Example query ID for the query that issued these filter criteria.
  • filter: Detailed filter criteria specified by user.

In the following query results, we can assess the selectivity for a given filter predicate. Your knowledge of the data profile, and how many distinct values exist within a given range constrained by the filter condition, lets you identify whether a filter should be considered selective or not. If you’re not sure of the data profile, you can always construct SQL code from the query results to get a count of distinct values within that range:

table                 | public.orders
tbl_rows              | 22751520
avg_s_rows_pre_filter | 12581124
avg_prune_pct         | 44.7021
avg_s_rows            | 5736106
avg_filter_pct        | 54.407
num                   | 2
scan_time             | 19
query                 | 1721037
filter                | Filter: ((o_orderdate < '1993-08-01'::date) AND (o_orderdate >= '1993-05-01'::date))

SELECT COUNT(DISTINCT o_orderdate) 
FROM public.orders 
WHERE o_orderdate < '1993-08-01' AND o_orderdate >= '1993-05-01';

We’d especially like to avoid columns that have query patterns with these characteristics:

  • Relative to tbl_rows:
    • A low value for avg_s_rows
    • A high value for avg_s_rows_pre_filter
  • A selective filter on the potential DISTKEY column
  • Limited distinct values within the returned range
  • High scan_time

 If such patterns exist for a column, it’s likely that this column is not a good DISTKEY candidate.

Is the column also a primary compound sortkey column?

 

Note: Sort keys are discussed in greater detail within Part 3 of this blog series.

As shown in the flow chart, even if we are using the column to selectively filter records (thereby potentially restricting post-scan processing to a portion of the slices), in some circumstances it still makes sense to use the column as the distribution key.

If we selectively filter on the column, we might also be using a sortkey on this column. This approach lets every slice use the column zone maps to quickly identify the relevant blocks to fetch (or to determine that it has none). Doing this makes selective scanning less expensive by orders of magnitude than a full column scan on each slice. In turn, this lower cost helps to offset the cost of a reduced number of slices processing the bulk of the data after the scan.

You can use the following query to determine the primary sortkey for a table:

SELECT attname FROM pg_attribute 
WHERE attrelid = [table_id] AND attsortkeyord = 1;
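As a sketch of this pattern, the hypothetical, trimmed-down variant of the orders table below uses o_orderdate as both the distribution key and the primary compound sort key, so each slice can consult its zone maps and prune blocks before doing any heavy lifting:

CREATE TABLE orders_by_date (
  o_orderkey int8 NOT NULL,
  o_custkey int8 NOT NULL,
  o_orderdate date NOT NULL
)
DISTKEY (o_orderdate)
COMPOUND SORTKEY (o_orderdate);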

Using the SQL code from the last step (used to check avg_s_rows, the number of distinct values in the returned range, and so on), we see the characteristics of a valid DISTKEY option include the following:

  • Relative to tbl_rows, a low value for avg_s_rows_pre_filter
  • Relative to avg_s_rows_pre_filter, a similar number for avg_s_rows
  • Selective filter on the potential DISTKEY column
  • Numerous distinct values within the returned range
  • Low or insignificant scan_time

If such patterns exist, it’s likely that this column is a good DISTKEY candidate.

Do the query patterns facilitate MERGE JOINs?

 

When the following criteria are met, you can use a MERGE JOIN operation, the fastest of the three join operations:

  1. Two tables are sorted (using a compound sort key) and distributed on the same columns.
  2. Both tables are over 80% sorted (svv_table_info.unsorted < 20%)
  3. These tables are joined using the DISTKEY and SORTKEY columns in the JOIN condition.

Because of these restrictive criteria, it’s unusual to encounter a MERGE JOIN operation by chance. Typically, an end user makes explicit design decisions to force this type of JOIN operation, usually because of a requirement for a particular query’s performance. If this JOIN pattern doesn’t exist in your workload, then you won’t benefit from this optimized JOIN operation.
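For illustration, here is a minimal sketch of two hypothetical tables that satisfy criteria 1 and 3: both are distributed and compound-sorted on the join column, so once both tables are sufficiently sorted (criterion 2), a join on that column is eligible for a MERGE JOIN:

-- Both tables distributed and sorted on the join column
CREATE TABLE orders_mj (
  o_orderkey int8 NOT NULL,
  o_orderdate date NOT NULL
)
DISTKEY (o_orderkey)
COMPOUND SORTKEY (o_orderkey);

CREATE TABLE lineitem_mj (
  l_orderkey int8 NOT NULL,
  l_quantity numeric(12,2) NOT NULL
)
DISTKEY (l_orderkey)
COMPOUND SORTKEY (l_orderkey);

-- A join on the shared DISTKEY/SORTKEY column can be planned as a MERGE JOIN
SELECT o.o_orderdate, SUM(l.l_quantity)
FROM orders_mj o
JOIN lineitem_mj l ON l.l_orderkey = o.o_orderkey
GROUP BY o.o_orderdate;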

The following query returns the number of statements that scanned your table, scanned another table that was sorted and distributed on the same column, and performed some type of JOIN operation:

SELECT COUNT(*) num_queries FROM stl_query
WHERE query IN (
  SELECT DISTINCT query FROM stl_scan 
  WHERE tbl = [table_id] AND type = 2 AND userid > 1
  INTERSECT
  SELECT DISTINCT query FROM stl_scan 
  WHERE tbl <> [table_id] AND type = 2 AND userid > 1
  AND tbl IN (
    SELECT DISTINCT attrelid FROM pg_attribute 
    WHERE attisdistkey = true AND attsortkeyord > 0
    MINUS
    SELECT DISTINCT attrelid FROM pg_attribute
    WHERE attsortkeyord = -1)
  INTERSECT
  (SELECT DISTINCT query FROM stl_hashjoin WHERE userid > 1
  UNION
  SELECT DISTINCT query FROM stl_nestloop WHERE userid > 1
  UNION
  SELECT DISTINCT query FROM stl_mergejoin WHERE userid > 1)
);

If this query returns any results, you potentially have an opportunity to enable a MERGE JOIN for existing queries without modifying any other tables. If this query returns no results, then you need to proactively tune multiple tables simultaneously to facilitate the performance of a single query.

Note: If a desired MERGE JOIN optimization requires reviewing and modifying multiple tables, you approach the problem in a different fashion than this straightforward approach. This more complex approach goes beyond the scope of this article. If you’re interested in implementing such an optimization, you can check our documentation on the JOIN operations and ask specific questions in the comments at the end of this blog post.

Phase One Recap

Throughout this phase, we answered questions to determine which columns in this table were potentially appropriate DISTKEY columns for our table. At the end of these steps, you might have identified zero to many potential columns for your specific table and dataset. We’ll be keeping these columns (or lack thereof) in mind as we move along to the next phase.

Phase 2: Deciding Distribution Style

Phase 2 dives deeper into the potential distribution styles to determine which is the best choice for your workload. Generally, it’s best to strive for a DISTSTYLE of KEY whenever appropriate. Choose ALL in the scenarios where it makes sense (and KEY doesn’t). Only choose EVEN when neither KEY nor ALL is appropriate.

We’ll work through the following flowchart to assist us with our decision. Because DISTSTYLE is a table property, we run through this analysis table by table, after having completed phase 1 preceding.

[Flowchart: Phase 2, deciding distribution style]

Does the table participate in JOINs?

 

DISTSTYLE ALL is only used to guarantee colocation of JOIN operations, regardless of the columns specified in the JOIN conditions. If the table doesn’t participate in JOIN operations, then DISTSTYLE ALL offers no performance benefits and should be eliminated from consideration.

JOIN operations that benefit from colocation span a robust set of database operations. WHERE clause and JOIN clause join operations (INNER, OUTER, and so on) are obviously included, and so are some not-as-obvious operations and syntax like IN, NOT IN, MINUS/EXCEPT, INTERSECT and EXISTS. When answering whether the table participates in JOINs, consider all of these operations.
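For example, all of the following patterns count as joins for this purpose (a sketch; the customer table and its c_custkey column are assumed from the TPC-H schema and aren’t defined in this post):

-- IN, EXISTS, and set operations are also joins from the planner's perspective
SELECT COUNT(*) FROM orders o
WHERE o.o_custkey IN (SELECT c_custkey FROM customer);

SELECT COUNT(*) FROM orders o
WHERE EXISTS (SELECT 1 FROM customer c WHERE c.c_custkey = o.o_custkey);

SELECT o_custkey FROM orders
INTERSECT
SELECT c_custkey FROM customer;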

This query confirms how many distinct queries have scanned this table and have included one or more JOIN operations at some point in the same query:

SELECT COUNT(*) FROM (
SELECT DISTINCT query FROM stl_scan 
WHERE tbl = [table_id] AND type = 2 AND userid > 1
INTERSECT
(SELECT DISTINCT query FROM stl_hashjoin
UNION
SELECT DISTINCT query FROM stl_nestloop
UNION
SELECT DISTINCT query FROM stl_mergejoin));

If this query returns a count of 0, then the table isn’t participating in any type of JOIN, no matter what operations are in use.

Note: Certain uncommon query patterns can cause the preceding query to return false positives (such as if you have simple scan against your table that is later appended to a result set of a subquery that contains JOINs). If you’re not sure, you can always look at the queries specifically with this code:

SELECT userid, query, starttime, endtime, rtrim(querytxt) qtxt 
FROM stl_query WHERE query IN (
SELECT DISTINCT query FROM stl_scan 
WHERE tbl = [table_id] AND type = 2 AND userid > 1
INTERSECT
(SELECT DISTINCT query FROM stl_hashjoin
UNION
SELECT DISTINCT query FROM stl_nestloop
UNION
SELECT DISTINCT query FROM stl_mergejoin))
ORDER BY starttime;

 

Does the table contain at least one potential DISTKEY column?

 

The process detailed in phase 1 helped us to identify a table’s appropriate DISTKEY columns. If no appropriate DISTKEY columns exist, then KEY DISTSTYLE is removed from consideration. If appropriate DISTKEY columns do exist, then EVEN distribution is removed from consideration.

With this simple rule, the decision is never between KEY, EVEN, and ALL—rather it’s between these:

  • KEY and ALL in cases where at least one valid DISTKEY column exists
  • EVEN and ALL in cases where no valid DISTKEY columns exist

 

Can you tolerate additional storage overhead?

 

To answer whether you can tolerate additional storage overhead, the questions are: How large is the table and how is it currently distributed? You can use the following query to answer these questions:

SELECT table_id, "table", diststyle, size, pct_used 
FROM svv_table_info WHERE table_id = [table_id];

The following example shows how many 1 MB blocks and the percentage of total cluster storage that are currently consumed by duplicate versions of the same orders table with different DISTSTYLEs:

[email protected]/tpch=# SELECT "table", diststyle, size, pct_used
FROM svv_table_info
WHERE "table" LIKE 'orders_diststyle_%';
         table         |    diststyle    | size  | pct_used
-----------------------+-----------------+-------+----------
 orders_diststyle_even | EVEN            |  6740 |   1.1785
 orders_diststyle_key  | KEY(o_orderkey) |  6740 |   1.1785
 orders_diststyle_all  | ALL             | 19983 |   3.4941
(3 rows)

For DISTSTYLE EVEN or KEY, each node receives just a portion of total table data. However, with DISTSTYLE ALL we are storing a complete version of the table on each compute node. For ALL, as we add nodes to a cluster the amount of data per node remains unchanged. Whether this is significant or not depends on your table size, cluster configuration, and storage overhead. If you use a DS2.8XLARGE configuration with 16TB of storage per node, this increase might be a negligible amount of per-node storage. However, if you use a DC1.LARGE configuration with 160GB of storage per node, then the increase in total cluster storage might be an unacceptable increase.

You can multiply the number of nodes by the current size of your KEY or EVEN distributed table to get a rough estimate of the size of the table as DISTSTYLE ALL. This approach should provide enough information to determine whether ALL results in an unacceptable growth in table storage:

SELECT "table", size, pct_used, 
 CASE diststyle
  WHEN 'ALL' THEN size::TEXT
  ELSE '< ' || size*(SELECT COUNT(DISTINCT node) FROM stv_slices)
 END est_distall_size,
 CASE diststyle
  WHEN 'ALL' THEN pct_used::TEXT
  ELSE '< ' || pct_used*(SELECT COUNT(DISTINCT node) FROM stv_slices)
 END est_distall_pct_used
FROM svv_table_info WHERE table_id = [table_id];

If the estimate is unacceptable, then DISTSTYLE ALL should be removed from consideration.

Do the query patterns tolerate reduced parallelism?

 

In MPP database systems, performance at scale is achieved by simultaneously processing portions of the complete dataset with several distributed resources. DISTSTYLE ALL means that you’re sacrificing some parallelism, for both read and write operations, to guarantee a colocation of data on each node.

At some point, the benefits of DISTSTYLE ALL tables are offset by the parallelism reduction. At this point, DISTSTYLE ALL is not a valid option. Where that threshold occurs is different for your write operations and your read operations.

Write operations

For a table with KEY or EVEN DISTSTYLE, database write operations are parallelized across each of the slices. This parallelism means that each slice needs to process only a portion of the complete write operation. For ALL distribution, the write operation doesn’t benefit from parallelism because the write needs to be performed in full on every single node to keep the full dataset synchronized on all nodes. This approach significantly reduces performance compared to the same type of write operation performed on a KEY or EVEN distributed table.

If your table is the target of frequent write operations and you find you can’t tolerate the performance hit, that eliminates DISTSTYLE ALL from consideration.

This query identifies how many write operations have modified a table:

SELECT '[table_id]' AS "table_id", 
(SELECT count(*) FROM 
(SELECT DISTINCT query FROM stl_insert WHERE tbl = [table_id]
INTERSECT
SELECT DISTINCT query FROM stl_delete WHERE tbl = [table_id])) AS num_updates,
(SELECT count(*) FROM 
(SELECT DISTINCT query FROM stl_delete WHERE tbl = [table_id]
MINUS
SELECT DISTINCT query FROM stl_insert WHERE tbl = [table_id])) AS num_deletes,
(SELECT COUNT(*) FROM 
(SELECT DISTINCT query FROM stl_insert WHERE tbl = [table_id] 
MINUS 
SELECT distinct query FROM stl_s3client
MINUS
SELECT DISTINCT query FROM stl_delete WHERE tbl = [table_id])) AS num_inserts,
(SELECT COUNT(*) FROM 
(SELECT DISTINCT query FROM stl_insert WHERE tbl = [table_id]
INTERSECT
SELECT distinct query FROM stl_s3client)) as num_copies,
(SELECT COUNT(*) FROM 
(SELECT DISTINCT xid FROM stl_vacuum WHERE table_id = [table_id]
AND status NOT LIKE 'Skipped%')) AS num_vacuum;

If your table is rarely written to, or if you can tolerate the performance hit, then DISTSTYLE ALL is still a valid option.

Read operations

Reads that access DISTSTYLE ALL tables require slices to scan and process the same data multiple times for a single query operation. This approach seeks to improve query performance by avoiding the network I/O overhead of broadcasting or redistributing data to facilitate a join or aggregation. At the same time, it increases the necessary compute and disk I/O due to the excess work being performed over the same data multiple times.

Suppose that you access the table in many ways, sometimes joining, sometimes not. In this case, you’ll need to determine if the benefit of collocating JOINs with DISTSTYLE ALL is significant and desirable or if the cost of reduced parallelism impacts your queries more significantly.

Patterns and trends to avoid

DISTSTYLE ALL tables are most appropriate for smaller, slowly changing dimension tables. As a general set of guidelines, the patterns following typically suggest that DISTSTYLE ALL is a poor option for a given table:

  • Read operations:
    • Scans against large fact tables
    • Single table scans that are not participating in JOINs
    • Scans against tables with complex aggregations (for example, several windowing aggregates with different partitioning, ordering, and frame clauses)
  • Write operations:
    • A table that is frequently modified with DML statements
    • A table that is ingested with massive data loads
    • A table that requires frequent maintenance with VACUUM or VACUUM REINDEX operations

If your table is accessed in a way that meets these criteria, then DISTSTYLE ALL is unlikely to be a valid option.

Do the query patterns utilize potential DISTKEY columns in JOIN conditions?

If the table participates in JOIN operations and has appropriate DISTKEY columns, then we need to decide between the KEY and ALL distribution styles. Considering only how the table participates in JOIN operations, and no other outside factors, these criteria apply:

  • ALL distribution is most appropriate when any of these are true:
  • KEY distribution is most appropriate when

 

Determining the best DISTKEY column

If you’ve determined that DISTSTYLE KEY is best for your table, the next step is to determine which column serves as the ideal DISTKEY column. Of the columns you’ve flagged as appropriate potential DISTKEY columns in phase 1, you’ll want to identify which has the largest impact on your particular workload.

For tables with only a single candidate column, or for workloads that only use one of the candidate columns in JOINs, the choice is obvious. For workloads with mixed JOIN conditions against the same table, the optimal column is determined by your business requirements.

For example, common scenarios to encounter and questions to ask yourself about how you want to distribute are the following:

  • My transformation SQL code and reporting workload benefit from different columns. Do I want to facilitate my transformation job or reporting performance?
  • My dashboard queries and structured reports leverage different JOIN conditions. Do I value interactive query end user experience over business-critical report SLAs?
  • Should I distribute on column_A that occurs in a JOIN condition thousands of times daily for less important analytics, or on column_B that is referenced only tens of times daily for more important analytics? Would I rather improve a 5 second query to 2 seconds 1,000 times per day, or improve a 60-minute query to 24 minutes twice per day?

Your business requirements and where you place value answer these questions, so there is no simple way to offer guidance that covers all scenarios. If you have a scenario with mixed JOIN conditions and no real winner in value, you can always test multiple distribution key options and measure what works best for you. Or you can materialize multiple copies of the table distributed on differing columns and route queries to disparate tables based on query requirements. If you end up attempting the latter approach, pgbouncer-rr is a great utility to simplify the routing of queries for your end users.
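As a rough sketch of the latter approach, CREATE TABLE ... AS can materialize copies of the same data distributed on different columns (the table and column names below follow the earlier orders example and are illustrative only):

-- Two copies of the same data, distributed on different columns;
-- route queries to whichever copy matches their JOIN conditions
CREATE TABLE orders_dist_custkey
DISTKEY (o_custkey)
AS SELECT * FROM orders;

CREATE TABLE orders_dist_orderdate
DISTKEY (o_orderdate)
AS SELECT * FROM orders;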

Next Steps

Choosing optimal DISTSTYLE and DISTKEY options for your table ensures that your data is distributed evenly for parallel processing, and that data redistribution during query execution is minimal—which ensures your complex analytical workloads perform well over multipetabyte datasets.

By following the process detailed preceding, you can identify the ideal DISTSTYLE and DISTKEY for your specific tables. The final step is to simply rebuild the tables to apply these optimizations. This rebuild can be performed at any time. However, if you intend to continue reading through parts 3, 4, and 5 of the Advanced Table Design Playbook, you might want to wait until the end before you issue the table rebuilds. Otherwise, you might find yourself rebuilding these tables multiple times to implement optimizations identified in later installments.
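If you do decide to rebuild right away, one common pattern is a deep copy into a new table that carries the desired DISTSTYLE and DISTKEY, followed by a swap. The sketch below reuses the orders definition from earlier and picks o_custkey purely as an illustrative distribution key:

BEGIN;

-- New table with the desired distribution settings
CREATE TABLE orders_new (
  o_orderkey int8 NOT NULL,
  o_custkey int8 NOT NULL,
  o_orderstatus char(1) NOT NULL,
  o_totalprice numeric(12,2) NOT NULL,
  o_orderdate date NOT NULL,
  o_orderpriority char(15) NOT NULL,
  o_clerk char(15) NOT NULL,
  o_shippriority int4 NOT NULL,
  o_comment varchar(79) NOT NULL
)
DISTKEY (o_custkey);

-- Deep copy, then swap names (validate row counts before dropping in practice)
INSERT INTO orders_new SELECT * FROM orders;
DROP TABLE orders;
ALTER TABLE orders_new RENAME TO orders;

COMMIT;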

In Part 3 of our table design playbook, I’ll describe how to use table properties related to table sorting styles and sort keys for another significant performance gain.


Amazon Redshift Engineering’s Advanced Table Design Playbook

Part 1: Preamble, Prerequisites, and Prioritization
Part 2: Distribution Styles and Distribution Keys
Part 3: Compound and Interleaved Sort Keys (December 6, 2016)
Part 4: Compression Encodings (December 7, 2016)
Part 5: Table Data Durability (December 8, 2016)


About the author


Zach Christopherson is a Palo Alto-based Senior Database Engineer at AWS.
He assists Amazon Redshift users from all industries in fine-tuning their workloads for optimal performance. As a member of the Amazon Redshift service team, he also influences and contributes to the development of new and existing service features. In his spare time, he enjoys trying new restaurants with his wife, Mary, and caring for his newborn daughter, Sophia.

 


Related

Top 10 Performance Tuning Techniques for Amazon Redshift (Updated Nov. 28, 2016)


Christmas Special: The MagPi 52 is out now!

Post Syndicated from Lucy Hattersley original https://www.raspberrypi.org/blog/magpi-christmas-special/

The MagPi Christmas Special is out now.

For the festive season, the official magazine of the Raspberry Pi community is having a maker special. This edition is packed with fun festive projects!

The MagPi issue 52 cover

The MagPi issue 52

Click here to download the MagPi Christmas Special

Here are just some of the fun projects inside this festive issue:

  • Magazine tree: turn the special cover into a Christmas tree, using LED lights to create a shiny, blinky display
  • DIY decorations: bling out your tree with NeoPixels and code
  • Santa tracker: follow Santa Claus around the world with a Raspberry Pi
  • Christmas crackers: the best low-cost presents for makers and hackers
  • Yuletide game: build Sliders, a fab block-sliding game with a festive feel.

Sliders

A Christmas game from the MagPi No.52

Inside the MagPi Christmas special

If you’re a bit Grinchy when it comes to Christmas, there’s plenty of non-festive fun to be found too:

  • Learn to use VNC Viewer
  • Find out how to build a sunrise alarm clock
  • Read our in-depth guide to Amiga emulation
  • Discover the joys of parallel computing

There’s also a huge amount of community news this month. The MagPi has an exclusive feature on Pioneers, our new programme for 12- to 15-year-olds, and news about Astro Pi winning the Arthur Clarke Award.

The Pioneers

The MagPi outlines our new Pioneers programme in detail

After that, we see some of the most stylish projects ever. Inside are the beautiful Sisyphus table (a moving work of art), a facial recognition door lock, and a working loom controlled by a Raspberry Pi.

The MagPi 52 Sisyphus Project Focus

The MagPi interviews the maker of this amazing Sisyphus table

If that wasn’t enough, we also have a big feature on adding sensors to your robots. These can be used to build a battle-bot ready for the upcoming Pi Wars challenge.

The MagPi team wishes you all a merry Christmas! You can grab The MagPi 52 in stores today: it’s in WHSmith, Tesco, Sainsbury’s, and Asda in the UK, and it will be in Micro Center and selected Barnes & Noble stores when it comes to the US. You can also buy the print edition online from our store, and it’s available digitally on our Android and iOS app.

Get a free Pi Zero
Want to make sure you never miss an issue? Subscribe today and get a Pi Zero bundle featuring the new, camera-enabled Pi Zero, and a cable bundle that includes the camera adapter.

If you subscribe to The MagPi before 7 December 2016, you will get a free Pi Zero in time for Christmas.

The post Christmas Special: The MagPi 52 is out now! appeared first on Raspberry Pi.

The Linux Foundation Technical Advisory Board election

Post Syndicated from corbet original http://lwn.net/Articles/704407/rss

The Linux Foundation’s Technical
Advisory Board
provides the development community (primarily the kernel
development community) with a voice in the Foundation’s decision-making
process. Among other things, the TAB chair holds a seat on the
Foundation’s board of directors. The next TAB election will be held on
November 2 at the Kernel Summit in Santa Fe, NM; five TAB members (½
of the total) will be selected there. The nomination process is open until
voting begins; anybody interested in serving on the TAB is encouraged to
throw their hat into the ring.

Welcome JC – Our New Office Admin!

Post Syndicated from Yev original https://www.backblaze.com/blog/welcome-jc-new-office-admin/

As the Backblaze office grows we need someone to herd all of our cats (well, in our case dogs). That responsibility used to fall to a bunch of people who were all really, really busy with their own workload. Now, that person is JC! And she’s doing an awesome job – we even have a new refrigerator (our previous one was broken for about a year). Let’s learn a bit more about JC, shall we?

What is your Backblaze Title?
Office Administrator, or She-Makes-Sure-All-Employees-Are-Hydrated-Caffeinated-And-Fed.

Where are you originally from?
West Philadelphia born and raised…just kidding. I’m a San Jose, California native.

What attracted you to Backblaze?
My friend Chris has worked here for a couple of years. He’s always talked about the great work environment. Backblaze offered a small, family-like atmosphere and chance to grow with and impact a company. You don’t find too many opportunities like this.

What do you expect to learn while being at Backblaze?
More about cloud storage and backup. If someone could teach me how to play ukulele while I’m here that would be great, too.

Where else have you worked?
Just about every mall in the Santa Clara County area. Memorable stores include Hot Topic, Toys “R” Us, and Starbucks. I also enjoy working for local theater companies and have moonlighted as a House Manager, Box Office Manager, Backstage Manager, and Marketing Assistant.

Where did you go to school?
I spent some time going to Sacramento State University before transferring to San Jose State University where I earned my B.S. in Psychology, and a secondary B.A. in Theater Arts.

What’s your dream job?
Actress/Full-time Vlogger. I really enjoy performing and entertaining.

Favorite place you’ve traveled?
The United Kingdom. I was incredibly lucky to take a senior trip and do a theater tour of the UK. I enjoyed all the tourist sites as well as getting some time to enjoy productions in the West End and see a performance by the Royal Shakespeare Company.

Favorite hobby?
Did I mention theater? Aside from the performing arts I also enjoy playing World of Warcraft and RPGs.

Of what achievement are you most proud?
A couple of years ago I completed a half marathon. It’s now a new goal of mine to finish a full marathon.

Star Trek or Star Wars?
Both? I feel the Star Wars movies are far superior over the Star Trek movies, and I grew up watching Star Trek: The Next Generation.

Coke or Pepsi?
Pepsi. This is what happens when you go to a California State University. They have an agreement with Pepsi and that’s all you get.

Favorite food?
Burritos! It can be breakfast, lunch, or dinner. You can fill it with leftovers. You can get creative and throw in all kinds of crazy combinations. Did you know they make sushi burritos? And it is, by far, the most convenient food to eat whilst driving.

Why do you like certain things?
Well, on a physical level my reward center is activated in my basal ganglia portion of my brain, and dopamine is released creating a sense of pleasure or reward. On an emotional/spiritual level I usually like things because I have a connection with them.

Anything else you’d like to tell us?
I often talk about my cat, Disneyland, or YouTube. I’m obsessed.

We keep hiring people that love Disneyland. We might have to have a company off-site there eventually. Thank you for keeping our shiny new fridge stocked and for helping us keep our office under control!

The post Welcome JC – Our New Office Admin! appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Watch the AWS Summit – Santa Clara Keynote in Real Time on July 13

Post Syndicated from Craig Liebendorfer original https://blogs.aws.amazon.com/security/post/Tx1UMV1L79BHDWJ/Watch-the-AWS-Summit-Santa-Clara-Keynote-in-Real-Time-on-July-13

Join us online Wednesday, July 13, at 10:00 A.M. Pacific Time for the AWS Summit – Santa Clara Livestream! This keynote presentation, given by Dr. Matt Wood, AWS General Manager of Product Strategy, will highlight the newest AWS features and services, and select customer stories. Don’t miss this live presentation!

Join us in person at the Santa Clara Convention Center
If you are in the Santa Clara area and would like to attend the free Summit, you still have time. Register now to attend.

The Summit includes:

  • More than 50 technical sessions, including these security-related sessions:

    • Automating Security Operations in AWS (Deep Dive)
    • Securing Cloud Workloads with DevOps Automation
    • Deep Dive on AWS IoT
    • Getting Started with AWS Security (Intro)
    • Network Security and Access Control within AWS (Intro)
  • Training opportunities in Hands-on Labs.
  • Full-day training bootcamps. Registration is $600.
  • The opportunity to learn best practices and get questions answered from AWS engineers, expert customers, and partners.
  • Networking opportunities with your cloud and IT peers.

– Craig 

P.S. Can’t make the Santa Clara event? Check out our other AWS Summit locations. If you have summit questions, please contact us at [email protected]

Annalisa Joins the Support Crew

Post Syndicated from Yev original https://www.backblaze.com/blog/annalisa-joins-support-crew/


As Backblaze continues to grow, we’re constantly on the lookout for talented folks in all of our departments. Our latest hire is Annalisa, who is joining our Support team as a Junior Technical Support Representative. Let’s learn a bit more about Annalisa, shall we?

What is your Backblaze Title?
Junior Technical Support Representative

Where are you originally from?
I was born and raised in West Des Moines, IA, but have spent the last decade with Baton Rouge, LA being my homebase.

What attracted you to Backblaze?
The simplicity of the product and the atmosphere/attitude of the employees is what attracted me to Backblaze.

What do you expect to learn while being at Backblaze?
I expect to learn as much as possible about how cloud backup actually works.

Where else have you worked?
Since school, I’ve worked in Nashville, New Orleans, Santa Monica, Los Angeles, and Baton Rouge. I am ready to not move again.

Where did you go to school?
Middle Tennessee State University for Sound Engineering.

What’s your dream job?
Someday I would like to be the ADR Sound Mixer on big budget movies, or mix an album for Justin Timberlake.

Favorite place you’ve traveled?
My favorite place to visit is Philadelphia. The scenery is beautiful and there is so much history everywhere.

Favorite hobby?
I love creating unique Doctor Who or Marvel Cross-stitch.

Of what achievement are you most proud?
Creating and fully executing a sound system in a gym with no complaints from the audience.

Coke or Pepsi?
Diet Pepsi

Why do you like certain things?
I like things that make me think or end up making me learn something.

Annalisa joins the select few folks in the office that prefer Pepsi to Coke – but we’ll keep on stocking it. Welcome Annalisa, and if the sound quality in our videos gets better, you’ll know who to thank!

The post Annalisa Joins the Support Crew appeared first on Backblaze Blog | The Life of a Cloud Backup Company.

Surviving the Zombie Apocalypse with Serverless Microservices

Post Syndicated from Aaron Kao original https://aws.amazon.com/blogs/compute/surviving-the-zombie-apocalypse-with-serverless-microservices/

Run Apps without the Bite!

by: Kyle Somers – Associate Solutions Architect

Let’s face it, managing servers is a pain! Capacity management and scaling is even worse. Now imagine dedicating your time to SysOps during a zombie apocalypse — barricading the door from flesh eaters with one arm while patching an OS with the other.

This sounds like something straight out of a nightmare. Lucky for you, this doesn’t have to be the case. Over at AWS, we’re making it easier than ever to build and power apps at scale with powerful managed services, so you can focus on your core business – like surviving – while we handle the infrastructure management that helps you do so.

Join the AWS Lambda Signal Corps!

At AWS re:Invent in 2015, we piloted a workshop where participants worked in groups to build a serverless chat application for zombie apocalypse survivors, using Amazon S3, Amazon DynamoDB, Amazon API Gateway, and AWS Lambda. Participants learned about microservices design patterns and best practices. They then extended the functionality of the serverless chat application with various add-on functionalities – such as mobile SMS integration, and zombie motion detection – using additional services like Amazon SNS and Amazon Elasticsearch Service.

Between the widespread interest in serverless architectures and AWS Lambda by our customers, we’ve recognized the excitement around this subject. Therefore, we are happy to announce that we’ll be taking this event on the road in the U.S. and abroad to recruit new developers for the AWS Lambda Signal Corps!

 

Help us save humanity! Learn More and Register Here!

 

Washington, DC | March 10 – Mission Accomplished!

San Francisco, CA @ AWS Loft | March 24 – Mission Accomplished!

New York City, NY @ AWS Loft | April 13 – Mission Accomplished!

London, England @ AWS Loft | April 25

Austin, TX | April 26

Atlanta, GA | May 4

Santa Monica, CA | June 7

Berlin, Germany | July 19

San Francisco, CA @ AWS Loft | August 16

New York City, NY @ AWS Loft | August 18

 

If you’re unable to join us at one of these workshops, that’s OK! In this post, I’ll show you how our survivor chat application incorporates some important microservices design patterns and how you can power your apps in the same way using a serverless architecture.


 

What Are Serverless Architectures?

At AWS, we know that infrastructure management can be challenging. We also understand that customers prefer to focus on delivering value to their business and customers. There’s a lot of undifferentiated heavy lifting involved in building and running applications, such as installing software, managing servers, coordinating patch schedules, and scaling to meet demand. Serverless architectures allow you to build and run applications and services without having to manage infrastructure. Your application still runs on servers, but all the server management is done for you by AWS. Serverless architectures can make it easier to build, manage, and scale applications in the cloud by eliminating much of the heavy lifting involved with server management.

Key Benefits of Serverless Architectures

  • No Servers to Manage: There are no servers for you to provision and manage. All the server management is done for you by AWS.
  • Increased Productivity: You can now fully focus your attention on building new features and apps because you are freed from the complexities of server management, allowing you to iterate faster and reduce your development time.
  • Continuous Scaling: Your applications and services automatically scale up and down based on size of the workload.

What Should I Expect to Learn at a Zombie Microservices Workshop?

The workshop content we developed is designed to demonstrate best practices for serverless architectures using AWS. In this post we’ll discuss the following topics:

  • Which services are useful when designing a serverless application on AWS (see below!)
  • Design considerations for messaging, data transformation, and business or app-tier logic when building serverless microservices.
  • Best practices demonstrated in the design of our zombie survivor chat application.
  • Next steps for you to get started building your own serverless microservices!

Several AWS services were used to design our zombie survivor chat application. Each of these services is managed and highly scalable. Let’s take a quick look at which ones we incorporated in the architecture:

  • AWS Lambda allows you to run your code without provisioning or managing servers. Just upload your code (currently Node.js, Python, or Java) and Lambda takes care of everything required to run and scale your code with high availability. You can set up your code to automatically trigger from other AWS services or call it directly from any web or mobile app. Lambda is used to power many use cases, such as application back ends, scheduled administrative tasks, and even big data workloads via integration with other AWS services such as Amazon S3, DynamoDB, Redshift, and Kinesis.
  • Amazon Simple Storage Service (Amazon S3) is our object storage service, which provides developers and IT teams with secure, durable, and scalable storage in the cloud. S3 is used to support a wide variety of use cases and is easy to use with a simple interface for storing and retrieving any amount of data. In the case of our survivor chat application, it can even be used to host static websites with CORS and DNS support.
  • Amazon API Gateway makes it easy to build RESTful APIs for your applications. API Gateway is scalable and simple to set up, allowing you to build integrations with back-end applications, including code running on AWS Lambda, while the service handles the scaling of your API requests.
  • Amazon DynamoDB is a fast and flexible NoSQL database service for all applications that need consistent, single-digit millisecond latency at any scale. It is a fully managed cloud database and supports both document and key-value store models. Its flexible data model and reliable performance make it a great fit for mobile, web, gaming, ad tech, IoT, and many other applications.

Overview of the Zombie Survivor Chat App

The survivor chat application represents a completely serverless architecture that delivers a baseline chat application (written using AngularJS) to workshop participants upon which additional functionality can be added. In order to deliver this baseline chat application, an AWS CloudFormation template is provided to participants, which spins up the environment in their account. The following diagram represents a high level architecture of the components that are launched automatically:

High-Level Architecture of Survivor Serverless Chat App

  • Amazon S3 bucket is created to store the static web app contents of the chat application.
  • AWS Lambda functions are created to serve as the back-end business logic tier for processing reads/writes of chat messages.
  • API endpoints are created using API Gateway and mapped to Lambda functions. The API Gateway POST method points to a WriteMessages Lambda function. The GET method points to a GetMessages Lambda function.
  • A DynamoDB messages table is provisioned to act as our data store for the messages from the chat application.

Serverless Survivor Chat App Hosted on Amazon S3

With the CloudFormation stack launched and the components built out, the end result is a fully functioning chat app hosted in S3, using API Gateway and Lambda to process requests, and DynamoDB as the persistence for our chat messages.

With this baseline app, participants join in teams to build out additional functionality, including the following:

  • Integration of SMS/MMS via Twilio. Send messages to chat from SMS.
  • Motion sensor detection of nearby zombies with Amazon SNS and Intel® Edison and Grove IoT Starter Kit. AWS provides a shared motion sensor for the workshop, and you consume its messages from SNS.
  • Help-me panic button with IoT.
  • Integration with Slack for messaging from another platform.
  • Typing indicator to see which survivors are typing.
  • Serverless analytics of chat messages using Amazon Elasticsearch Service (Amazon ES).
  • Any other functionality participants can think of!

As a part of the workshop, AWS provides guidance for most of these tasks. With these add-ons completed, the architecture of the chat system begins to look quite a bit more sophisticated, as shown below:

Architecture of Survivor Chat with Additional Add-on Functionality

Architectural Tenants of the Serverless Survivor Chat

For the most part, the design patterns you’d see in a traditional server-yes environment you will also find in a serverless environment. No surprises there. With that said, it never hurts to revisit best practices while learning new ones. So let’s review some key patterns we incorporated in our serverless application.

Decoupling Is Paramount

In the survivor chat application, Lambda functions are serving as our tier for business logic. Since users interact with Lambda at the function level, it serves you well to split up logic into separate functions as much as possible so you can scale the logic tier independently from the source and destinations upon which it serves.

As you’ll see in the architecture diagram in the above section, the application has separate Lambda functions for the chat service, the search service, the indicator service, etc. Decoupling is also incorporated through the use of API Gateway, which exposes our back-end logic via a unified RESTful interface. This model allows us to design our back-end logic with potentially different programming languages, systems, or communications channels, while keeping the requesting endpoints unaware of the implementation. Use this pattern and you won’t cry for help when you need to scale, update, add, or remove pieces of your environment.

Separate Your Data Stores

Treat each data store as an isolated application component of the service it supports. One common pitfall when following microservices architectures is to forget about the data layer. By keeping the data stores specific to the service they support, you can better manage the resources needed at the data layer specifically for that service. This is the true value in microservices.

In the survivor chat application, this practice is illustrated with the Activity and Messages DynamoDB tables. The activity indicator service has its own data store (Activity table) while the chat service has its own (Messages). These tables can scale independently along with their respective services. This scenario also represents a good example of statefulness: the implementation of the talking indicator add-on uses DynamoDB via the Activity table to track state information about which users are talking. Remember, many of the benefits of microservices are lost if the components are still all glued together at the data layer in the end, creating a messy common denominator for scaling.

Leverage Data Transformations up the Stack

When designing a service, data transformation and compatibility are big components. How will you handle inputs from many different clients, users, and systems for your service? Will you run different flavors of your environment to correspond with different incoming request standards? Absolutely not!

With API Gateway, data transformation becomes significantly easier through built-in models and mapping templates. With these features you can build data transformation and mapping logic into the API layer for requests and responses. This results in less work for you since API Gateway is a managed service. In the case of our survivor chat app, AWS Lambda and the application itself require JSON, while Twilio likes XML for the SMS integration. This type of transformation can be offloaded to API Gateway, leaving you with a cleaner business tier and one less thing to design around!

Use API Gateway as your interface and Lambda as your common backend implementation. API Gateway uses Apache Velocity Template Language (VTL) and JSONPath for transformation logic. Of course, there is a trade-off to be considered, as a lot of transformation logic could be handled in your business-logic tier (Lambda). But, why manage that yourself in application code when you can transparently handle it in a fully managed service through API Gateway? Here are a few things to keep in mind when handling transformations using API Gateway and Lambda:

  • Transform first; then call your common back-end logic.
  • Use API Gateway VTL transformations first when possible.
  • Use Lambda to preprocess data in ways that VTL can’t.

Using API Gateway VTL for Input/Output Data Transformations

 

Security Through Service Isolation and Least Privilege

As a general recommendation when designing your services, always utilize least privilege and isolate components of your application to provide control over access. In the survivor chat application, a permissions-based model is used via AWS Identity and Access Management (IAM). IAM is integrated in every service on the AWS platform and provides the capability for services and applications to assume roles with strict permission sets to perform their least-privileged access needs. Along with access controls, you should implement audit and access logging to provide the best visibility into your microservices. This is made easy with Amazon CloudWatch Logs and AWS CloudTrail. CloudTrail enables audit capability of API calls made on the platform while CloudWatch Logs enables you to ship custom log data to AWS. Although our implementation of Amazon Elasticsearch in the survivor chat is used for analyzing chat messages, you can easily ship your log data to it and perform analytics on your application. You can incorporate security best practices in the following ways with the survivor chat application:

  • Each Lambda function should have an IAM role to access only the resources it needs. For example, the GetMessages function can read from the Messages table while the WriteMessages function can write to it. But they cannot access the Activities table that is used to track who is typing for the indicator service.
  • Each API Gateway endpoint must have IAM permissions to execute the Lambda function(s) it is tied to. This model ensures that Lambda is only executed from the principle that is allowed to execute it, in this case the API Gateway method that triggers the back end function.
  • DynamoDB requires read/write permissions via IAM, which limits anonymous database activity.
  • Use AWS CloudTrail to audit API activity on the platform and among the various services. This provides traceability, especially to see who is invoking your Lambda functions.
  • Design Lambda functions to publish meaningful outputs, as these are logged to CloudWatch Logs on your behalf.

FYI, in our application, we allow anonymous access to the chat API Gateway endpoints. We want to encourage all survivors to plug into the service without prior registration and start communicating. We’ve assumed zombies aren’t intelligent enough to hack into our communication channels. Until the apocalypse, though, stay true to API keys and authorization with signatures, which API Gateway supports!

Don’t Abandon Dev/Test

When developing with microservices, you can still leverage separate development and test environments as a part of the deployment lifecycle. AWS provides several features to help you continue building apps along the same trajectory as before, including these:

  • Lambda function versioning and aliases: Use these features to version your functions based on the stages of deployment such as development, testing, staging, pre-production, etc. Or perhaps make changes to an existing Lambda function in production without downtime.
  • Lambda service blueprints: Lambda comes with dozens of blueprints to get you started with prewritten code that you can use as a skeleton, or a fully functioning solution, to complete your serverless back end. These include blueprints with hooks into Slack, S3, DynamoDB, and more.
  • API Gateway deployment stages: Similar to Lambda versioning, this feature lets you configure separate API stages, along with unique stage variables and deployment versions within each stage. This allows you to test your API with the same or different back ends while it progresses through changes that you make at the API layer.
  • Mock Integrations with API Gateway: Configure dummy responses that developers can use to test their code while the true implementation of your API is being developed. Mock integrations make it faster to iterate through the API portion of a development lifecycle by streamlining pieces that used to be very sequential/waterfall.

Using Mock Integrations with API Gateway

Stay Tuned for Updates!

Now that you’ve got the necessary best practices to design your microservices, do you have what it takes to fight against the zombie hoard? The serverless options we explored are ready for you to get started with and the survivors are counting on you!

Be sure to keep an eye on the AWS GitHub repo. Although I didn’t cover each component of the survivor chat app in this post, we’ll be deploying this workshop and code soon for you to launch on your own! Keep an eye out for Zombie Workshops coming to your city, or nominate your city for a workshop here.

For more information on how you can get started with serverless architectures on AWS, refer to the following resources:

Whitepaper – AWS Serverless Multi-Tier Architectures

Reference Architectures and Sample Code

*Special thanks to my colleagues Ben Snively, Curtis Bray, Dean Bryen, Warren Santner, and Aaron Kao at AWS. They were instrumental to our team developing the content referenced in this post.

Foundation report for 2014

Post Syndicated from Michael "Monty" Widenius original http://monty-says.blogspot.com/2015/01/foundation-report-for-2014.html

2014 was a productive year for the MariaDB Foundation. Here is a list of some of the things MariaDB Foundation employees accomplished during 2014.

The 3 full-time MariaDB Foundation developers have worked hard to make MariaDB better:
- Some 260 commits
- Some 25 reviews of code from the MariaDB community
- Fixed some 170 bugs and implemented new features. For a full list, please check Jira.
- Reported some 160 bugs

Some of the main new features Foundation developers worked on in 2014 are listed below (a few of them are illustrated in the short SQL sketch at the end of this post):
- Porting and improving MariaDB on IBM Power8
- Porting Galera to MariaDB 10.1 as a standard feature
- Query timeouts (MDEV-4427)
- Some coding and reviews of parallel replication in MariaDB 10.1
- Working with code from Google and Eperi to get table space and table level encryption for InnoDB and XtraDB
- Allowing storage engines to shortcut GROUP BY queries (for ScaleDB) (MDEV-6080)
- Mroonga storage engine (reviews and porting help)
- Connect storage engine (reviews and porting help)
- Spider storage engine (merging code with MariaDB)
- Merge INET6_ATON() and INET6_NTOA() from MySQL-5.6 (MDEV-4051)
- Make “CAST(time_expr AS DATETIME)” compatible … SQL Standard (MDEV-5372)
- Command line variable to choose MariaDB-5.3 vs MySQL-5.6 temporal data formats (MDEV-5528)
- Added CREATE OR REPLACE syntax for tables, databases, stored procedures, UDFs, and views (MDEV-5491). The original TABLE code was done by Monty; other parts were done as a Google Summer of Code project by Sriram Patil with Alexander Barkov as a mentor.
- Upgraded the bundled Perl Compatible Regular Expression library (PCRE) to 8.34 (MDEV-5304)
- Reduced usage of LOCK_open (MDEV-5403, MDEV-5492, MDEV-5587)
- Ported patches from WebScaleSQL to MariaDB (MDEV-6039)
- Better preallocation of memory (MDEV-7004)
- Lock-free hash for the table definition cache (MDEV-7324)
- A lot of speed optimizations (changing mutex usage, better memory allocations, optimized bottlenecks, memory barriers, etc.)

The MariaDB documentation/knowledgebase now has 3685 articles about MariaDB and MySQL. Foundation employees added 223 new articles during 2014 and made 6045 edits. Some of the main new articles from us are:
- All the system and status variables for all storage engines and plugins should be documented, including variable differences between MariaDB 5.5 and MariaDB 10.0, and between MariaDB 10.0 and MySQL 5.6
- Updated documentation for changes related to MariaDB 10.1
- Upgrading from MariaDB 5.5 to MariaDB 10.0
- Spider
- OQGRAPH
- Galera
- Sphinx
- Mroonga
- Information Schema Tables
- Common MariaDB Queries
- C API
- mysql database tables
- Overview of MariaDB logs
- OLD_MODE
- Encryption of tables and table spaces in MariaDB 10.1
- Some 10 blog posts (this we need to do better...)

We also have a lot of outside contributors and translators. Thanks a lot to all of you!

We also visited and talked about MariaDB at a lot of conferences:
- February: Community events in Japan & Korea
- April: The first MariaDB Foundation conference. This was a free event open to all, and we made videos of all the presentations!
- April: Talk and booth at Percona Live in Santa Clara
- April: Talks at Linux Fest Bellingham
- July: Booth and BoF at OSCON Portland
- October: Talk at All Your Base in Oxford
- October: Talk about MySQL and MariaDB for Chinese entrepreneurs in Beijing as part of China Finland Golden Bridge
- November: Talk at Codemesh in London
- November: Talks at PHP Buenos Aires
- November: Talk about open source business models at “Build Stuff” in Vilnius
- November: Keynote and talk at CodeMotion Milan

In addition, I gave several talks at companies that were moving big installations to MariaDB and needed advice.

We were also able to finalize the MariaDB trademark agreement between the MariaDB Corporation and the MariaDB Foundation. This ensures that anyone can be part of MariaDB development on equal terms. The actual trademark agreement can be found here.

On the personnel side, we were sad to see Simon Phipps leave the position as CEO of the Foundation. On the plus side, two new persons joined the MariaDB Foundation this week:
- We are happy to have Otto Kekäläinen join us as the new CEO of the MariaDB Foundation! Otto has done great work in the past to get MariaDB into Debian, and I am looking forward to his work on improving everything we do in the MariaDB Foundation.
- Vicențiu Ciorbaru has joined the MariaDB Foundation as a developer. In the past Vicențiu added ROLES to MariaDB as part of a Google Summer of Code project, and he is now interested in working on the MariaDB optimizer. A special thanks to Jean-Paul Smets at Nexedi for sponsoring his work at the Foundation!

Last, I want to give my thanks to the MariaDB Foundation members who made all the foundation work possible for 2014:
- Automattic
- MariaDB Corporation (former SkySQL Ab)
- Parallels
- Zeinmax

For 2015 we welcome a new member, Visma. Visma will be part of the foundation board and will help push MariaDB development forwards.

As the above shows, the MariaDB Foundation is not only a guarantee that MariaDB will always be an actively developed open source project; we also do a lot of development and practical work. This is, however, only possible if we have active members who sponsor our work! If you are interested in helping us, either as a member, a sponsor, or by giving development resources to the MariaDB Foundation, please email us at foundation at mariadb.org!
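To give a concrete flavour of a few of the features listed above, here is a minimal SQL sketch. It assumes a MariaDB 10.1 (or later) server; the table name, columns, and values are made up purely for illustration.

-- Query timeouts (MDEV-4427): abort any statement in this session that runs
-- longer than 2 seconds (the max_statement_time variable landed in MariaDB 10.1).
SET SESSION max_statement_time = 2;

-- CREATE OR REPLACE (MDEV-5491): drop the old table, if one exists, and create
-- the new one in a single statement.
CREATE OR REPLACE TABLE demo_hosts (id INT PRIMARY KEY, addr VARBINARY(16));

-- INET6_ATON()/INET6_NTOA() (MDEV-4051): convert an IPv6 address to its binary
-- form for storage, and back to text when reading.
INSERT INTO demo_hosts VALUES (1, INET6_ATON('2001:db8::1'));
SELECT id, INET6_NTOA(addr) FROM demo_hosts;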
