Security updates have been issued by Debian (chromium-browser, evince, pdns-recursor, and simplesamlphp), Fedora (ceph, dhcp, erlang, exim, fedora-arm-installer, firefox, libvirt, openssh, pdns-recursor, rubygem-yard, thunderbird, wordpress, and xen), Red Hat (rh-mysql57-mysql), SUSE (kernel), and Ubuntu (openssl).
Post Syndicated from Yev original https://www.backblaze.com/blog/2017-holiday-gift-guide-backblaze-style/
Here at Backblaze we have a lot of folks who are all about technology. With the holiday season fast approaching, you might have all of your gift buying already finished — but if not, we put together a list of things that the employees here at Backblaze are pretty excited about giving (and/or receiving) this year.
It’s no secret that having a smart home is the new hotness, and many of the items below can be used to turbocharge your home’s ascent into the future:
The holidays are all about eating pie — well, why not get a pie of a different type for the DIY fan in your life!
An inexpensive way to keep a close eye on all your favorite people…and intruders!
Have trouble falling asleep? Try this portable white noise machine. Also great for the office!
Amazon Echo Dot
Need a cheap way to keep track of your schedule or play music? The Echo Dot is a great entry into the smart home of your dreams!
These little fellows make it easy to Wifi-ify your entire home, even if it’s larger than the average shoe box here in Silicon Valley. Google Wifi acts as a mesh router and seamlessly covers your whole dwelling. Have a mansion? Buy more!
Like the Amazon Echo Dot, this is the Google variant. It’s more expensive (similar to the Amazon Echo) but has better sound quality and is tied into the Google ecosystem.
This is a smart thermostat. What better way to score points with the in-laws than installing one of these bad boys in their home — and then making it freezing cold randomly in the middle of winter from the comfort of your couch!
Homes aren’t the only things that should be smart. Your body should also get the chance to be all that it can be:
You’ve seen these all over the place, and the truth is they do a pretty good job of making sounds appear in your ears.
Bose SoundLink Wireless Headphones
If you like over-the-ear headphones, these noise canceling ones work great, are wireless and lovely. There’s no better way to ignore people this holiday season!
Garmin Fenix 5 Watch
This watch is all about fitness. If you enjoy fitness. This watch is the fitness watch for your fitness needs.
The Apple Watch is a wonderful gadget that will light up any movie theater this holiday season.
Nokia Steel Health Watch
If you’re into mixing analogue and digital, this is a pretty neat little gadget.
Fossil Smart Watch
This stylish watch is a pretty neat way to dip your toe into smartwatches and activity trackers.
Pebble Time Steel Smart Watch
Some people call this the greatest smartwatch of all time. Those people might be named Yev. This watch is great at sending you notifications from your phone, and not needing to be charged every day. Bellissimo!
A few of the holiday gift suggestions that we got were a bit off-kilter, but we do have a lot of interesting folks in the office. Hopefully, you might find some of these as interesting as they do:
Wireless Qi Charger
Wireless chargers are pretty great in that you don’t have to deal with dongles. There are even kits to make your electronics “wirelessly chargeable” which is pretty great!
Self-Heating Coffee Mug
Love coffee? Hate lukewarm coffee? What if your coffee cup heated itself? Brilliant!
Yeast. It makes beer. And bread! Sometimes you need to stir it. What cooler way to stir your yeast than with this industrial stirrer?
This one is self explanatory. You know the old rhyme: happy butts, everyone’s happy!
Good luck out there this holiday season!
Security updates have been issued by CentOS (postgresql), Debian (firefox-esr, kernel, libxcursor, optipng, thunderbird, wireshark, and xrdp), Fedora (borgbackup, ca-certificates, collectd, couchdb, curl, docker, erlang-jiffy, fedora-arm-installer, firefox, git, linux-firmware, mupdf, openssh, thunderbird, transfig, wildmidi, wireshark, xen, and xrdp), Mageia (firefox and optipng), openSUSE (erlang, libXfont, and OBS toolchain), Oracle (kernel), Slackware (openssl), and SUSE (kernel and OBS toolchain).
Post Syndicated from Andy original https://torrentfreak.com/terrarium-tv-the-best-android-movie-streaming-app-faces-uncertain-future-171210/
In early 2014, Popcorn Time turned the consumer end of the piracy world upside down. Utilizing a BitTorrent backend and a beautiful interface, Popcorn Time certainly lived up to its promise of being the Netflix for Pirates. Adopted by millions of users, it soon became a household name.
In the months and years to come, Popcorn Time grabbed hundreds of headlines. However, aside from the app’s success, much of what followed was negative in tone as the entertainment industries struggled to contain this new kid on the block.
Since then, of course, Kodi and its numerous illicit third-party addons have become massive news, stepping over Popcorn Time to become the most talked about and consumer-friendly of piracy tools. In the background, however, other applications have been making steady and indeed somewhat stealthy progress.
One of these applications is Terrarium TV. Built exclusively for the Android platform and equally at home on a phone, tablet, streaming stick or set-top box, this software has gained a cult but significant following. For those out of the loop, it may be the most important piracy app they’ve never heard of, despite its Facebook page alone attracting close to 200,000 members.
In many respects, Terrarium TV resembles Popcorn Time. It has a beautiful Netflix-style interface, pulling movie and TV show artwork and metadata from several sources to make what some consider to be the best all-in-one streaming app for Android, period. While Kodi is no doubt powerful and Popcorn Time has one hell of a reputation, Terrarium TV makes viewing simplicity itself. And it really does cater to everyone.
If people are worried about Popcorn Time due to its BitTorrent-based streaming, Terrarium TV has that covered. Every single stream offered by the app is conjured up from public sources such as file-hosting sites and even GoogleVideo. On the whole, streaming is of extremely high quality, with dozens of sources offered for most content, whether that’s the latest Hollywood movies, blockbuster TV shows, or decades-old rarities.
The quality is impressive too. While 4K rips are best left to the BitTorrent crowd with bandwidth to spare, Terrarium TV manages to conjure up a bewildering range of content in an impressive array of qualities. HD is commonplace and barely a search goes by without a corresponding source alongside. And with multi-language subtitle and Chromecast support, the icing is placed on top of what is an extremely competent cake.
But despite all the accolades, Terrarium TV has an uncertain future.
Over a week ago TerrariumTV.com – the site from where the application has been seamlessly delivered for some time – suddenly disappeared without trace. There was no announcement on Facebook, no announcement on Twitter. Even the moderators on the fairly active Terrarium TV subreddit seemed to have few ideas as to what was going on.
Theories are numerous but most center around the developer, who’s resident in Asia, going on some kind of hiatus. Why that would require the Terrarium TV website to be taken down isn’t clear. Neither does it explain why the Terrarium TV site Github repo was taken down too.
But alongside the ‘break’ theory is one that legal trouble, either actual or simply the fear of it, is what’s underlying the apparent limbo in which the app now sits. That was confirmed this week by the developer, who told one of the app’s subreddit moderators that he’d be lying low, at least for a while.
“I’ve decided to shut down the official website and maybe soon will also shut down the Github repository hosting the apk files in order to avoid any potential legal issues,” he said.
“There has been no cease and desist letters, no lawyers at the door, no seizing of the website. Ad free is not involved. It has been purely a precautionary measure. I want to take a break for a while (maybe a few weeks) first.”
After a short exchange in the summer, Terrarium TV’s developer didn’t return our recent requests for comment but if he had, we’d have certainly asked him about the future development of the app, framed around the Popcorn Time situation.
Despite many legal attacks, the open source nature of Popcorn Time allowed the project to ‘fork’ in several different directions, with various teams continuing development. Terrarium TV, on the other hand, is closed source meaning that when it’s gone, it’s possibly gone for good.
At the moment it’s still available for download from sources listed in the sidebar of its dedicated subreddit but whether the dev will decide to revisit the project again is unclear at this point. If he does not, it seems likely that the system will degrade over time although at the moment it carries out its tasks automatically, which is impressive in itself.
In the meantime, its hundreds of thousands of users will just have to cross everything – and wait.
Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/12/libertarians-are-against-net-neutrality.html
This post claims to be by a libertarian in support of net neutrality. As a libertarian, I need to debunk this. “Net neutrality” is a case of one-hand clapping: you rarely hear the competing side, and thus the argument for it may sound more attractive than it is. This post is about the other side, from a libertarian point of view.
That post just repeats the common, and wrong, left-wing talking points. I mean, there might be a libertarian case for some broadband regulation, but this isn’t it.
This thing they call “net neutrality” is just left-wing politics masquerading as some sort of principle. It’s no different than how people claim to be “pro-choice”, yet demand forced vaccinations. Or, it’s no different than how people claim to believe in “traditional marriage” even while they are on their third “traditional marriage”.
Properly defined, “net neutrality” means no discrimination of network traffic. But nobody wants that. A classic example is how most internet connections have faster download speeds than uploads. This discriminates against upload traffic, harming innovation in upload-centric applications like Dropbox’s cloud backup or BitTorrent’s peer-to-peer file transfer. Yet activists never mention this, or other types of network traffic discrimination, because they no more care about “net neutrality” than Trump or Gingrich care about “traditional marriage”.
Instead, when people say “net neutrality”, they mean “government regulation”. It’s the same old debate between who is the best steward of consumer interest: the free-market or government.
Specifically, in the current debate, they are referring to the Obama-era FCC “Open Internet” order and reclassification of broadband under “Title II” so they can regulate it. Trump’s FCC is putting broadband back to “Title I”, which means the FCC can’t regulate most of its “Open Internet” order.
Don’t be tricked into thinking the “Open Internet” order is anything but intensely political. The premise behind the order is the Democrats’ firm belief that it was government that created the Internet, and that all innovation, advances, and investment ultimately come from the government. It sees ISPs as inherently deceitful entities who will only serve their own interests, at the expense of consumers, unless the FCC protects consumers.
It says so right in the order itself. It starts with the premise that broadband ISPs are evil, using illegitimate “tactics” to hurt consumers, and continues with similar language throughout the order.
A good contrast to this can be seen in Tim Wu’s non-political original paper in 2003 that coined the term “net neutrality”. Whereas the FCC sees broadband ISPs as enemies of consumers, Wu saw them as allies. His concern was not that ISPs would do evil things, but that they would do stupid things, such as favoring short-term interests over long-term innovation (such as having faster downloads than uploads).
The political depravity of the FCC’s order can be seen in this comment from one of the commissioners who voted for those rules:
FCC Commissioner Jessica Rosenworcel wants to increase the minimum broadband standards far past the new 25Mbps download threshold, up to 100Mbps. “We invented the internet. We can do audacious things if we set big goals, and I think our new threshold, frankly, should be 100Mbps. I think anything short of that shortchanges our children, our future, and our new digital economy,” Commissioner Rosenworcel said.
This is indistinguishable from communist rhetoric that credits the Party for everything, as this booklet from North Korea will explain to you.
But what about monopolies? After all, while the free-market may work when there’s competition, it breaks down where there are fewer competitors, oligopolies, and monopolies.
There is some truth to this: in individual cities, there’s often only a single credible high-speed broadband provider. But this isn’t the issue at stake here. The FCC isn’t proposing light-handed regulation to keep monopolies in check, but heavy-handed regulation that regulates every last decision.
Advocates of FCC regulation keep pointing out how broadband monopolies can exploit their rent-seeking positions in order to screw the customer. They keep coming up with ever more bizarre and unlikely scenarios of what monopoly power grants the ISPs.
But they never mention the simplest: that broadband monopolies can just charge customers more money. They imagine instead that these companies will pursue a string of outrageous, evil, and less profitable behaviors to exploit their monopoly position.
The FCC’s reclassification of broadband under Title II gives it full power to regulate ISPs as utilities, including setting prices. The FCC has stepped back from this, promising it won’t go so far as to set prices and that it’s only regulating the evils imagined in these conspiracy theories. This is kind of bizarre: either broadband ISPs are evilly exploiting their monopoly power or they aren’t. Why stop at regulating only half the evil?
The answer is that the claim of “monopoly” power is a deception. It starts with overstating how many monopolies there are to begin with. When it issued its 2015 “Open Internet” order, the FCC simultaneously redefined what it meant by “broadband”, upping the speed from 5 Mbps to 25 Mbps. That’s because while most consumers have multiple choices at 5 Mbps, fewer consumers have multiple choices at 25 Mbps. It’s a dirty political trick to convince you there is more of a problem than there is.
In any case, their rules still apply to the slower broadband providers, and equally apply to the mobile (cell phone) providers. The US has four mobile phone providers (AT&T, Verizon, T-Mobile, and Sprint) and plenty of competition between them. That it’s monopolistic power that the FCC cares about here is a lie. As their Open Internet order clearly shows, the fundamental principle that animates the document is that all corporations, monopolies or not, are treacherous and must be regulated.
“But corporations are indeed evil”, people argue, “see here’s a list of evil things they have done in the past!”
No, those things weren’t evil. They were done because they benefited the customers, not as some sort of secret rent seeking behavior.
For example, one of the more common “net neutrality abuses” that people mention is AT&T’s blocking of FaceTime. I’ve debunked this elsewhere on this blog, but the summary is this: there was no network blocking involved (not a “net neutrality” issue), and the FCC analyzed it and decided it was in the best interests of the consumer. It’s disingenuous to claim it’s an evil that justifies FCC actions when the FCC itself declared it not evil and took no action. It’s disingenuous to cite the “net neutrality” principle that all network traffic must be treated equally when, in fact, the network did treat all the traffic equally.
Another frequently cited abuse is Comcast’s throttling of BitTorrent. Comcast did this because Netflix users were complaining. Like all streaming video, Netflix backs off to a slower speed (and poorer quality) when it experiences congestion. BitTorrent, uniquely among applications, never backs off. As most applications become slower and slower, BitTorrent just speeds up, consuming all available bandwidth. This is especially problematic when there’s limited upload bandwidth available. Thus, Comcast throttled BitTorrent during prime time TV viewing hours when the network was already overloaded by Netflix and other streams. BitTorrent users wouldn’t mind this throttling, because it often took days to download a big file anyway.
When the FCC took action, Comcast stopped the throttling and imposed bandwidth caps instead. This was a worse solution for everyone. It penalized heavy Netflix viewers, and prevented BitTorrent users from large downloads. Even though BitTorrent users were seen as the victims of this throttling, they’d vastly prefer the throttling over the bandwidth caps.
In both the FaceTime and BitTorrent cases, the issue was “network management”. AT&T had no competing video calling service, Comcast had no competing download service. They were only reacting to the fact their networks were overloaded, and did appropriate things to solve the problem.
Mobile carriers still struggle with the “network management” issue. While their networks are fast, they are still of low capacity, and quickly degrade under heavy use. They are looking for tricks in order to reduce usage while giving consumers maximum utility.
The biggest concern is video. It’s problematic because it’s designed to consume as much bandwidth as it can, throttling itself only when it experiences congestion. This is what you probably want when watching Netflix at the highest possible quality, but it’s bad when confronted with mobile bandwidth caps.
With small mobile devices, you don’t want as much quality anyway. You want the video degraded to lower quality, and lower bandwidth, all the time.
That’s the reasoning behind T-Mobile’s offerings. They offer an unlimited video plan in conjunction with the biggest video providers (Netflix, YouTube, etc.). The catch is that when congestion occurs, they’ll throttle it to lower quality. In other words, they give their bandwidth to all the other phones in your area first, then give you as much of the leftover bandwidth as you want for video.
While it sounds like T-Mobile is doing something evil, “zero-rating” certain video providers and degrading video quality, the FCC allows this, because they recognize it’s in the customer interest.
Mobile providers especially have great interest in more innovation in this area, in order to conserve precious bandwidth, but they are finding it costly. They can’t just innovate, but must ask the FCC permission first. And with the new heavy-handed FCC rules, they’ve become hostile to this innovation. This attitude is highlighted by the statement from the “Open Internet” order:
And consumers must be protected, for example from mobile commercial practices masquerading as “reasonable network management.”
This is a clear declaration that the free market doesn’t work and won’t correct abuses, and that mobile companies are treacherous and will do evil things without FCC oversight.
Ignoring the rhetoric for the moment, the debate comes down to simple left-wing authoritarianism and libertarian principles. The Obama administration created a regulatory regime under clear Democrat principles, and the Trump administration is rolling it back to more free-market principles. There is no principle at stake here, certainly nothing to do with a technical definition of “net neutrality”.
The 2015 “Open Internet” order is not about “treating network traffic neutrally”, because it doesn’t do that. Instead, it’s purely a left-wing document that claims corporations cannot be trusted, must be regulated, and that innovation and prosperity comes from the regulators and not the free market.
It’s not about monopolistic power. The primary targets of regulation are the mobile broadband providers, where there is plenty of competition, and who have the most “network management” issues. Even if it were just about wired broadband (like Comcast), it’s still ignoring the primary ways monopolies profit (raising prices) and instead focuses on bizarre and unlikely ways of rent seeking.
If you are a libertarian who nonetheless believes in this “net neutrality” slogan, you’ve got to do better than mindlessly repeating the arguments of the left-wing. The term itself, “net neutrality”, is just a slogan, varying from person to person, from moment to moment. You have to be more specific. If you truly believe in the “net neutrality” technical principle that all traffic should be treated equally, then you’ll want a rewrite of the “Open Internet” order.
In the end, while libertarians may still support some form of broadband regulation, it’s impossible to reconcile libertarianism with the 2015 “Open Internet”, or the vague things people mean by the slogan “net neutrality”.
Security updates have been issued by Arch Linux (cacti, curl, exim, lib32-curl, lib32-libcurl-compat, lib32-libcurl-gnutls, lib32-libxcursor, libcurl-compat, libcurl-gnutls, libofx, libxcursor, procmail, samba, shadowsocks-libev, and thunderbird), Debian (tor), Fedora (kernel, moodle, mupdf, python-sanic, qbittorrent, qpid-cpp, and rb_libtorrent), Mageia (git, lame, memcached, nagios, perl-Catalyst-Plugin-Static-Simple, php-phpmailer, shadowsocks-libev, and varnish), openSUSE (binutils, libressl, lynx, openssl, tor, wireshark, and xen), Red Hat (thunderbird), Scientific Linux (kernel, qemu-kvm, and thunderbird), SUSE (kernel, ncurses, openvpn-openssl1, and xen), and Ubuntu (curl, evince, and firefox).
Post Syndicated from Roderick Bauer original https://www.backblaze.com/blog/managing-digital-photos-and-videos/
NAS + CLOUD GIVEAWAY FROM MORRO DATA AND BACKBLAZE
Backblaze and Morro Data have teamed up to offer a hardware and software package giveaway that combines the best of NAS and the cloud for managing your photos and videos. You’ll find information about how to enter this promotion at the end of this post.
Whether you’re a serious amateur photographer, an Instagram fanatic, or a professional videographer, you’ve encountered the challenge of accessing, organizing, and storing your growing collection of digital photos and videos. The problems are similar for both amateur and professional — they vary chiefly in scale and cost — and the choices for addressing this challenge increase in number and complexity every day.
In this post we’ll be talking about the basics of managing digital photos and videos and trying to define the goals for a good digital asset management system (DAM). There’s a lot to cover, and we can’t get to all of it in one post. We will write more on this topic in future posts.
To start off, what is digital asset management (DAM)? In his book, The DAM Book: Digital Asset Management for Photographers, author Peter Krogh describes DAM as a term that refers to your entire digital photography ecosystem and how you work with it. It comprises the choices you make about every component of your digital photography practice.
Anyone considering how to manage their digital assets will need to consider the following questions:
- How do I like to work, and need to work if I have clients, partners, or others with whom I need to cooperate?
- What are the software and hardware options I need to consider to set up an efficient system that suits my needs?
- How do DAS (direct-attached storage), NAS (network-attached storage), the cloud, and other storage solutions fit into a working system?
- Is there a difference between how and where I back up and archive my files?
- How do I find media files in my collection?
- How do I handle a digital archive that just keeps growing and growing?
- How do I make sure that the methods and system I choose won’t lock me into a closed-end, proprietary system?
Tell us what you’re using for digital media management
Earlier this week we published a post entitled What’s the Best Solution for Managing Digital Photos and Videos? in which we asked our readers to tell us how they manage their media files and what they would like to have in an ideal system. We’ll write a post after the first of the year based on the replies we receive. We encourage you to visit this week’s post and contribute your comments to the conversation.
Getting Started with Digital Asset Management
Whether you have hundreds, thousands, or millions of digital media files, you’re going to need a plan on how to manage them. Let’s start with the goals for what a good digital media management plan should look like.
Goals of a Good Digital Media Management System
- 1) Don’t lose your files
- At the very least, your system should preserve files you wish to keep for future use. A good system will be reliable, support maintaining multiple copies of your data, and will integrate well with your data backup strategy. You should analyze each step of how you handle your cameras, memory cards, disks, and other storage media to understand the points at which your data is most vulnerable and how to minimize the possibility of data loss.
- 2) Find media when you need it
- Your system should enable you to find files when you need them.
- 3) Work economically
- You want a system that meets your budget and doesn’t waste your time.
- 4) Edit or Enhance the images or video
- You’ll want the ability to make changes, change formats, and repurpose your media for different uses.
- 5) Share media in ways you choose
- A good system will help you share your files with clients, friends, and family, giving you choices of different media, formats, and control over access and privacy.
- 6) Doesn’t lock your media into a proprietary system
- Your system shouldn’t lock you into file formats, proprietary protocols, or make it difficult or impossible to get your media out of a particular vendor’s environment. You want a system that uses common and open formats and protocols to maintain the compatibility of your media with as yet unknown hardware and software you might want to use in the future.
Media Storage Options
Photographers and videographers differ in aspects of their workflow, and amateurs and professionals have different needs and options, but there are some common elements that are typically found in a digital media workflow:
- Data is collected in a digital camera
- Data is copied from the camera to a computer, a transport device, or a storage device
- Data is brought into a computer system where original files are typically backed up and copies made for editing and enhancement (depending on type of system)
- Data files are organized into folders, and metadata added or edited to aid in record keeping and finding files in the future
- Files are edited and enhanced, with backups made during the process
- File formats might be changed manually or automatically depending on system
- Versions are created for client review, sharing, posting, publishing, or other uses
- File versions are archived either manually or automatically
- Files await possible future retrieval and use
These days, most of our digital media devices have multiple options for getting the digital media out of the camera. Those options can include Wi-Fi, direct cable connection, or one of a number of types and makes of memory cards. If your digital media device of choice is a smartphone, then you’re used to syncing your recent photos with your computer or a cloud service. If you sync with Apple Photos/iCloud or Google Photos, then one of those services may fulfill just about all your needs for managing your digital media.
If you’re a serious amateur or professional, your solution is more complex. You likely transfer your media from the camera to a computer or storage device (perhaps waiting to erase the memory cards until you’re sure you’ve safely got multiple copies of your files). The computer might already contain your image or video editing tools, or you might use it as a device to get your media back to your home or studio.
If you’ve got a fast internet connection, you might transfer your files to the cloud for safekeeping, to send them to a co-worker so she can start working on them, or to give your client a preview of what you’ve got. The cloud is also useful if you need the media to be accessible from different locations or on various devices.
If you’ve been working for a while, you might have data stored in some older formats such as CD, DVD, DVD-RAM, Zip, Jaz, or other format. Besides the inevitable degradation that occurs with older media, just finding a device to read the data can be a challenge, and it doesn’t get any easier as time passes. If you have data in older formats that you wish to save, you should transfer and preserve that data as soon as possible.
Let’s address the different types of storage devices and approaches.
Direct-attached Storage (DAS)
DAS includes any type of drive that is internal to your computer and connected via the host bus adapter (HBA), and using a common bus protocol such as ATA, SATA, or SCSI; or externally connected to the computer through, for example, USB or Thunderbolt.
Solid-state drives (SSD) are popular these days for their speed and reliability. In a system with different types of drives, it’s best to put your OS, applications, and video files on the fastest drive (typically the SSD), and use the slower drives when speed is not as critical.
A DAS device is directly accessible only from the host to which the DAS is attached, and only when the host is turned on, as the DAS incorporates no networking hardware or environment. Data on DAS can be shared on a network through capabilities provided by the operating system used on the host.
DAS can include a single drive attached via a single cable, multiple drives attached in a series, or multiple drives combined into a virtual unit by hardware and software, an example of which is RAID (Redundant Array of Inexpensive [or Independent] Disks). Storage virtualization such as RAID combines multiple physical disk drive components into one or more logical units for the purposes of data redundancy, performance improvement, or both.
Network-attached Storage (NAS)
A popular option these days is the use of network-attached storage (NAS) for storing working data, backing up data, and sharing data with co-workers. Compared to general purpose servers, NAS can offer several advantages, including faster data access, easier administration, and simple configuration through a web interface.
Users have the choice of a wide number of NAS vendors and storage approaches from vendors such as Morro Data, QNAP, Synology, Drobo, and many more.
NAS uses file-based protocols such as NFS (popular on UNIX systems), SMB/CIFS (Server Message Block/Common Internet File System used with MS Windows systems), AFP (used with Apple Macintosh computers), or NCP (used with OES and Novell NetWare). Multiple protocols are often supported by a single NAS device. NAS devices frequently include RAID or similar capability, providing virtualized storage and often performance improvements.
NAS devices are popular for digital media files due to their large capacities, data protection capabilities, speed, expansion options through adding more and bigger drives, and the ability to share files on a local office or home network or more widely on the internet. NAS devices often include the capability to back up the data on the NAS to another NAS or to the cloud, making them a great hub for a digital media management system.
The cloud is becoming increasingly attractive as a component of a digital asset management system due to a number of inherent advantages:
- Cloud data centers employ redundant technologies to protect the integrity of the stored data
- Data stored in the cloud can be shared, if desired
- Cloud storage is limitless, as opposed to DAS and most NAS implementations
- Cloud storage can be accessed through a wide range of interfaces, and APIs (Application Programming Interfaces), making cloud storage extremely flexible
- Cloud storage supports an extensive ecosystem of add-on hardware, software, and applications to enhance your DAM. Backblaze’s B2 Cloud Storage, for example, has a long list of integrations with media-oriented partners such as Axle video, Cantemo, Cubix, and others
Anyone working with digital media will tell you that the biggest challenge with the cloud is the large amount of data that must be transferred to the cloud, especially if someone already has a large library of media that exists on drives that they want to put into the cloud. Internet access speeds are getting faster, but not fast enough for users like Drew Geraci (known for his incredible time lapse photography and other work, including the opening to Netflix’s House of Cards), who told me he can create one terabyte of data in just five minutes when using nine 8K cameras simultaneously.
While we wait for everyone to get 10 gigabit broadband speeds, there are other options, such as Backblaze’s Fireball, which enables B2 Cloud Storage users to copy up to 40TB of data to a drive and send it directly to Backblaze.
There are technologies available that can accelerate internet TCP/IP speeds and enable faster data transfers to and from cloud storage such as Backblaze B2. We’ll be writing about these technologies in a future post.
A recent entry into the storage space is Morro Data and their CloudNAS solution. Files are stored in the cloud, cached locally on a CloudNAS device as needed, and synced globally among the other CloudNAS systems in a given organization. To the user, all of their files are listed in one catalog, but they could be stored locally or in the cloud. Another advantage is that uploads to the cloud are done behind the scenes as time and network permit. A file stays local until it is safely stored in the B2 Cloud; it is then removed from the CloudNAS device, depending on how often it is accessed. There are more details on the CloudNAS solution in our A New Twist on Data Backup: CloudNAS blog post. (See below for how to enter our Backblaze/Morro Data giveaway.)
Cataloging and Searching Your Media
A key component of any DAM system is the ability to find files when you need them. You’ll want the ability to catalog all of your digital media, assign keywords and metadata that make sense for the way you work, and have that catalog available and searchable even when the digital files themselves are located on various drives, in the cloud, or even disconnected from your current system.
Adobe’s Lightroom is a popular application for cataloging and managing image workflow. Lightroom can handle an enormous number of files, and has a flexible catalog that can be stored locally and used to search for files that have been archived to different storage devices. Users debate whether one master catalog or multiple catalogs are the best way to work in Lightroom. In any case, it’s critical that you back up your DAM catalogs as diligently as you back up your digital media.
The latest version of Lightroom, Lightroom CC (distinguished from Lightroom CC Classic), is coupled with Adobe’s Creative Cloud service. In addition to the subscription plan for Lightroom and other Adobe Suite applications, you’ll need to choose and pay a subscription fee for how much storage you wish to use in Adobe’s Creative Cloud. You don’t get a choice of other cloud vendors.
Another popular option for image editing is Phase One Capture One, with Phase One Media Pro SE for cataloging and management. Macphun’s Luminar is available for both Macintosh and Windows. Macphun has announced that it will launch a digital asset manager component for Luminar in 2018 that will compete with Adobe’s offering for a complete digital image workflow.
Peter Krogh’s book, The DAM Book: Digital Asset Management for Photographers, and his other books on using Lightroom for DAM, outline an approach for creating a folder hierarchy, assigning keywords and metadata, and using collections to manage your photos. You can view a YouTube video on his recommendations at Get Your DAM Workflow Under Control with Peter Krogh.
Working with Your Media
Any media management system needs to include or work seamlessly with the editing and enhancement tools you use for photos or videos. We’ve already talked about some cataloging solutions that include image editing as well. Some of the mainstream photo apps, such as Google Photos and Apple Photos, include rudimentary to mid-level editing tools. It’s up to the more capable applications to deliver the power needed for real photo or video editing, e.g. Adobe Photoshop, Adobe Lightroom, Macphun’s Luminar, and Phase One Capture One for photography, and Adobe Premiere, Apple Final Cut Pro, or Avid Media Composer (among others) for video editing.
Ensuring Future Compatibility for Your Media
Images come out of your camera in a variety of formats. Camera makers have their proprietary raw file formats (CR2 from Canon, NEF from Nikon, for example), and Adobe has a proprietary, but open, standard for digital images called DNG (Digital Negative) that is used in Lightroom and products from other vendors, as well.
Whichever you choose, be aware that you are betting that whichever format you use will be supported years down the road when you go back to your files and want to open a file with whatever will be your future photo/video editing setup. So always think of the future and consider the solution that is most likely to still be supported in future applications.
There are myriad aspects to a digital asset management system, and as we said at the outset, many choices to make. We hope you’ll take us up on our request to tell us what you’re using to manage your photos and videos and what an ideal system for you would look like. We want to make Backblaze Backup and B2 Cloud Storage more useful to our customers, and your input will help us do that.
In the meantime, why not enter the Backblaze + Morro Data Promotion described below. You could win!
ENTER TO WIN A DREAM DIGITAL MEDIA COMBO
Morro Data and Backblaze Team Up to Deliver the Dream Digital Media Backup Solution
Visit Dream Photo Backup to learn about this combination of NAS, software, and the cloud that provides a complete solution for managing, archiving, and accessing your digital media files. You’ll have the opportunity to win Morro Data’s CacheDrive G40 (with 1TB of HDD cache), an annual subscription to CloudNAS Basic Global File Services, and $100 of Backblaze B2 Cloud Storage. The total value of this package is greater than $700. Enter at Dream Photo Backup.
The post An Introduction to Managing Digital Photos and Videos appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.
Security updates have been issued by CentOS (apr and procmail), Debian (curl and xen), Fedora (cacti, git, jbig2dec, lucene4, mupdf, openssh, openssl, quagga, rpm, slurm, webkitgtk4, and xen), Oracle (apr and procmail), Red Hat (apr, java-1.7.1-ibm, java-1.8.0-ibm, procmail, samba4, and tcmu-runner), Scientific Linux (apr, procmail, and samba4), and Ubuntu (curl, openjdk-7, python2.7, and python3.4, python3.5).
Post Syndicated from Andy original https://torrentfreak.com/mashup-site-hit-with-domain-suspension-following-ifpi-copyright-complaint-171127/
There are hundreds of stunning examples online, many created in hobbyist circles, with dedicated communities sharing their often brilliant work.
However, the majority of mashups have something in common – they’re created without any permission from the copyright holders of the original tracks. As such they remain controversial, as mashup platform Sowndhaus has just discovered.
This Canada-based platform allows users to upload, share and network with other like-minded mashup enthusiasts. It has an inbuilt player, somewhat like Soundcloud, through which people can play a wide range of user-created mashups. However, sometime last Tuesday, Sowndhaus’ main domain, Sowndhaus.com, became unreachable.
The site’s operators say that they initially believed there was some kind of configuration issue. Later, however, they discovered that their domain had been “purposefully de-listed” from its DNS servers by its registrar.
“DomainBox had received a DMCA notification from the IFPI (International Federation of the Phonographic Industry) and immediately suspended our .com domain,” Sowndhaus’ operators report.
At this point it’s worth noting that while Sowndhaus is based and hosted in Canada, DomainBox is owned by UK-based Mesh Digital Limited, which is in turn owned by GoDaddy. IFPI, however, reportedly sent a US-focused DMCA notice to the registrar which noted that the music group had “a good faith belief” that activity on Sowndhaus “is not authorized by the copyright owner, its agent, or the law.”
While mashups have always proved controversial, Sowndhaus believe that they operate well within Canadian law.
“We have a good faith belief that the audio files allegedly ‘infringing copyright’ in the DMCA notification are clearly transformative works and meet all criteria for ‘Non-commercial User-generated Content’ under Section 29.21 of the Copyright Act (Canada), and as such are authorized by the law,” the site says.
“Our service, servers, and files are located in Canada which has a ‘Notice and Notice regime’ and where DMCA (a US law) has no jurisdiction. However, the jurisdiction for our .com domain is within the US/EU and thus subject to its laws.”
Despite a belief that the site operates lawfully, Sowndhaus took a decision to not only take down the files listed in IFPI’s complaint but also to ditch its .com domain completely. While this convinced DomainBox to give control of the domain back to the mashup platform, Sowndhaus has now moved to a completely new domain (sowndhaus.audio), to avoid further issues.
“We neither admit nor accept that any unlawful activity or copyright infringement with respect to the DMCA claim had taken place, or has ever been permitted on our servers, or that it was necessary to remove the files or service under Section 29.21 of the Copyright Act (Canada) with which we have always been, and continue to be, in full compliance,” the site notes.
“The use of copyright material as Non-commercial User-generated Content is authorized by law in Canada, where our service resides. We believe that the IFPI are well aware of this, are aware of the jurisdiction of our service, and therefore that their DMCA notification is a misrepresentation of copyright.”
Aside from what appears to have been a rapid suspension of Sowndhaus’ .com domain, the site says that it is being held to a higher standard of copyright protection than others operating under the DMCA.
Unlike YouTube, for example, Sowndhaus says it pro-actively removes files found to infringe copyright. It also bans users who use the site to commit piracy, as per its Terms of Service.
“This is a much stronger regime than would be required under the DMCA guidelines where users generally receive warnings and strikes before being banned, and where websites complying with the DMCA and seeking to avoid legal liability do not actively seek out cases of infringement, leading to some cases of genuine piracy remaining undetected on their services,” the site says.
However, the site remains defiant in respect of the content it hosts, noting that mashups are transformative works that use copyright content “in new and creative ways to form new works of art” and as such are legal for non-commercial purposes.
That hasn’t stopped it from being targeted by copyright holders in the past, however.
This year three music-based organizations (IFPI, RIAA, and France’s SCPP) have sent complaints to Google about the platform, targeting close to 200 URLs. However, at least for more recent complaints, Google hasn’t been removing the URLs from its indexes.
Noting that corporations are using their powers “to hinder, stifle, and silence protected new forms of artistic expression with no repercussions”, Sowndhaus says that it is still prepared to work with copyright holders but wishes they would “reconsider their current policies and accept non-commercial transformative works as legitimate art forms with legal protections and/or exemptions in all jurisdictions.”
While Sowndhaus is now operating from a new domain, the switch is not without its inconveniences. All URLs with links to files on sowndhaus.com are broken but can be fixed by changing the .com to .audio.
DomainBox did not respond to TorrentFreak’s request for comment.
This is a guest post by Rafi Ton, founder and CEO of NUVIAD. NUVIAD is, in their own words, “a mobile marketing platform providing professional marketers, agencies and local businesses state of the art tools to promote their products and services through hyper targeting, big data analytics and advanced machine learning tools.”
At NUVIAD, we’ve been using Amazon Redshift as our main data warehouse solution for more than 3 years.
We store massive amounts of ad transaction data that our users and partners analyze to determine ad campaign strategies. When running real-time bidding (RTB) campaigns in large scale, data freshness is critical so that our users can respond rapidly to changes in campaign performance. We chose Amazon Redshift because of its simplicity, scalability, performance, and ability to load new data in near real time.
Over the past three years, our customer base grew significantly and so did our data. We saw our Amazon Redshift cluster grow from three nodes to 65 nodes. To balance cost and analytics performance, we looked for a way to store large amounts of less-frequently analyzed data at a lower cost. Yet, we still wanted to have the data immediately available for user queries and to meet their expectations for fast performance. We turned to Amazon Redshift Spectrum.
In this post, I explain the reasons why we extended Amazon Redshift with Redshift Spectrum as our modern data warehouse. I cover how our data growth and the need to balance cost and performance led us to adopt Redshift Spectrum. I also share key performance metrics in our environment, and discuss the additional AWS services that provide a scalable and fast environment, with data available for immediate querying by our growing user base.
Amazon Redshift as our foundation
The ability to provide fresh, up-to-the-minute data to our customers and partners was always a main goal with our platform. We saw other solutions provide data that was a few hours old, but this was not good enough for us. We insisted on providing the freshest data possible. For us, that meant loading Amazon Redshift in frequent micro batches and allowing our customers to query Amazon Redshift directly to get results in near real time.
The benefits were immediately evident. Our customers could see how their campaigns performed faster than with other solutions, and react sooner to the ever-changing media supply pricing and availability. They were very happy.
However, this approach required Amazon Redshift to store a lot of data for long periods, and our data grew substantially. At our peak, we maintained a cluster running 65 DC1.large nodes. The impact on our Amazon Redshift cluster was evident, and we saw our CPU utilization grow to 90%.
Why we extended Amazon Redshift to Redshift Spectrum
Redshift Spectrum gives us the ability to run SQL queries using the powerful Amazon Redshift query engine against data stored in Amazon S3, without needing to load the data. With Redshift Spectrum, we store data where we want, at the cost that we want. We have the data available for analytics when our users need it with the performance they expect.
Seamless scalability, high performance, and unlimited concurrency
Scaling Redshift Spectrum is a simple process. First, it allows us to leverage Amazon S3 as the storage engine and get practically unlimited data capacity.
Second, if we need more compute power, we can leverage Redshift Spectrum’s distributed compute engine over thousands of nodes to provide superior performance – perfect for complex queries running against massive amounts of data.
Third, all Redshift Spectrum clusters access the same data catalog so that we don’t have to worry about data migration at all, making scaling effortless and seamless.
Lastly, since Redshift Spectrum distributes queries across potentially thousands of nodes, they are not affected by other queries, providing much more stable performance and unlimited concurrency.
Keeping it SQL
Redshift Spectrum uses the same query engine as Amazon Redshift. This means that we did not need to change our BI tools or query syntax, whether we used complex queries across a single table or joins across multiple tables.
An interesting capability introduced recently is the ability to create a view that spans both Amazon Redshift and Redshift Spectrum external tables. With this feature, you can query frequently accessed data in your Amazon Redshift cluster and less-frequently accessed data in Amazon S3, using a single view.
Leveraging Parquet for higher performance
Parquet is a columnar data format that provides superior performance and allows Redshift Spectrum (or Amazon Athena) to scan significantly less data. With less I/O, queries run faster and we pay less per query. You can read all about Parquet at https://parquet.apache.org/ or https://en.wikipedia.org/wiki/Apache_Parquet.
From a cost perspective, we pay standard rates for our data in Amazon S3, and only small amounts per query to analyze data with Redshift Spectrum. Using the Parquet format, we can significantly reduce the amount of data scanned. Our costs are now lower, and our users get fast results even for large complex queries.
What we learned about Amazon Redshift vs. Redshift Spectrum performance
When we first started looking at Redshift Spectrum, we wanted to put it to the test. We wanted to know how it would compare to Amazon Redshift, so we looked at two key questions:
- What is the performance difference between Amazon Redshift and Redshift Spectrum on simple and complex queries?
- Does the data format impact performance?
During the migration phase, we had our dataset stored in Amazon Redshift and S3 as CSV/GZIP and as Parquet file formats. We tested three configurations:
- Amazon Redshift cluster with 28 DC1.large nodes
- Redshift Spectrum using CSV/GZIP
- Redshift Spectrum using Parquet
We performed benchmarks for simple and complex queries on one month’s worth of data. We tested how much time it took to perform the query, and how consistent the results were when running the same query multiple times. The data we used for the tests was already partitioned by date and hour. Properly partitioning the data improves performance significantly and reduces query times.
First, we tested a simple query aggregating billing data across a month:
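For illustration only, a monthly billing aggregation of this kind, wrapped in a simple seven-run timing loop, might look like the sketch below. The table, columns, and connection details are placeholders, not our production schema.

```python
# Rough sketch: time a simple monthly billing aggregation against the
# Amazon Redshift cluster. Table, columns, and credentials are placeholders.
import time
import psycopg2

conn = psycopg2.connect("host=my-cluster.example.redshift.amazonaws.com "
                        "port=5439 dbname=analytics user=admin password=...")

QUERY = """
    SELECT campaign_id, SUM(billing) AS total_billing
    FROM events
    WHERE date BETWEEN '2017-08-01' AND '2017-08-31'
    GROUP BY campaign_id;
"""

timings = []
with conn.cursor() as cur:
    for _ in range(7):                      # same query, seven runs
        start = time.time()
        cur.execute(QUERY)
        cur.fetchall()
        timings.append(time.time() - start)

print("fastest: %.1fs  slowest: %.1fs" % (min(timings), max(timings)))
```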
We ran the same query seven times and measured the response times (red marking the longest time and green the shortest time):
[Table: execution time in seconds over the seven runs, comparing Amazon Redshift, Redshift Spectrum CSV, and Redshift Spectrum Parquet]
For simple queries, Amazon Redshift performed better than Redshift Spectrum, as we thought, because the data is local to Amazon Redshift.
What was surprising was that using Parquet data format in Redshift Spectrum significantly beat ‘traditional’ Amazon Redshift performance. For our queries, using Parquet data format with Redshift Spectrum delivered an average 40% performance gain over traditional Amazon Redshift. Furthermore, Redshift Spectrum showed high consistency in execution time with a smaller difference between the slowest run and the fastest run.
Comparing the amount of data scanned when using CSV/GZIP and Parquet, the difference was also significant:
[Table: data scanned in GB for Redshift Spectrum with CSV/GZIP vs. Parquet]
Because we pay only for the data scanned by Redshift Spectrum, the cost saving of using Parquet is evident and substantial.
Next, we compared the same three configurations with a complex query.
[Table: execution time in seconds for the complex query on Amazon Redshift, Redshift Spectrum CSV, and Redshift Spectrum Parquet]
This time, Redshift Spectrum using Parquet cut the average query time by 80% compared to traditional Amazon Redshift!
Bottom line: For complex queries, Redshift Spectrum provided a 67% performance gain over Amazon Redshift. Using the Parquet data format, Redshift Spectrum delivered an 80% performance improvement over Amazon Redshift. For us, this was substantial.
Optimizing the data structure for different workloads
Because S3 storage is relatively inexpensive and we pay only for the data scanned by each query, we believe that it makes sense to keep our data in different formats for different workloads and different analytics engines. It is important to note that we can have any number of tables pointing to the same data on S3. It all depends on how we partition the data and update the table partitions.
For example, we have a process that runs every minute and generates statistics for the last minute of data collected. With Amazon Redshift, this would be done by running the query on the table with something as follows:
(Assuming ‘ts’ is your column storing the time stamp for each event.)
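For illustration, such a per-minute job might run something like the sketch below against the local Amazon Redshift table; the table and metric names are placeholders.

```python
# Rough sketch: aggregate the last minute of data on the local Amazon Redshift
# table, assuming 'ts' holds a Unix epoch time stamp. Names are placeholders.
import psycopg2

conn = psycopg2.connect("host=my-cluster.example.redshift.amazonaws.com "
                        "port=5439 dbname=analytics user=admin password=...")

with conn.cursor() as cur:
    cur.execute("""
        SELECT COUNT(*) AS events, SUM(billing) AS billing
        FROM events
        WHERE ts >= EXTRACT(EPOCH FROM GETDATE()) - 60;
    """)
    print(cur.fetchone())
```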
With Redshift Spectrum, we pay for the data scanned in each query. If the data is partitioned by the minute instead of the hour, a query looking at one minute would be 1/60th the cost. If we use a temporary table that points only to the data of the last minute, we save that unnecessary cost.
Creating Parquet data efficiently
On average, we have 800 instances that process our traffic. Each instance sends events that are eventually loaded into Amazon Redshift. When we started three years ago, we would offload data from each server to S3 and then perform a periodic COPY command from S3 to Amazon Redshift.
Recently, Amazon Kinesis Firehose added the capability to offload data directly to Amazon Redshift. While this is now a viable option, we kept the same collection process that worked flawlessly and efficiently for three years.
This changed, however, when we incorporated Redshift Spectrum. With Redshift Spectrum, we needed to find a way to:
- Collect the event data from the instances.
- Save the data in Parquet format.
- Partition the data effectively.
To accomplish this, we save the data as CSV and then transform it to Parquet. The most effective method to generate the Parquet files is to:
- Send the data in one-minute intervals from the instances to Kinesis Firehose with an S3 temporary bucket as the destination.
- Aggregate hourly data and convert it to Parquet using AWS Lambda and AWS Glue.
- Add the Parquet data to S3 by updating the table partitions.
With this new process, we had to give more attention to validating the data before we sent it to Kinesis Firehose, because a single corrupted record in a partition fails queries on that partition.
To store our click data in a table, we considered the following SQL create table command:
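The sketch below gives the general shape of such a statement; the external schema is assumed to already exist, and the column names, bucket, and path are placeholders rather than our production definitions.

```python
# Rough sketch of the external table DDL, executed on the Redshift cluster.
# The 'spectrum' schema is assumed to exist; names and paths are placeholders.
import psycopg2

CREATE_EVENTS = """
CREATE EXTERNAL TABLE spectrum.events (
    event_id    VARCHAR(36),
    campaign_id VARCHAR(36),
    ts          BIGINT,   -- Unix time stamp rather than TIMESTAMP
    billing     FLOAT,    -- float rather than decimal (see below)
    lat         FLOAT,
    lon         FLOAT
)
PARTITIONED BY (date DATE, hour SMALLINT)
STORED AS PARQUET
LOCATION 's3://nuviad-temp/parquet/';
"""

conn = psycopg2.connect("host=my-cluster.example.redshift.amazonaws.com "
                        "port=5439 dbname=analytics user=admin password=...")
conn.autocommit = True   # external DDL cannot run inside a transaction
with conn.cursor() as cur:
    cur.execute(CREATE_EVENTS)
```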
The above statement defines a new external table (all Redshift Spectrum tables are external tables) with a few attributes. We stored ‘ts’ as a Unix time stamp and not as Timestamp, and billing data is stored as float and not decimal (more on that later). We also said that the data is partitioned by date and hour, and then stored as Parquet on S3.
First, we need to get the table definitions. This can be achieved by running the following query:
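One way to pull those definitions, sketched here with placeholder schema and table names, is to read the SVV_EXTERNAL_COLUMNS system view:

```python
# Rough sketch: list the external table's columns and their types.
import psycopg2

conn = psycopg2.connect("host=my-cluster.example.redshift.amazonaws.com "
                        "port=5439 dbname=analytics user=admin password=...")

with conn.cursor() as cur:
    cur.execute("""
        SELECT columnname, external_type
        FROM svv_external_columns
        WHERE schemaname = 'spectrum' AND tablename = 'events'
        ORDER BY columnnum;
    """)
    for name, col_type in cur.fetchall():
        print(name, col_type)
```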
This query lists all the columns in the table with their respective definitions:
Now we can use this data to create a validation schema for our data:
Next, we create a function that uses this schema to validate data:
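Sketched in Python for illustration (our production service runs on Node.js), the schema and a validator over it might look roughly like this; the column list is a placeholder:

```python
# Rough sketch: a schema keyed by column name, derived from the table
# definitions above, and a validator that rejects malformed records before
# they can corrupt a partition. The column list is a placeholder.
EVENT_SCHEMA = {
    "event_id":    str,
    "campaign_id": str,
    "ts":          int,     # Unix time stamp
    "billing":     float,
    "lat":         float,
    "lon":         float,
}

def validate_event(event):
    """Return True only if every field is present with the expected type."""
    for column, expected_type in EVENT_SCHEMA.items():
        if column not in event:
            return False
        value = event[column]
        if expected_type is float and isinstance(value, int):
            continue        # an int is acceptable where a float is expected
        if not isinstance(value, expected_type):
            return False
    return True
```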
Near real-time data loading with Kinesis Firehose
On Kinesis Firehose, we created a new delivery stream to handle the events as follows:
This delivery stream aggregates event data every minute, or up to 100 MB, and writes the data to an S3 bucket as a CSV/GZIP compressed file. Next, after we have the data validated, we can safely send it to our Kinesis Firehose API:
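A sketch of that send step with boto3 follows; the delivery stream name and column order are assumptions, and the record is assumed to have already passed the validation step above.

```python
# Rough sketch: send one validated event to Kinesis Firehose as a CSV line.
# Stream name and column order are assumptions.
import boto3

firehose = boto3.client("firehose")
COLUMNS = ["event_id", "campaign_id", "ts", "billing", "lat", "lon"]

def send_event(event):
    # 'event' is assumed to have passed the validation step sketched earlier.
    line = ",".join(str(event[c]) for c in COLUMNS) + "\n"
    firehose.put_record(
        DeliveryStreamName="nuviad-events",
        Record={"Data": line.encode("utf-8")},
    )
```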
Now, we have a single CSV file representing one minute of event data stored in S3. The files are named automatically by Kinesis Firehose by adding a UTC time prefix in the format YYYY/MM/DD/HH before writing objects to S3. Because we use the date and hour as partitions, we need to change the file naming and location to fit our Redshift Spectrum schema.
Automating data distribution using AWS Lambda
We created a simple Lambda function triggered by an S3 put event that copies the file to a different location (or locations), while renaming it to fit our data structure and processing flow. As mentioned before, the files generated by Kinesis Firehose are structured in a pre-defined hierarchy, such as:
All we need to do is parse the object name and restructure it as we see fit. In our case, we did the following (the event is an object received in the Lambda function with all the data about the object written to S3):
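A minimal sketch of that parsing step (the prefix layout matches the example path above and is an assumption):

    def parse_firehose_key(event):
        # The S3 put event carries the bucket and the object key written by Firehose
        record = event['Records'][0]['s3']
        bucket = record['bucket']['name']
        key = record['object']['key']      # e.g. 'events-temp/2017/08/01/22/<file>.gz'
        _, year, month, day, hour, filename = key.split('/')
        return bucket, key, '{}-{}-{}'.format(year, month, day), hour, filename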
Now, we can redistribute the file to the two destinations we need—one for the minute processing task and the other for hourly aggregation:
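A sketch of the copy step (the destination prefixes are assumptions; adjust them to your own layout):

    import boto3

    s3 = boto3.client('s3')

    def redistribute(bucket, key, date, hour, filename):
        minute_key = 'latest-minute/{}'.format(filename)                # small Spectrum table
        hour_key = 'events-csv/{}/{}/{}'.format(date, hour, filename)   # hourly aggregation
        for dest in (minute_key, hour_key):
            s3.copy_object(Bucket=bucket,
                           CopySource={'Bucket': bucket, 'Key': key},
                           Key=dest)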
Kinesis Firehose stores the data in a temporary folder. We copy the object to another folder that holds the data for the last processed minute. This folder is connected to a small Redshift Spectrum table where the data is being processed without needing to scan a much larger dataset. We also copy the data to a folder that holds the data for the entire hour, to be later aggregated and converted to Parquet.
Because we partition the data by date and hour, we created a new partition on the Redshift Spectrum table if the processed minute is the first minute in the hour (that is, minute 0). We ran the following:
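The values below are illustrative; the table and location follow the earlier sketch:

    ALTER TABLE spectrum.events
    ADD IF NOT EXISTS PARTITION (date='2017-08-01', hour=0)
    LOCATION 's3://example-bucket/events-parquet/2017-08-01/hour=0/';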
After the data is processed and added to the table, we delete the processed data from the temporary Kinesis Firehose storage and from the minute storage folder.
Migrating CSV to Parquet using AWS Glue and Amazon EMR
The simplest way we found to run an hourly job converting our CSV data to Parquet is using Lambda and AWS Glue (and thanks to the awesome AWS Big Data team for their help with this).
Creating AWS Glue jobs
What this simple AWS Glue script does (a minimal sketch of the script appears after the note below):
- Gets parameters for the job, date, and hour to be processed
- Creates a Spark EMR context allowing us to run Spark code
- Reads CSV data into a DataFrame
- Writes the data as Parquet to the destination S3 bucket
- Adds or modifies the Redshift Spectrum / Amazon Athena table partition for the table
Note: Because Redshift Spectrum and Athena both use the AWS Glue Data Catalog, we could use the Athena client to add the partition to the table.
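Here is a hedged sketch of such a script; the bucket names, parameter names, and table name are assumptions, but the structure follows the steps listed above:

    import sys
    import boto3
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    # 1. Job parameters: the date and hour to process (passed in by the triggering Lambda)
    args = getResolvedOptions(sys.argv, ['JOB_NAME', 'day_partition_value', 'hour_partition_value'])
    day = args['day_partition_value']      # e.g. '2017-08-01'
    hour = args['hour_partition_value']    # e.g. '2'

    # 2. Spark context for the Glue job
    glue_context = GlueContext(SparkContext.getOrCreate())
    spark = glue_context.spark_session

    # 3. Read the hour's CSV data into a DataFrame
    #    (cast numeric columns here, e.g. billing to double -- see the float/decimal discussion below)
    df = spark.read.csv('s3://example-bucket/events-csv/{}/{}/'.format(day, hour))

    # 4. Write the data as Parquet to the destination bucket
    parquet_path = 's3://example-bucket/events-parquet/{}/hour={}/'.format(day, hour)
    df.write.mode('overwrite').parquet(parquet_path)

    # 5. Add or update the partition through the Athena client (shared AWS Glue Data Catalog)
    athena = boto3.client('athena')
    athena.start_query_execution(
        QueryString="ALTER TABLE spectrum.events ADD IF NOT EXISTS "
                    "PARTITION (`date`='{}', hour={}) LOCATION '{}'".format(day, hour, parquet_path),
        ResultConfiguration={'OutputLocation': 's3://example-bucket/athena-results/'}
    )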
Here are a few words about float, decimal, and double. Using decimal proved to be more challenging than we expected, as it seems that Redshift Spectrum and Spark use them differently. Whenever we used decimal in Redshift Spectrum and in Spark, we kept getting errors, such as:
S3 Query Exception (Fetch). Task failed due to an internal error. File 'https://s3-external-1.amazonaws.com/nuviad-temp/events/2017-08-01/hour=2/part-00017-48ae5b6b-906e-4875-8cde-bc36c0c6d0ca.c000.snappy.parquet has an incompatible Parquet schema for column 's3://nuviad-events/events.lat'. Column type: DECIMAL(18, 8), Parquet schema:\noptional float lat [i:4 d:1 r:0]\n (https://s3-external-1.amazonaws.com/nuviad-temp/events/2017-08-01/hour=2/part-00017-48ae5b6b-906e-4875-8cde-bc36c0c6d0ca.c000.snappy.parq
We had to experiment with a few floating-point formats until we found that the only combination that worked was to define the column as double in the Spark code and float in Spectrum. This is the reason you see billing defined as float in Spectrum and double in the Spark code.
Creating a Lambda function to trigger conversion
Next, we created a simple Lambda function to trigger the AWS Glue script hourly using a simple Python code:
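A minimal sketch of that function; the argument names must match whatever the Glue script expects (here they follow the earlier sketch):

    import datetime
    import boto3

    glue = boto3.client('glue')

    def lambda_handler(event, context):
        # Process the previous (now complete) hour
        last_hour = datetime.datetime.utcnow() - datetime.timedelta(hours=1)
        response = glue.start_job_run(
            JobName='convertEventsParquetHourly',
            Arguments={
                '--day_partition_value': last_hour.strftime('%Y-%m-%d'),
                '--hour_partition_value': str(last_hour.hour),
            }
        )
        return response['JobRunId']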
Using Amazon CloudWatch Events, we trigger this function hourly. It starts an AWS Glue job named ‘convertEventsParquetHourly’ and runs it for the previous hour, passing the job name and the partition values to process to AWS Glue.
Redshift Spectrum and Node.js
Our development stack is based on Node.js, which is well-suited for high-speed, light servers that need to process a huge number of transactions. However, a few limitations of the Node.js environment required us to create workarounds and use other tools to complete the process.
Node.js and Parquet
The lack of Parquet modules for Node.js required us to implement an AWS Glue/Amazon EMR process to effectively migrate data from CSV to Parquet. We would rather save directly to Parquet, but we couldn’t find an effective way to do it.
One interesting project in the works is the development of a Parquet NPM by Marc Vertes called node-parquet (https://www.npmjs.com/package/node-parquet). It is not in a production state yet, but we think it would be well worth following the progress of this package.
Timestamp data type
Because of this same gap in Parquet support, you cannot store Timestamp correctly in Parquet from Node.js. The solution is to store Timestamp as a string and cast the type to Timestamp in the query. Using this method, we did not witness any performance degradation whatsoever.
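For example (the column and table names are illustrative):

    -- 'event_time' is stored as a string in the Parquet files
    SELECT user_id, CAST(event_time AS timestamp) AS event_ts
    FROM spectrum.events
    WHERE date = '2017-08-01';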
You can benefit from our trial-and-error experience.
Lesson #1: Data validation is critical
As mentioned earlier, a single corrupt entry in a partition can fail queries running against this partition, especially when using Parquet, which is harder to edit than a simple CSV file. Make sure that you validate your data before scanning it with Redshift Spectrum.
Lesson #2: Structure and partition data effectively
One of the biggest benefits of using Redshift Spectrum (or Athena for that matter) is that you don’t need to keep nodes up and running all the time. You pay only for the queries you perform and only for the data scanned per query.
Keeping different permutations of your data for different queries makes a lot of sense in this case. For example, you can partition your data by date and hour to run time-based queries, and also have another set partitioned by user_id and date to run user-based queries. This results in faster and more efficient performance of your data warehouse.
Storing data in the right format
Use Parquet whenever you can. The benefits are substantial: faster performance, less data to scan, and a much more efficient columnar format. However, Parquet is not supported out of the box by Kinesis Firehose, so you need to implement your own ETL. AWS Glue is a great option.
Creating small tables for frequent tasks
When we started using Redshift Spectrum, we saw our Amazon Redshift costs jump by hundreds of dollars per day. Then we realized that we were unnecessarily scanning a full day’s worth of data every minute. Take advantage of the ability to define multiple tables on the same S3 bucket or folder, and create temporary and small tables for frequent queries.
Lesson #3: Combine Athena and Redshift Spectrum for optimal performance
Moving to Redshift Spectrum also allowed us to take advantage of Athena as both use the AWS Glue Data Catalog. Run fast and simple queries using Athena while taking advantage of the advanced Amazon Redshift query engine for complex queries using Redshift Spectrum.
Redshift Spectrum excels when running complex queries. It can push many compute-intensive tasks, such as predicate filtering and aggregation, down to the Redshift Spectrum layer, so that queries use much less of your cluster’s processing capacity.
Lesson #4: Sort your Parquet data within the partition
We achieved another performance improvement by sorting data within the partition using sortWithinPartitions(sort_field). For example:
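In the Spark step of the hourly Glue job, that looks roughly like this (the sort field is an assumption; df and parquet_path continue from the Glue sketch above):

    # Sort rows inside each partition before writing Parquet, so filters on the
    # sort field can skip more row groups at query time.
    sorted_df = df.sortWithinPartitions('user_id')
    sorted_df.write.mode('overwrite').parquet(parquet_path)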
We were extremely pleased with using Amazon Redshift as our core data warehouse for over three years. But as our client base and volume of data grew substantially, we extended Amazon Redshift to take advantage of scalability, performance, and cost with Redshift Spectrum.
Redshift Spectrum lets us scale to virtually unlimited storage, scale compute transparently, and deliver super-fast results for our users. With Redshift Spectrum, we store data where we want at the cost we want, and have the data available for analytics when our users need it with the performance they expect.
About the Author
With 7 years of experience in the AdTech industry and 15 years in leading technology companies, Rafi Ton is the founder and CEO of NUVIAD. He enjoys exploring new technologies and putting them to use in cutting-edge products and services that generate real money in the real world. As an experienced entrepreneur, Rafi believes in practical programming and fast adoption of new technologies to achieve a significant market advantage.
Post Syndicated from Roy Ben-Alta original https://aws.amazon.com/blogs/big-data/aws-big-data-analytics-sessions-at-reinvent-2017/
We can’t believe that there are just a few days left before re:Invent 2017. If you are attending this year, you’ll want to check out our Big Data sessions! The Big Data and Machine Learning categories are bigger than ever. As in previous years, you can find these sessions in various tracks, including Analytics & Big Data, Deep Learning Summit, Artificial Intelligence & Machine Learning, Architecture, and Databases.
We have great sessions from organizations and companies like Vanguard, Cox Automotive, Pinterest, Netflix, FINRA, Amtrak, AmazonFresh, Sysco Foods, Twilio, American Heart Association, Expedia, Esri, Nextdoor, and many more. All sessions are recorded and made available on YouTube. In addition, all slide decks from the sessions will be available on SlideShare.net after the conference.
This post highlights the sessions that will be presented as part of the Analytics & Big Data track, as well as relevant sessions from other tracks like Architecture, Artificial Intelligence & Machine Learning, and IoT. If you’re interested in Machine Learning sessions, don’t forget to check out our Guide to Machine Learning at re:Invent 2017.
This year’s session catalog contains the following breakout sessions.
Raju Gulabani, VP, Database, Analytics and AI at AWS will discuss the evolution of database and analytics services in AWS, the new database and analytics services and features we launched this year, and our vision for continued innovation in this space. We are witnessing an unprecedented growth in the amount of data collected, in many different forms. Storage, management, and analysis of this data require database services that scale and perform in ways not possible before. AWS offers a collection of database and other data services—including Amazon Aurora, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Amazon ElastiCache, Amazon Kinesis, and Amazon EMR—to process, store, manage, and analyze data. In this session, we provide an overview of AWS database and analytics services and discuss how customers are using these services today.
Deep dive customer use cases
ABD401 – How Netflix Monitors Applications in Near Real-Time with Amazon Kinesis
Thousands of services work in concert to deliver millions of hours of video streams to Netflix customers every day. These applications vary in size, function, and technology, but they all make use of the Netflix network to communicate. Understanding the interactions between these services is a daunting challenge both because of the sheer volume of traffic and the dynamic nature of deployments. In this session, we first discuss why Netflix chose Kinesis Streams to address these challenges at scale. We then dive deep into how Netflix uses Kinesis Streams to enrich network traffic logs and identify usage patterns in real time. Lastly, we cover how Netflix uses this system to build comprehensive dependency maps, increase network efficiency, and improve failure resiliency. From this session, you will learn how to build a real-time application monitoring system using network traffic logs and get real-time, actionable insights.
In this session, learn how Nextdoor replaced their home-grown data pipeline based on a topology of Flume nodes with a completely serverless architecture based on Kinesis and Lambda. By making these changes, they improved both the reliability of their data and the delivery times of billions of records of data to their Amazon S3–based data lake and Amazon Redshift cluster. Nextdoor is a private social networking service for neighborhoods.
ABD205 – Taking a Page Out of Ivy Tech’s Book: Using Data for Student Success
Data speaks. Discover how Ivy Tech, the nation’s largest singly accredited community college, uses AWS to gather, analyze, and take action on student behavioral data for the betterment of over 3,100 students. This session outlines the process from inception to implementation across the state of Indiana and highlights how Ivy Tech’s model can be applied to your own complex business problems.
ABD207 – Leveraging AWS to Fight Financial Crime and Protect National Security
Banks aren’t known to share data and collaborate with one another. But that is exactly what the Mid-Sized Bank Coalition of America (MBCA) is doing to fight digital financial crime—and protect national security. Using the AWS Cloud, the MBCA developed a shared data analytics utility that processes terabytes of non-competitive customer account, transaction, and government risk data. The intelligence produced from the data helps banks increase the efficiency of their operations, cut labor and operating costs, and reduce false positive volumes. The collective intelligence also allows greater enforcement of Anti-Money Laundering (AML) regulations by helping members detect internal risks—and identify the challenges to detecting these risks in the first place. This session demonstrates how the AWS Cloud supports the MBCA to deliver advanced data analytics, provide consistent operating models across financial institutions, reduce costs, and strengthen national security.
ABD208 – Cox Automotive Empowered to Scale with Splunk Cloud & AWS and Explores New Innovation with Amazon Kinesis Firehose
In this session, learn how Cox Automotive is using Splunk Cloud for real time visibility into its AWS and hybrid environments to achieve near instantaneous MTTI, reduce auction incidents by 90%, and proactively predict outages. We also introduce a highly anticipated capability that allows you to ingest, transform, and analyze data in real time using Splunk and Amazon Kinesis Firehose to gain valuable insights from your cloud resources. It’s now quicker and easier than ever to gain access to analytics-driven infrastructure monitoring using Splunk Enterprise & Splunk Cloud.
ABD209 – Accelerating the Speed of Innovation with a Data Sciences Data & Analytics Hub at Takeda
Historically, silos of data, analytics, and processes across functions, stages of development, and geography created a barrier to R&D efficiency. Gathering the right data necessary for decision-making was challenging due to issues of accessibility, trust, and timeliness. In this session, learn how Takeda is undergoing a transformation in R&D to increase the speed-to-market of high-impact therapies to improve patient lives. The Data and Analytics Hub was built, with Deloitte, to address these issues and support the efficient generation of data insights for functions such as clinical operations, clinical development, medical affairs, portfolio management, and R&D finance. In the AWS hosted data lake, this data is processed, integrated, and made available to business end users through data visualization interfaces, and to data scientists through direct connectivity. Learn how Takeda has achieved significant time reductions—from weeks to minutes—to gather and provision data that has the potential to reduce cycle times in drug development. The hub also enables more efficient operations and alignment to achieve product goals through cross functional team accountability and collaboration due to the ability to access the same cross domain data.
ABD210 – Modernizing Amtrak: Serverless Solution for Real-Time Data Capabilities
As the nation’s only high-speed intercity passenger rail provider, Amtrak needs to know critical information to run their business such as: Who’s onboard any train at any time? How are booking and revenue trending? Amtrak was faced with unpredictable and often slow response times from existing databases, ranging from seconds to hours; existing booking and revenue dashboards were spreadsheet-based and manual; multiple copies of data were stored in different repositories, lacking integration and consistency; and operations and maintenance (O&M) costs were relatively high. Join us as we demonstrate how Deloitte and Amtrak successfully went live with a cloud-native operational database and analytical datamart for near-real-time reporting in under six months. We highlight the specific challenges and the modernization of architecture on an AWS native Platform as a Service (PaaS) solution. The solution includes cloud-native components such as AWS Lambda for microservices, Amazon Kinesis and AWS Data Pipeline for moving data, Amazon S3 for storage, Amazon DynamoDB for a managed NoSQL database service, and Amazon Redshift for near-real time reports and dashboards. Deloitte’s solution enabled “at scale” processing of 1 million transactions/day and up to 2K transactions/minute. It provided flexibility and scalability, largely eliminated the need for system management, and dramatically reduced operating costs. Moreover, it laid the groundwork for decommissioning legacy systems, anticipated to save at least $1M over 3 years.
ABD211 – Sysco Foods: A Journey from Too Much Data to Curated Insights
In this session, we detail Sysco’s journey from a company focused on hindsight-based reporting to one focused on insights and foresight. For this shift, Sysco moved from multiple data warehouses to an AWS ecosystem, including Amazon Redshift, Amazon EMR, AWS Data Pipeline, and more. As the team at Sysco worked with Tableau, they gained agile insight across their business. Learn how Sysco decided to use AWS, how they scaled, and how they became more strategic with the AWS ecosystem and Tableau.
ABD217 – From Batch to Streaming: How Amazon Flex Uses Real-time Analytics to Deliver Packages on Time
Reducing the time to get actionable insights from data is important to all businesses, and customers who employ batch data analytics tools are exploring the benefits of streaming analytics. Learn best practices to extend your architecture from data warehouses and databases to real-time solutions. Learn how to use Amazon Kinesis to get real-time data insights and integrate them with Amazon Aurora, Amazon RDS, Amazon Redshift, and Amazon S3. The Amazon Flex team describes how they used streaming analytics in their Amazon Flex mobile app, used by Amazon delivery drivers to deliver millions of packages each month on time. They discuss the architecture that enabled the move from a batch processing system to a real-time system, overcoming the challenges of migrating existing batch data to streaming data, and how to benefit from real-time analytics.
ABD218 – How EuroLeague Basketball Uses IoT Analytics to Engage Fans
IoT and big data have made their way out of industrial applications, general automation, and consumer goods, and are now a valuable tool for improving consumer engagement across a number of industries, including media, entertainment, and sports. The low cost and ease of implementation of AWS analytics services and AWS IoT have allowed AGT, a leader in IoT, to develop their IoTA analytics platform. Using IoTA, AGT brought a tailored solution to EuroLeague Basketball for real-time content production and fan engagement during the 2017-18 season. In this session, we take a deep dive into how this solution is architected for secure, scalable, and highly performant data collection from athletes, coaches, and fans. We also talk about how the data is transformed into insights and integrated into a content generation pipeline. Lastly, we demonstrate how this solution can be easily adapted for other industries and applications.
ABD222 – How to Confidently Unleash Data to Meet the Needs of Your Entire Organization
Where are you on the spectrum of IT leaders? Are you confident that you’re providing the technology and solutions that consistently meet or exceed the needs of your internal customers? Do your peers at the executive table see you as an innovative technology leader? Innovative IT leaders understand the value of getting data and analytics directly into the hands of decision makers, and into their own. In this session, Daren Thayne, Domo’s Chief Technology Officer, shares how innovative IT leaders are helping drive a culture change at their organizations. See how transformative it can be to have real-time access to all of the data that is relevant to YOUR job (including a complete view of your entire AWS environment), as well as understand how it can help you lead the way in applying that same pattern throughout your entire company.
ABD303 – Developing an Insights Platform – Sysco’s Journey from Disparate Systems to Data Lake and Beyond
Sysco has nearly 200 operating companies across its multiple lines of business throughout the United States, Canada, Central/South America, and Europe. As the global leader in food services, Sysco identified the need to streamline the collection, transformation, and presentation of data produced by the distributed units and systems, into a central data ecosystem. Sysco’s Business Intelligence and Analytics team addressed these requirements by creating a data lake with scalable analytics and query engines leveraging AWS services. In this session, Sysco will outline their journey from a hindsight reporting focused company to an insights driven organization. They will cover solution architecture, challenges, and lessons learned from deploying a self-service insights platform. They will also walk through the design patterns they used and how they designed the solution to provide predictive analytics using Amazon Redshift Spectrum, Amazon S3, Amazon EMR, AWS Glue, Amazon Elasticsearch Service and other AWS services.
ABD309 – How Twilio Scaled Its Data-Driven Culture
As a leading cloud communications platform, Twilio has always been strongly data-driven. But as headcount and data volumes grew—and grew quickly—they faced many new challenges. One-off, static reports work when you’re a small startup, but how do you support a growth-stage company through a successful IPO and beyond? Today, Twilio’s data team relies on AWS and Looker to provide data access to 700 colleagues. Departments have the data they need to make decisions, and cloud-based scale means they get answers fast. Data delivers real business value at Twilio, providing a 360-degree view of their customer, product, and business. In this session, you hear firsthand stories directly from the Twilio data team and learn real-world tips for fostering a truly data-driven culture at scale.
ABD310 – How FINRA Secures Its Big Data and Data Science Platform on AWS
FINRA uses big data and data science technologies to detect fraud, market manipulation, and insider trading across US capital markets. As a financial regulator, FINRA analyzes highly sensitive data, so information security is critical. Learn how FINRA secures its Amazon S3 Data Lake and its data science platform on Amazon EMR and Amazon Redshift, while empowering data scientists with tools they need to be effective. In addition, FINRA shares AWS security best practices, covering topics such as AMI updates, micro segmentation, encryption, key management, logging, identity and access management, and compliance.
ABD331 – Log Analytics at Expedia Using Amazon Elasticsearch Service
Expedia uses Amazon Elasticsearch Service (Amazon ES) for a variety of mission-critical use cases, ranging from log aggregation to application monitoring and pricing optimization. In this session, the Expedia team reviews how they use Amazon ES and Kibana to analyze and visualize Docker startup logs, AWS CloudTrail data, and application metrics. They share best practices for architecting a scalable, secure log analytics solution using Amazon ES, so you can add new data sources almost effortlessly and get insights quickly
ABD316 – American Heart Association: Finding Cures to Heart Disease Through the Power of Technology
Combining disparate datasets and making them accessible to data scientists and researchers is a prevalent challenge for many organizations, not just in healthcare research. American Heart Association (AHA) has built a data science platform using Amazon EMR, Amazon Elasticsearch Service, and other AWS services, that corrals multiple datasets and enables advanced research on phenotype and genotype datasets, aimed at curing heart diseases. In this session, we present how AHA built this platform and the key challenges they addressed with the solution. We also provide a demo of the platform, and leave you with suggestions and next steps so you can build similar solutions for your use cases
ABD319 – Tooling Up for Efficiency: DIY Solutions @ Netflix
At Netflix, we have traditionally approached cloud efficiency from a human standpoint, whether it be in-person meetings with the largest service teams or manually flipping reservations. Over time, we realized that these manual processes are not scalable as the business continues to grow. Therefore, in the past year, we have focused on building out tools that allow us to make more insightful, data-driven decisions around capacity and efficiency. In this session, we discuss the DIY applications, dashboards, and processes we built to help with capacity and efficiency. We start at the ten thousand foot view to understand the unique business and cloud problems that drove us to create these products, and discuss implementation details, including the challenges encountered along the way. Tools discussed include Picsou, the successor to our AWS billing file cost analyzer; Libra, an easy-to-use reservation conversion application; and cost and efficiency dashboards that relay useful financial context to 50+ engineering teams and managers.
ABD312 – Deep Dive: Migrating Big Data Workloads to AWS
Customers are migrating their analytics, data processing (ETL), and data science workloads running on Apache Hadoop, Spark, and data warehouse appliances from on-premise deployments to AWS in order to save costs, increase availability, and improve performance. AWS offers a broad set of analytics services, including solutions for batch processing, stream processing, machine learning, data workflow orchestration, and data warehousing. This session will focus on identifying the components and workflows in your current environment; and providing the best practices to migrate these workloads to the right AWS data analytics product. We will cover services such as Amazon EMR, Amazon Athena, Amazon Redshift, Amazon Kinesis, and more. We will also feature Vanguard, an American investment management company based in Malvern, Pennsylvania with over $4.4 trillion in assets under management. Ritesh Shah, Sr. Program Manager for Cloud Analytics Program at Vanguard, will describe how they orchestrated their migration to AWS analytics services, including Hadoop and Spark workloads to Amazon EMR. Ritesh will highlight the technical challenges they faced and overcame along the way, as well as share common recommendations and tuning tips to accelerate the time to production.
ABD402 – How Esri Optimizes Massive Image Archives for Analytics in the Cloud
Petabyte-scale archives of satellite, plane, and drone imagery continue to grow exponentially. They mostly exist as semi-structured data, but they are only valuable when accessed and processed by a wide range of products for both visualization and analysis. This session provides an overview of how ArcGIS indexes and structures data so that any part of it can be quickly accessed, processed, and analyzed by reading only the minimum amount of data needed for the task. In this session, we share best practices for structuring and compressing massive datasets in Amazon S3, so it can be analyzed efficiently. We also review a number of different image formats, including GeoTIFF (used for the Public Datasets on AWS program, Landsat on AWS), cloud optimized GeoTIFF, MRF, and CRF as well as different compression approaches to show the effect on processing performance. Finally, we provide examples of how this technology has been used to help image processing and analysis for the response to Hurricane Harvey.
ABD329 – A Look Under the Hood – How Amazon.com Uses AWS Services for Analytics at Massive Scale
Amazon’s consumer business continues to grow, and so does the volume of data and the number and complexity of the analytics done in support of the business. In this session, we talk about how Amazon.com uses AWS technologies to build a scalable environment for data and analytics. We look at how Amazon is evolving the world of data warehousing with a combination of a data lake and parallel, scalable compute engines such as Amazon EMR and Amazon Redshift.
ABD327 – Migrating Your Traditional Data Warehouse to a Modern Data Lake
In this session, we discuss the latest features of Amazon Redshift and Redshift Spectrum, and take a deep dive into its architecture and inner workings. We share many of the recent availability, performance, and management enhancements and how they improve your end user experience. You also hear from 21st Century Fox, who presents a case study of their fast migration from an on-premises data warehouse to Amazon Redshift. Learn how they are expanding their data warehouse to a data lake that encompasses multiple data sources and data formats. This architecture helps them tie together siloed business units and get actionable 360-degree insights across their consumer base.
MCL202 – Ally Bank & Cognizant: Transforming Customer Experience Using Amazon Alexa
Given the increasing popularity of natural language interfaces such as Voice as User technology or conversational artificial intelligence (AI), Ally® Bank was looking to interact with customers by enabling direct transactions through conversation or voice. They also needed to develop a capability that allows third parties to connect to the bank securely for information sharing and exchange, using oAuth, an authentication protocol seen as the future of secure banking technology. Cognizant’s Architecture team partnered with Ally Bank’s Enterprise Architecture group and identified the right product for oAuth integration with Amazon Alexa and third-party technologies. In this session, we discuss how building products with conversational AI helps Ally Bank offer an innovative customer experience; increase retention through improved data-driven personalization; increase the efficiency and convenience of customer service; and gain deep insights into customer needs through data analysis and predictive analytics to offer new products and services.
MCL317 – Orchestrating Machine Learning Training for Netflix Recommendations
At Netflix, we use machine learning (ML) algorithms extensively to recommend relevant titles to our 100+ million members based on their tastes. Everything on the member home page is an evidence-driven, A/B-tested experience that we roll out backed by ML models. These models are trained using Meson, our workflow orchestration system. Meson distinguishes itself from other workflow engines by handling more sophisticated execution graphs, such as loops and parameterized fan-outs. Meson can schedule Spark jobs, Docker containers, bash scripts, gists of Scala code, and more. Meson also provides a rich visual interface for monitoring active workflows and inspecting execution logs. It has a powerful Scala DSL for authoring workflows as well as the REST API. In this session, we focus on how Meson trains recommendation ML models in production, and how we have re-architected it to scale up for a growing need of broad ETL applications within Netflix. As a driver for this change, we have had to evolve the persistence layer for Meson. We talk about how we migrated from Cassandra to Amazon RDS backed by Amazon Aurora
MCL350 – Humans vs. the Machines: How Pinterest Uses Amazon Mechanical Turk’s Worker Community to Improve Machine Learning
Ever since the term “crowdsourcing” was coined in 2006, it’s been a buzzword for technology companies and social institutions. In the technology sector, crowdsourcing is instrumental for verifying machine learning algorithms, which, in turn, improves the user’s experience. In this session, we explore how Pinterest adapted to an increased reliability on human evaluation to improve their product, with a focus on how they’ve integrated with Mechanical Turk’s platform. This presentation is aimed at engineers, analysts, program managers, and product managers who are interested in how companies rely on Mechanical Turk’s human evaluation platform to better understand content and improve machine learning algorithms. The discussion focuses on the analysis and product decisions related to building a high quality crowdsourcing system that takes advantage of Mechanical Turk’s powerful worker community.
Service sessions: architecture and best practices
Featuring Amazon EMR, Amazon Athena, Amazon Kinesis, AWS Glue, Amazon QuickSight, Amazon Redshift, Amazon Elasticsearch Service, and Amazon DynamoDB.
ABD201 – Big Data Architectural Patterns and Best Practices on AWS
In this session, we simplify big data processing as a data bus comprising various stages: collect, store, process, analyze, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide reference architectures, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost
ABD202 – Best Practices for Building Serverless Big Data Applications
Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. In this session, we show you how to incorporate serverless concepts into your big data architectures. We explore the concepts behind and benefits of serverless architectures for big data, looking at design patterns to ingest, store, process, and visualize your data. Along the way, we explain when and how you can use serverless technologies to streamline data processing, minimize infrastructure management, and improve agility and robustness and share a reference architecture using a combination of cloud and open source technologies to solve your big data problems. Topics include: use cases and best practices for serverless big data applications; leveraging AWS technologies such as Amazon DynamoDB, Amazon S3, Amazon Kinesis, AWS Lambda, Amazon Athena, and Amazon EMR; and serverless ETL, event processing, ad hoc analysis, and real-time analytics.
ABD206 – Building Visualizations and Dashboards with Amazon QuickSight
Just as a picture is worth a thousand words, a visual is worth a thousand data points. A key aspect of our ability to gain insights from our data is to look for patterns, and these patterns are often not evident when we simply look at data in tables. The right visualization will help you gain a deeper understanding in a much quicker timeframe. In this session, we will show you how to quickly and easily visualize your data using Amazon QuickSight. We will show you how you can connect to data sources, generate custom metrics and calculations, create comprehensive business dashboards with various chart types, and setup filters and drill downs to slice and dice the data.
ABD203 – Real-Time Streaming Applications on AWS: Use Cases and Patterns
To win in the marketplace and provide differentiated customer experiences, businesses need to be able to use live data in real time to facilitate fast decision making. In this session, you learn common streaming data processing use cases and architectures. First, we give an overview of streaming data and AWS streaming data capabilities. Next, we look at a few customer examples and their real-time streaming applications. Finally, we walk through common architectures and design patterns of top streaming data use cases.
ABD213 – How to Build a Data Lake with AWS Glue Data Catalog
As data volumes grow and customers store more data on AWS, they often have valuable data that is not easily discoverable and available for analytics. The AWS Glue Data Catalog provides a central view of your data lake, making data readily available for analytics. We introduce key features of the AWS Glue Data Catalog and its use cases. Learn how crawlers can automatically discover your data, extract relevant metadata, and add it as table definitions to the AWS Glue Data Catalog. We will also explore the integration between AWS Glue Data Catalog and Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
ABD214 – Real-time User Insights for Mobile and Web Applications with Amazon Pinpoint
With customers demanding relevant and real-time experiences across a range of devices, digital businesses are looking to gather user data at scale, understand this data, and respond to customer needs instantly. This requires tools that can record large volumes of user data in a structured fashion, and then instantly make this data available to generate insights. In this session, we demonstrate how you can use Amazon Pinpoint to capture user data in a structured yet flexible manner. Further, we demonstrate how this data can be set up for instant consumption using services like Amazon Kinesis Firehose and Amazon Redshift. We walk through example data based on real world scenarios, to illustrate how Amazon Pinpoint lets you easily organize millions of events, record them in real-time, and store them for further analysis.
ABD223 – IT Innovators: New Technology for Leveraging Data to Enable Agility, Innovation, and Business Optimization
Companies of all sizes are looking for technology to efficiently leverage data and their existing IT investments to stay competitive and understand where to find new growth. Regardless of where companies are in their data-driven journey, they face greater demands for information by customers, prospects, partners, vendors and employees. All stakeholders inside and outside the organization want information on-demand or in “real time”, available anywhere on any device. They want to use it to optimize business outcomes without having to rely on complex software tools or human gatekeepers to relevant information. Learn how IT innovators at companies such as MasterCard, Jefferson Health, and TELUS are using Domo’s Business Cloud to help their organizations more effectively leverage data at scale.
ABD301 – Analyzing Streaming Data in Real Time with Amazon Kinesis
Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data so you can get timely insights and react quickly to new information. In this session, we present an end-to-end streaming data solution using Kinesis Streams for data ingestion, Kinesis Analytics for real-time processing, and Kinesis Firehose for persistence. We review in detail how to write SQL queries using streaming data and discuss best practices to optimize and monitor your Kinesis Analytics applications. Lastly, we discuss how to estimate the cost of the entire system
ABD302 – Real-Time Data Exploration and Analytics with Amazon Elasticsearch Service and Kibana
In this session, we use Apache web logs as example and show you how to build an end-to-end analytics solution. First, we cover how to configure an Amazon ES cluster and ingest data using Amazon Kinesis Firehose. We look at best practices for choosing instance types, storage options, shard counts, and index rotations based on the throughput of incoming data. Then we demonstrate how to set up a Kibana dashboard and build custom dashboard widgets. Finally, we review approaches for generating custom, ad-hoc reports.
ABD304 – Best Practices for Data Warehousing with Amazon Redshift & Redshift Spectrum
Most companies are over-run with data, yet they lack critical insights to make timely and accurate business decisions. They are missing the opportunity to combine large amounts of new, unstructured big data that resides outside their data warehouse with trusted, structured data inside their data warehouse. In this session, we take an in-depth look at how modern data warehousing blends and analyzes all your data, inside and outside your data warehouse without moving the data, to give you deeper insights to run your business. We will cover best practices on how to design optimal schemas, load data efficiently, and optimize your queries to deliver high throughput and performance.
ABD305 – Design Patterns and Best Practices for Data Analytics with Amazon EMR
Amazon EMR is one of the largest Hadoop operators in the world, enabling customers to run ETL, machine learning, real-time processing, data science, and low-latency SQL at petabyte scale. In this session, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters, and other Amazon EMR architectural best practices. We talk about lowering cost with Auto Scaling and Spot Instances, and security best practices for encryption and fine-grained access control. Finally, we dive into some of our recent launches to keep you current on our latest features.
ABD307 – Deep Analytics for Global AWS Marketing Organization
To meet the needs of the global marketing organization, the AWS marketing analytics team built a scalable platform that allows the data science team to deliver custom econometric and machine learning models for end user self-service. To meet data security standards, we use end-to-end data encryption and different AWS services such as Amazon Redshift, Amazon RDS, Amazon S3, Amazon EMR with Apache Spark and Auto Scaling. In this session, you see real examples of how we have scaled and automated critical analysis, such as calculating the impact of marketing programs like re:Invent and prioritizing leads for our sales teams.
ABD311 – Deploying Business Analytics at Enterprise Scale with Amazon QuickSight
One of the biggest tradeoffs customers usually make when deploying BI solutions at scale is agility versus governance. Large-scale BI implementations with the right governance structure can take months to design and deploy. In this session, learn how you can avoid making this tradeoff using Amazon QuickSight. Learn how to easily deploy Amazon QuickSight to thousands of users using Active Directory and Federated SSO, while securely accessing your data sources in Amazon VPCs or on-premises. We also cover how to control access to your datasets, implement row-level security, create scheduled email reports, and audit access to your data.
ABD315 – Building Serverless ETL Pipelines with AWS Glue
Organizations need to gain insight and knowledge from a growing number of Internet of Things (IoT), APIs, clickstreams, unstructured and log data sources. However, organizations are also often limited by legacy data warehouses and ETL processes that were designed for transactional data. In this session, we introduce key ETL features of AWS Glue, cover common use cases ranging from scheduled nightly data warehouse loads to near real-time, event-driven ETL flows for your data lake. We discuss how to build scalable, efficient, and serverless ETL pipelines using AWS Glue. Additionally, Merck will share how they built an end-to-end ETL pipeline for their application release management system, and launched it in production in less than a week using AWS Glue.
ABD318 – Architecting a data lake with Amazon S3, Amazon Kinesis, and Amazon Athena
Learn how to architect a data lake where different teams within your organization can publish and consume data in a self-service manner. As organizations aim to become more data-driven, data engineering teams have to build architectures that can cater to the needs of diverse users – from developers, to business analysts, to data scientists. Each of these user groups employs different tools, has different data needs, and accesses data in different ways. In this talk, we will dive deep into assembling a data lake using Amazon S3, Amazon Kinesis, Amazon Athena, Amazon EMR, and AWS Glue. The session will feature Mohit Rao, Architect and Integration lead at Atlassian, the maker of products such as JIRA, Confluence, and Stride. First, we will look at a couple of common architectures for building a data lake. Then we will show how Atlassian built a self-service data lake, where any team within the company can publish a dataset to be consumed by a broad set of users.
Companies have valuable data that they may not be analyzing due to the complexity, scalability, and performance issues of loading the data into their data warehouse. However, with the right tools, you can extend your analytics to query data in your data lake—with no loading required. Amazon Redshift Spectrum extends the analytic power of Amazon Redshift beyond data stored in your data warehouse to run SQL queries directly against vast amounts of unstructured data in your Amazon S3 data lake. This gives you the freedom to store your data where you want, in the format you want, and have it available for analytics when you need it. Join a discussion with AWS solution architects to ask questions.
ABD330 – Combining Batch and Stream Processing to Get the Best of Both Worlds
Today, many architects and developers are looking to build solutions that integrate batch and real-time data processing, and deliver the best of both approaches. Lambda architecture (not to be confused with the AWS Lambda service) is a design pattern that leverages both batch and real-time processing within a single solution to meet the latency, accuracy, and throughput requirements of big data use cases. Come join us for a discussion on how to implement Lambda architecture (batch, speed, and serving layers) and best practices for data processing, loading, and performance tuning
ABD335 – Real-Time Anomaly Detection Using Amazon Kinesis
Amazon Kinesis Analytics offers a built-in machine learning algorithm that you can use to easily detect anomalies in your VPC network traffic and improve security monitoring. Join us for an interactive discussion on how to stream your VPC flow Logs to Amazon Kinesis Streams and identify anomalies using Kinesis Analytics.
ABD339 – Deep Dive and Best Practices for Amazon Athena
Amazon Athena is an interactive query service that enables you to process data directly from Amazon S3 without the need for infrastructure. Since its launch at re:Invent 2016, several organizations have adopted Athena as the central tool to process all their data. In this talk, we dive deep into the most common use cases, including working with other AWS services. We review the best practices for creating tables and partitions and performance optimizations. We also dive into how Athena handles security, authorization, and authentication. Lastly, we hear from a customer who has reduced costs and improved time to market by deploying Athena across their organization.
We look forward to meeting you at re:Invent 2017!
About the Author
Roy Ben-Alta is a solution architect and principal business development manager at Amazon Web Services in New York. He focuses on Data Analytics and ML Technologies, working with AWS customers to build innovative data-driven products.
Post Syndicated from Andy original https://torrentfreak.com/google-apple-order-telegram-to-nuke-channel-over-taylor-swift-piracy-171123/
Financed by Pavel Durov, the founder of vKontakte (Russia’s equivalent of Facebook), Telegram is a multi-platform messaging system that has grown from 100,000 daily users in 2013 to an impressive 100 million users in February 2016.
“Telegram is a messaging app with a focus on speed and security, it’s super-fast, simple and free. You can use Telegram on all your devices at the same time — your messages sync seamlessly across any number of your phones, tablets or computers,” the company’s marketing reads.
One of the attractive things about Telegram is that it allows users to communicate with each other using end-to-end encryption. In some cases, these systems are used for content piracy, of music and other smaller files in particular. This is compounded by the presence of user-programmed bots, which are able to search the web for illegal content and present it in a Telegram channel to which other users can subscribe.
While much of this sharing flies under the radar when conducted privately, it periodically attracts attention from copyright holders when it takes place in public channels. That appears to have happened recently when the popular channel “Any Suitable Pop” was completely disabled by Telegram, an apparent first following a copyright complaint.
According to channel creator Anton Vagin, the action by Telegram was probably due to the unauthorized recent sharing of the Taylor Swift album ‘Reputation’. However, it was the route of complaint that proves of most interest.
Rather than receiving a takedown notice directly from Big Machine Records, the label behind Swift’s releases, Telegram was forced into action after receiving threats from Apple and Google, the companies that distribute the Telegram app for iOS and Android respectively.
According to a message Vagin received from Telegram support, Apple and Google had received complaints about Swift’s album from Universal Music, the distributor of Big Machine Records. The suggestion was that if Telegram didn’t delete the infringing channel, distribution of the Telegram app via iTunes and Google Play would be at risk. Vagin received no warning notices from any of the companies involved.
According to Russian news outlet VC.ru, which first reported the news, the channel was blocked in Telegram’s desktop applications, as well as in versions for Android, macOS and iOS. However, the channel still existed on the web and via Windows phone applications but all messages within had been deleted.
The fact that Google played a major role in the disappearing of the channel was subsequently confirmed by Telegram founder Pavel Durov, who commented that it was Google who “ultimately demanded the blocking of this channel.”
That Telegram finally caved in to the demands of Google and/or Apple doesn’t really come as a surprise. In Telegram’s frequently asked questions section, the company specifically mentions the need to comply with copyright takedown demands in order to maintain distribution via the companies’ app marketplaces.
“Our mission is to provide a secure means of communication that works everywhere on the planet. To do this in the places where it is most needed (and to continue distributing Telegram through the App Store and Google Play), we have to process legitimate requests to take down illegal public content (sticker sets, bots, and channels) within the app,” the company notes.
Putting pressure on Telegram via Google and Apple over piracy isn’t a new development. In the past, representatives of the music industry threatened to complain to the companies over a channel operated by torrent site RuTracker, which was set up to share magnet links.
Post Syndicated from Todd Cignetti original https://aws.amazon.com/blogs/security/easier-certificate-validation-using-dns-with-aws-certificate-manager/
Secure Sockets Layer/Transport Layer Security (SSL/TLS) certificates are used to secure network communications and establish the identity of websites over the internet. Before issuing a certificate for your website, Amazon must validate that you control the domain name for your site. You can now use AWS Certificate Manager (ACM) Domain Name System (DNS) validation to establish that you control a domain name when requesting SSL/TLS certificates with ACM. Previously ACM supported only email validation, which required the domain owner to receive an email for each certificate request and validate the information in the request before approving it.
With DNS validation, you write a CNAME record to your DNS configuration to establish control of your domain name. After you have configured the CNAME record, ACM can automatically renew DNS-validated certificates before they expire, as long as the DNS record has not changed. To make it even easier to validate your domain, ACM can update your DNS configuration for you if you manage your DNS records with Amazon Route 53. In this blog post, I demonstrate how to request a certificate for a website by using DNS validation. To perform the equivalent steps using the AWS CLI or AWS APIs and SDKs, see AWS Certificate Manager in the AWS CLI Reference and the ACM API Reference.
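For reference, the same request can also be scripted; a minimal boto3 sketch (the domain name is an example) looks like this:

    import boto3

    acm = boto3.client('acm', region_name='us-east-1')

    # Request a certificate and ask ACM to validate domain ownership through DNS
    response = acm.request_certificate(
        DomainName='www.example.com',
        ValidationMethod='DNS'
    )

    # After a brief delay, the CNAME record to add to your DNS configuration
    # is returned by describe_certificate
    details = acm.describe_certificate(CertificateArn=response['CertificateArn'])
    print(details['Certificate']['DomainValidationOptions'][0].get('ResourceRecord'))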
Requesting an SSL/TLS certificate by using DNS validation
In this section, I walk you through the four steps required to obtain an SSL/TLS certificate through ACM to identify your site over the internet. SSL/TLS provides encryption for sensitive data in transit and authentication by using certificates to establish the identity of your site and secure connections between browsers and applications and your site. DNS validation and SSL/TLS certificates provisioned through ACM are free.
Step 1: Request a certificate
Open the ACM console. If you previously managed certificates in ACM, you will see a table with your certificates and a button to request a new certificate. Choose Request a certificate to request a new certificate.
Type the name of your domain in the Domain name box and choose Next. In this example, I type www.example.com. You must use a domain name that you control. Requesting certificates for domains that you don’t control violates the AWS Service Terms.
Step 2: Select a validation method
With DNS validation, you write a CNAME record to your DNS configuration to establish control of your domain name. Choose DNS validation, and then choose Review.
Step 3: Review your request
Review your request and choose Confirm and request to request the certificate.
Step 4: Submit your request
After a brief delay while ACM populates your domain validation information, choose the down arrow next to your domain to display all the validation information for your domain.
ACM displays the CNAME record you must add to your DNS configuration to validate that you control the domain name in your certificate request. If you use a DNS provider other than Route 53 or if you use a different AWS account to manage DNS records in Route 53, copy the DNS CNAME information from the validation information, or export it to a file (choose Export DNS configuration to a file) and write it to your DNS configuration. For information about how to add or modify DNS records, check with your DNS provider. For more information about using DNS with Route 53 DNS, see the Route 53 documentation.
If you manage DNS records for your domain with Route 53 in the same AWS account, choose Create record in Route 53 to have ACM update your DNS configuration for you.
After updating your DNS configuration, choose Continue to return to the ACM table view.
ACM then displays a table that includes all your certificates. The certificate you requested is displayed so that you can see the status of your request. After you write the DNS record or have ACM write the record for you, it typically takes DNS 30 minutes to propagate the record, and it might take several hours for Amazon to validate it and issue the certificate. During this time, ACM shows the Validation status as Pending validation. After ACM validates the domain name, ACM updates the Validation status to Success. After the certificate is issued, the certificate status is updated to Issued. If ACM cannot validate your DNS record and issue the certificate after 72 hours, the request times out, and ACM displays a Timed out validation status. To recover, you must make a new request. Refer to the Troubleshooting Section of the ACM User Guide for instructions about troubleshooting validation or issuance failures.
You now have an ACM certificate that you can use to secure your application or website. For information about how to deploy certificates with other AWS services, see the documentation for Amazon CloudFront, Amazon API Gateway, Application Load Balancers, and Classic Load Balancers. Note that your certificate must be in the US East (N. Virginia) Region to use the certificate with CloudFront.
ACM automatically renews certificates that are deployed and in use with other AWS services as long as the CNAME record remains in your DNS configuration. To learn more about ACM DNS validation, see the ACM FAQs and the ACM documentation.
Brendan Gregg introduces a
set of BPF-based tracing tools on opensource.com.
“Traditional analysis of filesystem performance focuses on block I/O
statistics—what you commonly see printed by the iostat(1) tool and plotted
by many performance-monitoring GUIs. Those statistics show how the disks
are performing, but not really the filesystem. Often you care more about
the filesystem’s performance than the disks, since it’s the filesystem that
applications make requests to and wait for. And the performance of
filesystems can be quite different from that of disks! Filesystems may
serve reads entirely from memory cache and also populate that cache via a
read-ahead algorithm and for write-back caching. xfsslower shows filesystem
performance—what the applications directly experience.”
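For readers who want to experiment with the same idea in a few lines, here is a minimal bcc-based sketch (not one of Gregg's tools) that counts XFS read requests per process by attaching a kprobe to xfs_file_read_iter. It assumes the bcc Python bindings, a kernel that exports that symbol, and root privileges; the real xfsslower tool additionally measures per-operation latency and filters out fast operations.

```python
from time import sleep
from bcc import BPF  # requires the bcc toolkit and root privileges

# Count XFS read calls per PID; a rough stand-in for what xfsslower observes.
prog = """
BPF_HASH(counts, u32, u64);

int trace_xfs_read(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="xfs_file_read_iter", fn_name="trace_xfs_read")

print("Counting XFS reads for 10 seconds...")
sleep(10)

for pid, count in sorted(b["counts"].items(), key=lambda kv: -kv[1].value):
    print(f"PID {pid.value}: {count.value} reads")
```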
The AWS Knowledge Center helps answer the questions most frequently asked by AWS Support customers. The following 10 Knowledge Center security articles and videos have been the most viewed this month. It’s likely you’ve wondered about a few of these topics yourself, so here’s a chance to learn the answers!
- How do I create an AWS Identity and Access Management (IAM) policy to restrict access for an IAM user, group, or role to a particular Amazon Virtual Private Cloud (VPC)?
Learn how to apply a custom IAM policy to restrict IAM user, group, or role permissions for creating and managing Amazon EC2 instances in a specified VPC.
- How do I use an MFA token to authenticate access to my AWS resources through the AWS CLI?
One IAM best practice is to protect your account and its resources by using a multi-factor authentication (MFA) device. If you plan to use the AWS Command Line Interface (CLI) while using an MFA device, you must create a temporary session token. (A minimal boto3 sketch of this step appears after this list.)
- Can I restrict an IAM user’s EC2 access to specific resources?
This article demonstrates how to link multiple AWS accounts through AWS Organizations and isolate IAM user groups in their own accounts.
- I didn’t receive a validation email for the SSL certificate I requested through AWS Certificate Manager (ACM)—where is it?
Can’t find your ACM validation emails? Be sure to check the email address to which you requested that ACM send validation emails.
- How do I create an IAM policy that has a source IP restriction but still allows users to switch roles in the AWS Management Console?
Learn how to write an IAM policy that not only includes a source IP restriction but also lets your users switch roles in the console.
- How do I allow users from another account to access resources in my account through IAM?
If you have the 12-digit account number and permissions to create and edit IAM roles and users for both accounts, you can permit specific IAM users to access resources in your account.
- What are the differences between a service control policy (SCP) and an IAM policy?
Learn how to distinguish an SCP from an IAM policy.
- How do I share my customer master keys (CMKs) across multiple AWS accounts?
To grant another account access to your CMKs, create an IAM policy on the secondary account that grants access to use your CMKs.
- How do I set up AWS Trusted Advisor notifications?
Learn how to receive free weekly email notifications from Trusted Advisor.
- How do I use AWS Key Management Service (AWS KMS) encryption context to protect the integrity of encrypted data?
Encryption context name-value pairs used with AWS KMS encryption and decryption operations provide a method for checking ciphertext authenticity. Learn how to use encryption context to help protect your encrypted data.
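As a companion to the MFA article above, here is a minimal boto3 sketch of requesting a temporary session token with an MFA device. The MFA device ARN and token code are placeholders; the temporary credentials it prints can then be exported for subsequent CLI calls.

```python
import boto3

sts = boto3.client("sts")

# Placeholders: your MFA device ARN and the current code from the device.
MFA_SERIAL = "arn:aws:iam::123456789012:mfa/my-user"
MFA_CODE = "123456"

resp = sts.get_session_token(
    SerialNumber=MFA_SERIAL,
    TokenCode=MFA_CODE,
    DurationSeconds=3600,
)

creds = resp["Credentials"]
# Export these as AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and
# AWS_SESSION_TOKEN in your shell to use the AWS CLI with the MFA session.
print(creds["AccessKeyId"])
print(creds["SecretAccessKey"])
print(creds["SessionToken"])
```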
The AWS Security Blog will publish an updated version of this list regularly going forward. You also can subscribe to the AWS Knowledge Center Videos playlist on YouTube.
Security updates have been issued by Arch Linux (roundcubemail), Debian (optipng, samba, and vlc), Fedora (compat-openssl10, fedpkg, git, jbig2dec, ldns, memcached, openssl, perl-Net-Ping-External, python-copr, python-XStatic-jquery-ui, rpkg, thunderbird, and xen), SUSE (tomcat), and Ubuntu (db, db4.8, db5.3, linux, linux-raspi2, linux-aws, linux-azure, linux-gcp, and samba).
RDPY is an RDP security tool written in Twisted Python, with a Man-in-the-Middle proxy that can record sessions, as well as honeypot functionality.
RDPY is a pure Python implementation of the Microsoft RDP (Remote Desktop Protocol) protocol, covering both the client and server sides. It is built on the event-driven network engine Twisted and supports the standard RDP security layer, RDP over SSL, and NLA authentication (through the NTLMv2 authentication protocol).
RDPY RDP Security Tool Features
RDPY provides the following RDP and VNC binaries:
- RDP Man-in-the-Middle proxy which records sessions
- RDP Honeypot
- RDP Screenshoter
- RDP Client
- VNC Client
- VNC Screenshoter
- RSS Player
RDPY is fully implemented in Python, except for the bitmap decompression algorithm, which is implemented in C for performance reasons.
Data security is paramount in many industries. Organizations that shift their IT infrastructure to the cloud must ensure that their data is protected and that the attack surface is minimized. This post focuses on a method of securely loading a subset of data from one Amazon Redshift cluster to another Amazon Redshift cluster that is located in a different AWS account. You can accomplish this by dynamically controlling the security group ingress rules that are attached to the clusters.
Disclaimer: This solution is just one component of creating a secure infrastructure for loading and migrating data. It should not be thought of as an all-encompassing security solution. This post doesn’t touch on other important security topics, such as connecting to Amazon Redshift using Secure Sockets Layer (SSL) and encrypting data at rest.
The case for creating a segregated data loading account
From a security perspective, it is easier to restrict access to sensitive infrastructure if the respective stages (dev, QA, staging, and prod) are each located in their own isolated AWS account. Another common method for isolating resources is to set up separate virtual private clouds (VPCs) for each stage, all within a single AWS account. Because many services live outside the VPC (for example, Amazon S3, Amazon DynamoDB, and Amazon Kinesis), it requires careful thought to isolate the resources that should be associated with dev, QA, staging, and prod.
The segregated account model setup does create more overhead. But it gives administrators more control without them having to create tags and use cumbersome naming conventions to define a logical stage. In the segregated account model, all the data and infrastructure that are located in an account belong to that particular stage of the release pipeline (dev, QA, staging, or prod).
But where should you put infrastructure that does not belong to one particular stage?
Infrastructure to support deployments or to load data across accounts is best located in another segregated account. By deploying infrastructure or loading data from a separate account, you can’t depend on any existing roles, VPCs, subnets, etc. Any information that is necessary to deploy your infrastructure or load the data must be captured up front. This allows you to perform repeatable processes in a predictable and secure manner. With the recent addition of the StackSets feature in AWS CloudFormation, you can provision and manage infrastructure in multiple AWS accounts and Regions from a single template. This four-part blog series discusses different ways of automating the creation of cross-account roles and capturing account-specific information.
Loading OpenFDA data into Amazon Redshift
Before you get started with loading data from one Amazon Redshift cluster to another, you first need to create an Amazon Redshift cluster and load some data into it. You can use the following AWS CloudFormation template to create an Amazon Redshift cluster. You need to create Amazon Redshift clusters in both the source and target accounts.
After you create your Amazon Redshift clusters, you can go ahead and load some data into the cluster that is located in your source account. One of the great benefits of AWS is the ability to host and share public datasets on Amazon S3. When you test different architectures, these datasets serve as useful resources to get up and running without a lot of effort. For this post, we use the OpenFDA food enforcement dataset because it is a relatively small file and is easy to work with.
In the source account, you need to spin up an Amazon EMR cluster with Apache Spark so that you can unzip the file and format it properly before loading it into Amazon Redshift. The following AWS CloudFormation template provides the EMR cluster that you need.
Note: As an alternative, you can load the data using AWS Glue, which now supports Scala.
To be able to run this Scala code, you must install the spark-redshift driver on your EMR cluster.
Connect to your source Amazon Redshift cluster in your source account, and verify that the data is present by running a quick row count query.
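The exact query depends on the table you created when loading the OpenFDA data. The following is a hedged sketch using psycopg2, with placeholder connection details and a hypothetical table name (food_enforcement) standing in for whatever table your load step produced.

```python
import psycopg2

# Placeholder connection details and a hypothetical table name.
conn = psycopg2.connect(
    host="source-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="YourPassword1",
)

with conn, conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM food_enforcement;")
    print("Rows loaded:", cur.fetchone()[0])
conn.close()
```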
Opening up the security groups
Now that the data has been loaded in the source Amazon Redshift cluster, it can be moved over to the target Amazon Redshift cluster. Because the security groups that are associated with the two clusters are very restrictive, there is no way to load the data from the centralized data loading AWS account without modifying the ingress rules on both security groups. Here are a few possible options:
- Add an ingress rule to allow all traffic to port 5439 (the default Amazon Redshift port).
- This option is not recommended because you are widening your attack surface significantly and exposing yourself to a potential attack.
- Peer the VPC in the data loader account to the source and target Amazon Redshift VPCs, and modify the ingress rule to allow all traffic from the private IP range of the data loader VPC.
- This solution is reasonably secure but does require some manual setup. Because the ingress rules in the source and target Amazon Redshift clusters allow access from the VPC private IP range, any resources in the data loader account can access both clusters, which is suboptimal.
- Leave long-running Amazon EC2 instances or EMR clusters in the data loader AWS account and manually create specific ingress rules in the source and target Amazon Redshift security groups to allow for those specific IPs.
- This option creates a lot of wasted cost because it requires leaving EC2 instances or an EMR cluster running indefinitely whether or not they are actually being used.
None of these three options is ideal, so let’s explore another option. One of the more powerful features of running EC2 instances in the cloud is the ability to dynamically manage and configure your environment using instance metadata. The AWS Cloud is dynamic by nature and incentivizes you to reduce costs by terminating instances when they are not being used. Therefore, instance metadata can serve as the glue to performing repeatable processes to these dynamic instances.
To load the data from the source Amazon Redshift cluster to the target Amazon Redshift cluster, perform the following steps (a minimal boto3 sketch of steps 2, 3, and 5 follows this list):
- Spin up an EC2 instance in the data loader account.
- Use instance metadata to look up the IP of the EC2 instance.
- Assume roles in the source and target accounts using the AWS Security Token Service (AWS STS), and create ingress rules for the IP that was retrieved from instance metadata.
- Run a simple Python or Java program to perform a simple transformation and unload the data from the source Amazon Redshift cluster. Then load the results into the target Amazon Redshift cluster.
- Assume roles in the source and target accounts using AWS STS, and remove the ingress rules that were created in step 3.
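The following is a minimal boto3 sketch of steps 2, 3, and 5, under a few assumptions: the instance reads IMDSv1-style metadata (IMDSv2 would additionally need a session token), the cross-account role ARNs and security group IDs are placeholders, and error handling is omitted. The unload/load work in step 4 is not shown.

```python
import urllib.request
import boto3

REDSHIFT_PORT = 5439

# Placeholders for the cross-account roles and the clusters' security groups.
TARGETS = [
    {"role_arn": "arn:aws:iam::111111111111:role/DataLoaderIngress", "sg_id": "sg-0aaa111"},  # source account
    {"role_arn": "arn:aws:iam::222222222222:role/DataLoaderIngress", "sg_id": "sg-0bbb222"},  # target account
]

def loader_public_ip():
    # Step 2: look up this EC2 instance's public IP from instance metadata (IMDSv1).
    url = "http://169.254.169.254/latest/meta-data/public-ipv4"
    return urllib.request.urlopen(url, timeout=2).read().decode()

def ec2_client_for(role_arn):
    # Assume the cross-account role with STS and build an EC2 client from its credentials.
    creds = boto3.client("sts").assume_role(
        RoleArn=role_arn, RoleSessionName="data-loader"
    )["Credentials"]
    return boto3.client(
        "ec2",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )

def set_ingress(open_rule):
    cidr = loader_public_ip() + "/32"
    permission = [{
        "IpProtocol": "tcp",
        "FromPort": REDSHIFT_PORT,
        "ToPort": REDSHIFT_PORT,
        "IpRanges": [{"CidrIp": cidr}],
    }]
    for target in TARGETS:
        ec2 = ec2_client_for(target["role_arn"])
        if open_rule:   # step 3: open both clusters to this single IP
            ec2.authorize_security_group_ingress(GroupId=target["sg_id"], IpPermissions=permission)
        else:           # step 5: close them again once the load completes
            ec2.revoke_security_group_ingress(GroupId=target["sg_id"], IpPermissions=permission)

set_ingress(open_rule=True)
# ... step 4: unload from the source cluster and load into the target cluster ...
set_ingress(open_rule=False)
```

Because the ingress rules are scoped to a single /32 and removed as soon as the load finishes, the clusters are exposed only for the duration of the job.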
Once step 5 is completed, you should see that the security groups for both Amazon Redshift clusters don’t allow traffic from any IP. Manually add your IP as an ingress rule to the target Amazon Redshift cluster’s security group on port 5439. When you run the same row count query against the target cluster, you should see that the data has been populated within the target Amazon Redshift cluster.
This post highlighted the importance of loading data in a secure manner across accounts. It mentioned reasons why you might want to provision infrastructure and load data from a centralized account. Several candidate solutions were discussed. Ultimately, the solution that we chose involved opening up security groups for a single IP and then closing them back up after the data was loaded. This solution minimizes the attack surface to a single IP and can be completely automated.
About the Author
Ryan Hood is a Data Engineer for AWS. He works on big data projects leveraging the newest AWS offerings. In his spare time, he attempts to sous-vide anything he can find in his refrigerator.
Post Syndicated from The HiveMQ Team original https://www.hivemq.com/blog/hivemq-3-2-8-released/
The HiveMQ team is pleased to announce the availability of HiveMQ 3.2.8. This is a maintenance release for the 3.2 series and brings the following improvements:
- Improved performance for payload disk persistence
- Improved performance for subscription disk persistence
- Improved exception handling in OnSubscribeCallback when an Exception is not caught by a plugin
- Fixed an issue where the metric for discarded messages “QoS 0 Queue not empty” was increased when a client is offline
- Fixed an issue where the convenience methods for an SslCertificate might return null for certain extensions
- Fixed an issue which could lead to the OnPubackReceivedCallback being executed when the inflight queue is full
- Fixed an issue where a scheduled background cleanup job could cause an error in the logs
- Fixed an issue which could lead to an IllegalArgumentException when sending a QoS 0 message in a rare edge-case
- Fixed an issue where an error “Exception while handling batched publish request” was logged without reason
You can download the new HiveMQ version here.
We recommend upgrading if you are a HiveMQ 3.2.x user.
Have a great day,
The HiveMQ Team
Contributed by Otavio Ferreira, Manager, Software Development, AWS Messaging
Like other developers around the world, you may be tackling increasingly complex business problems. A key success factor, in that case, is the ability to break down a large project scope into smaller, more manageable components. A service-oriented architecture guides you toward designing systems as a collection of loosely coupled, independently scaled, and highly reusable services. Microservices take this even further. To improve performance and scalability, they promote fine-grained interfaces and lightweight protocols.
However, the communication among isolated microservices can be challenging. Services are often deployed onto independent servers and don’t share any compute or storage resources. Also, you should avoid hard dependencies among microservices, to preserve maintainability and reusability.
If you apply the pub/sub design pattern, you can effortlessly decouple and independently scale out your microservices and serverless architectures. A pub/sub messaging service, such as Amazon SNS, promotes event-driven computing that statically decouples event publishers from subscribers, while dynamically allowing for the exchange of messages between them. An event-driven architecture also introduces the responsiveness needed to deal with complex problems, which are often unpredictable and asynchronous.
What is event-driven computing?
Given the context of microservices, event-driven computing is a model in which subscriber services automatically perform work in response to events triggered by publisher services. This paradigm can be applied to automate workflows while decoupling the services that collectively and independently work to fulfil these workflows. Amazon SNS is an event-driven computing hub, in the AWS Cloud, that has native integration with several AWS publisher and subscriber services.
Which AWS services publish events to SNS natively?
Several AWS services have been integrated as SNS publishers and, therefore, can natively trigger event-driven computing for a variety of use cases. In this post, I specifically cover AWS compute, storage, database, and networking services, as depicted below.
- Auto Scaling: Helps you ensure that you have the correct number of Amazon EC2 instances available to handle the load for your application. You can configure Auto Scaling lifecycle hooks to trigger events, as Auto Scaling resizes your EC2 cluster. As an example, you may want to warm up the local cache store on newly launched EC2 instances, and also download log files from other EC2 instances that are about to be terminated. To make this happen, set an SNS topic as your Auto Scaling group’s notification target, then subscribe two Lambda functions to this SNS topic. The first function is responsible for handling scale-out events (to warm up cache upon provisioning), whereas the second is in charge of handling scale-in events (to download logs upon termination). A hedged Lambda sketch of this pattern appears after this list.
- AWS Elastic Beanstalk: An easy-to-use service for deploying and scaling web applications and web services developed in a number of programming languages. You can configure event notifications for your Elastic Beanstalk environment so that notable events can be automatically published to an SNS topic, then pushed to topic subscribers. As an example, you may use this event-driven architecture to coordinate your continuous integration pipeline (such as Jenkins CI). That way, whenever an environment is created, Elastic Beanstalk publishes this event to an SNS topic, which triggers a subscribing Lambda function, which then kicks off a CI job against your newly created Elastic Beanstalk environment.
- Elastic Load Balancing: Automatically distributes incoming application traffic across Amazon EC2 instances, containers, or other resources identified by IP addresses. You can configure CloudWatch alarms on Elastic Load Balancing metrics, to automate the handling of events derived from Classic Load Balancers. As an example, you may leverage this event-driven design to automate latency profiling in an Amazon ECS cluster behind a Classic Load Balancer. In this example, whenever your ECS cluster breaches your load balancer latency threshold, an event is posted by CloudWatch to an SNS topic, which then triggers a subscribing Lambda function. This function runs a task on your ECS cluster to trigger a latency profiling tool, hosted on the cluster itself. This can enhance your latency troubleshooting exercise by making it timely.
- Amazon S3: Object storage built to store and retrieve any amount of data. You can enable S3 event notifications, and automatically get them posted to SNS topics, to automate a variety of workflows. For instance, imagine that you have an S3 bucket to store incoming resumes from candidates, and a fleet of EC2 instances to encode these resumes from their original format (such as Word or text) into a portable format (such as PDF). In this example, whenever new files are uploaded to your input bucket, S3 publishes these events to an SNS topic, which in turn pushes these messages into subscribing SQS queues. Then, encoding workers running on EC2 instances poll these messages from the SQS queues; retrieve the original files from the input S3 bucket; encode them into PDF; and finally store them in an output S3 bucket.
- Configuring Amazon S3 Event Notifications
- Configuring Amazon S3 Buckets for Amazon SNS Notifications (Walkthrough)
- Messaging Fan-out Pattern for Serverless Architectures Using Amazon SNS (Multimedia Encoding Example)
- Amazon EFS: Provides simple and scalable file storage, for use with Amazon EC2 instances, in the AWS Cloud. You can configure CloudWatch alarms on EFS metrics, to automate the management of your EFS systems. For example, consider a highly parallelized genomics analysis application that runs against an EFS system. By default, this file system is instantiated on the “General Purpose” performance mode. Although this performance mode allows for lower latency, it might eventually impose a scaling bottleneck. Therefore, you may leverage an event-driven design to handle it automatically. Basically, as soon as the EFS metric “Percent I/O Limit” breaches 95%, CloudWatch could post this event to an SNS topic, which in turn would push this message into a subscribing Lambda function. This function automatically creates a new file system, this time on the “Max I/O” performance mode, then switches the genomics analysis application to this new file system. As a result, your application starts experiencing higher I/O throughput rates.
- Amazon Glacier: A secure, durable, and low-cost cloud storage service for data archiving and long-term backup. You can set a notification configuration on an Amazon Glacier vault so that when a job completes, a message is published to an SNS topic. Retrieving an archive from Amazon Glacier is a two-step asynchronous operation, in which you first initiate a job, and then download the output after the job completes. Therefore, SNS helps you eliminate polling your Amazon Glacier vault to check whether your job has been completed, or not. As usual, you may subscribe SQS queues, Lambda functions, and HTTP endpoints to your SNS topic, to be notified when your Amazon Glacier job is done.
- AWS Snowball: A petabyte-scale data transport solution that uses secure appliances to transfer large amounts of data. You can leverage Snowball notifications to automate workflows related to importing data into and exporting data from AWS. More specifically, whenever your Snowball job status changes, Snowball can publish this event to an SNS topic, which in turn can broadcast the event to all its subscribers. As an example, imagine a Geographic Information System (GIS) that distributes high-resolution satellite images to users via Web browser. In this example, the GIS vendor could capture up to 80 TB of satellite images; create a Snowball job to import these files from an on-premises system to an S3 bucket; and provide an SNS topic ARN to be notified upon job status changes in Snowball. After Snowball changes the job status from “Importing” to “Completed”, Snowball publishes this event to the specified SNS topic, which delivers this message to a subscribing Lambda function, which finally creates a CloudFront web distribution for the target S3 bucket, to serve the images to end users.
- Amazon RDS: Makes it easy to set up, operate, and scale a relational database in the cloud. RDS leverages SNS to broadcast notifications when RDS events occur. As usual, these notifications can be delivered via any protocol supported by SNS, including SQS queues, Lambda functions, and HTTP endpoints. As an example, imagine that you own a social network website that has experienced organic growth, and needs to scale its compute and database resources on demand. In this case, you could provide an SNS topic to listen to RDS DB instance events. When the “Low Storage” event is published to the topic, SNS pushes this event to a subscribing Lambda function, which in turn leverages the RDS API to increase the storage capacity allocated to your DB instance. The provisioning itself takes place within the specified DB maintenance window.
- Amazon ElastiCache: A web service that makes it easy to deploy, operate, and scale an in-memory data store or cache in the cloud. ElastiCache can publish messages using Amazon SNS when significant events happen on your cache cluster. This feature can be used to refresh the list of servers on client machines connected to individual cache node endpoints of a cache cluster. For instance, an ecommerce website fetches product details from a cache cluster, with the goal of offloading a relational database and speeding up page load times. Ideally, you want to make sure that each web server always has an updated list of cache servers to which to connect. To automate this node discovery process, you can get your ElastiCache cluster to publish events to an SNS topic. Thus, when ElastiCache event “AddCacheNodeComplete” is published, your topic then pushes this event to all subscribing HTTP endpoints that serve your ecommerce website, so that these HTTP servers can update their list of cache nodes.
- Amazon Redshift: A fully managed data warehouse that makes it simple to analyze data using standard SQL and BI (Business Intelligence) tools. Amazon Redshift uses SNS to broadcast relevant events so that data warehouse workflows can be automated. As an example, imagine a news website that sends clickstream data to a Kinesis Firehose stream, which then loads the data into Amazon Redshift, so that popular news and reading preferences might be surfaced on a BI tool. At some point though, this Amazon Redshift cluster might need to be resized, and the cluster enters a read-only mode. Hence, this Amazon Redshift event is published to an SNS topic, which delivers this event to a subscribing Lambda function, which finally deletes the corresponding Kinesis Firehose delivery stream, so that clickstream data uploads can be put on hold. At a later point, after Amazon Redshift publishes the event that the maintenance window has been closed, SNS notifies a subscribing Lambda function accordingly, so that this function can re-create the Kinesis Firehose delivery stream, and resume clickstream data uploads to Amazon Redshift.
- AWS DMS: Helps you migrate databases to AWS quickly and securely. The source database remains fully operational during the migration, minimizing downtime to applications that rely on the database. DMS also uses SNS to provide notifications when DMS events occur, which can automate database migration workflows. As an example, you might create data replication tasks to migrate an on-premises MS SQL database, composed of multiple tables, to MySQL. Thus, if replication tasks fail due to incompatible data encoding in the source tables, these events can be published to an SNS topic, which can push these messages into a subscribing SQS queue. Then, encoders running on EC2 can poll these messages from the SQS queue, encode the source tables into a compatible character set, and restart the corresponding replication tasks in DMS. This is an event-driven approach to a self-healing database migration process.
- Amazon Route 53: A highly available and scalable cloud-based DNS (Domain Name System). Route 53 health checks monitor the health and performance of your web applications, web servers, and other resources. You can set CloudWatch alarms and get automated Amazon SNS notifications when the status of your Route 53 health check changes. As an example, imagine an online payment gateway that reports the health of its platform to merchants worldwide, via a status page. This page is hosted on EC2 and fetches platform health data from DynamoDB. In this case, you could configure a CloudWatch alarm for your Route 53 health check, so that when the alarm threshold is breached, and the payment gateway is no longer considered healthy, then CloudWatch publishes this event to an SNS topic, which pushes this message to a subscribing Lambda function, which finally updates the DynamoDB table that populates the status page. This event-driven approach avoids any kind of manual update to the status page visited by merchants.
- AWS Direct Connect (AWS DX): Makes it easy to establish a dedicated network connection from your premises to AWS, which can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections. You can monitor physical DX connections using CloudWatch alarms, and send SNS messages when alarms change their status. As an example, when a DX connection state shifts to 0 (zero), indicating that the connection is down, this event can be published to an SNS topic, which can fan out this message to impacted servers through HTTP endpoints, so that they might reroute their traffic through a different connection instead. This is an event-driven approach to connectivity resilience.
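As promised in the Auto Scaling item above, here is a hedged Lambda sketch of that lifecycle-hook pattern. It assumes the standard SNS-to-Lambda event shape and the documented lifecycle transition names; the cache-warming and log-download steps are placeholders, and for brevity a single function handles both transitions, whereas the post describes subscribing two separate functions.

```python
import json

def handler(event, context):
    """Subscribed to the Auto Scaling group's SNS notification topic."""
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        transition = message.get("LifecycleTransition")
        instance_id = message.get("EC2InstanceId")

        if transition == "autoscaling:EC2_INSTANCE_LAUNCHING":
            warm_cache(instance_id)       # placeholder: warm the local cache store
        elif transition == "autoscaling:EC2_INSTANCE_TERMINATING":
            download_logs(instance_id)    # placeholder: pull log files before termination

def warm_cache(instance_id):
    print(f"Warming cache on {instance_id}")

def download_logs(instance_id):
    print(f"Downloading logs from {instance_id}")
```

With an actual lifecycle hook in place, the function would also call the CompleteLifecycleAction API so that Auto Scaling can proceed with the launch or termination once the work is done.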
In addition to SNS, event-driven computing is also addressed by Amazon CloudWatch Events, which delivers a near real-time stream of system events that describe changes in AWS resources. With CloudWatch Events, you can route each event type to one or more targets, including:
- SNS topics
- Amazon SQS queues
- Amazon EC2 instances
- Amazon ECS tasks
- Amazon Kinesis Streams
- Amazon Kinesis Firehose delivery streams
- AWS Lambda functions
- AWS Step Functions state machines
- AWS CodePipeline pipelines
Many AWS services publish events to CloudWatch. As an example, you can get CloudWatch Events to capture events on your ETL (Extract, Transform, Load) jobs running on AWS Glue and push failed ones to an SQS queue, so that you can retry them later.
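A hedged boto3 sketch of that Glue-to-SQS routing follows. The rule name, queue ARN, and the "Glue Job State Change" detail type are assumptions to adapt to your account, and the SQS queue's resource policy must separately allow CloudWatch Events to send messages to it.

```python
import json
import boto3

events = boto3.client("events")

RULE_NAME = "glue-failed-jobs"                                  # placeholder rule name
QUEUE_ARN = "arn:aws:sqs:us-east-1:123456789012:glue-retries"   # placeholder queue ARN

# Match Glue job state-change events whose state is FAILED.
events.put_rule(
    Name=RULE_NAME,
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {"state": ["FAILED"]},
    }),
    State="ENABLED",
)

# Route matching events to the SQS queue so the jobs can be retried later.
events.put_targets(
    Rule=RULE_NAME,
    Targets=[{"Id": "retry-queue", "Arn": QUEUE_ARN}],
)
```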
Amazon SNS is a pub/sub messaging service that can be used as an event-driven computing hub by AWS customers worldwide. By capturing events natively triggered by AWS services, such as EC2, S3, and RDS, you can automate and optimize all kinds of workflows, including scaling, testing, encoding, profiling, broadcasting, discovery, failover, and much more. Business use cases presented in this post ranged from recruiting websites to scientific research, geographic systems, social networks, retail websites, and news portals.
Start now by visiting Amazon SNS in the AWS Management Console, or by trying the AWS 10-Minute Tutorial, Send Fan-out Event Notifications with Amazon SNS and Amazon SQS.