Security updates for Tuesday

2021-12-21

Post Syndicated from original https://lwn.net/Articles/879360/rss

Security updates have been issued by Mageia (log4j), openSUSE (chromium, log4j, netdata, and nextcloud), Oracle (kernel and kernel-container), Red Hat (kernel, kernel-rt, log4j, openssl, postgresql:12, postgresql:13, and virt:rhel and virt-devel:rhel), Slackware (httpd), SUSE (xorg-x11-server), and Ubuntu (firefox).

THG Podcast: Tin Cans: Destroyers and the First Submarines Sunk by the US in Both World Wars

2021-12-21 The History Guy: History Deserves to Be Remembered

Post Syndicated from The History Guy: History Deserves to Be Remembered original https://www.youtube.com/watch?v=RfYjXBlPewM

The collection "I Will Tell You About My Savior" has been published

2021-12-21

Post Syndicated from original https://bivol.bg/%D0%B8%D0%B7%D0%BB%D0%B5%D0%B7%D0%B5-%D0%BE%D1%82-%D0%BF%D0%B5%D1%87%D0%B0%D1%82-%D1%81%D0%B1%D0%BE%D1%80%D0%BD%D0%B8%D0%BA%D1%8A%D1%82-%D1%89%D0%B5-%D0%B2%D0%B8-%D1%80%D0%B0%D0%B7%D0%BA.html

Tuesday, 21 December 2021

The collection "I Will Tell You About My Savior" has been published, featuring works that received awards and nominations in the International Literary Student Competition "Whoever Saves One Human Life Saves an Entire Universe", organized…

Raspberry Pi computers are speeding to the International Space Station

2021-12-21 Olympia Brown

Post Syndicated from Olympia Brown original https://www.raspberrypi.org/blog/astro-pi-rocket-launch-21-space-raspberry-pi-computer/

This morning, our two new Astro Pi units launched into space. Actual, real-life space. The new Astro Pi units each consist of a Raspberry Pi computer with a Raspberry Pi High Quality Camera and a host of sensors, all housed inside a special space-ready case that makes the hardware suitable for the International Space Station (ISS).

The journey to space for two special Raspberry Pi computers

Today’s launch is the culmination of a huge piece of work we’ve done for the European Space Agency to get the new Astro Pi units ready to become part of the European Astro Pi Challenge.

After lift-off from Launch Complex 39A at Kennedy Space Center in Florida, the new Astro Pi units are currently travelling on a SpaceX Falcon 9 rocket carrying the Dragon 2 spacecraft, the module atop the rocket. You can watch the launch again here.

SpaceX’s Falcon 9 rocket carrying the Crew Dragon spits fire as it lifts off from Kennedy Space Center in Florida. — A SpaceX rocket is delivering the special Raspberry Pi computers to the ISS today. © SpaceX

Also travelling with our Astro Pi units are food and some Christmas presents for the astronauts on board the ISS; materials for a study of the delivery of cancer drugs; a bioprinter for experiments investigating wound healing; and materials for a study of how detergents work in microgravity.

The Dragon 2 spacecraft will berth with the ISS tomorrow, with NASA astronauts Raja Chari and Tom Marshburn monitoring its arrival. ESA astronaut Matthias Maurer and another colleague will be there to unpack its cargo. You can watch the process of unpacking tomorrow, Wed 22 December, at 8.30am GMT / 9.30am CET. In the new year, Matthias will be switching our Astro Pi units on and getting them ready to run the code written by young people participating in the European Astro Pi Challenge. The new Astro Pi units will replace Astro Pi units Ed and Izzy, which have been on the ISS for 6 years — ever since the very first Astro Pi Challenge with British ESA astronaut Tim Peake in 2015.

The International Space Station. — The International Space Station, where the special Raspberry Pi computers will arrive tomorrow, © ESA–L. Parmitano, CC BY-SA 3.0 IGO

We’re looking forward to seeing the amazing experiments this year’s Astro Pi Mission Space Lab teams will perform on the new hardware, and what they’ll discover about life on Earth and in space. We also can’t wait to see what the young people participating in Astro Pi Mission Zero will name the new Astro Pi units!

Building space-ready Astro Pi units

None of us on the team working on the Astro Pi Challenge here at the Foundation are aerospace engineers. While building the new Astro Pi units, we’ve learned so much.

Animation of how the components of the Mark 2 Astro Pi hardware unit fit together.

To get the Astro Pis ready to be loaded onto the rocket has been a project of more than three years. That’s because, in addition to manufacturing the Astro Pi units, we also had to ensure they pass the necessary safety and certification process. The official name for this is the Safety Gate process. It’s been set up by ESA and NASA to ensure that any items sent to the ISS are safe to operate on board the station.

For the three separate safety panels the Astro Pi units needed to get through, we put the units through different tests and completed various safety reports. The tests included:

A vibration test: To make sure the Astro Pi units survive the rigours of the launch, we tested them using the sophisticated rigs at Airbus in Portsmouth. These rigs are capable of simulating the vibrations produced by various different launch vehicles. We needed to test all possible options, because the Astro Pi units didn’t have a confirmed vehicle to travel to the ISS yet.

A vibration test of the new Raspberry Pi-powered Astro Pi units at Airbus in Portsmouth

A thermal test: To make sure no harm can possibly come to the crew from the Astro Pi units, we needed to check that the touch temperature of the Astro Pi units’ surface is never above 45°C.

A heat test of the new Raspberry Pi-powered Astro Pi units.

A test for sharp edges: Each Astro Pi unit also needed to be manually inspected by someone wearing a latex glove who carefully feels the case for sharp edges.

Testing the new Raspberry Pi-powered Astro Pi units for sharp edges using a latex glove.

Stringent, military-grade electromagnetic emissions and susceptibility tests: These are required to guarantee that the Astro Pi units won’t interfere with any ISS systems, and that the units themselves are not affected by other equipment on board.

EMC test of the new Raspberry Pi-powered Astro Pi units.

We built two additional Astro Pi units and sent them to NASA so that they could test that plugging the units into the ISS power grid wouldn’t cause a power overload.

For almost all of these tests, we created custom software to do things like stress the Astro Pi units’ processors, saturate the network links, and generally make the units work as hard as possible.

To accompany these safety and test reports, we also had to create the Flight Safety Data Package (FSDP), which contains exact technical information about every component of the Astro Pi hardware, and about all the necessary safety controls to qualify the use of certain materials and safely manage operation of the units. The current FSDP paperwork stands at over 700 pages, which thankfully we haven’t had to actually print out!

Young people’s code will run on the new Astro Pi units next year — is yours on board?

All of this work culminated today in the Astro Pis being launched up into space from Cape Canaveral. And we’re doing all this so that more young people can take part in the European Astro Pi Challenge and send messages to the ISS astronauts using code as part of Mission Zero, or write code for new, ambitious experiments to run on the ISS as part of Mission Space Lab.

Young people can take part in Astro Pi Mission Zero right now! Mission Zero is a beginners’ coding activity for all young people under the age of 19 in ESA member and associate states. It gives them the chance to write code to show their own message to the astronauts on board the ISS using the Astro Pi units. And this time, Mission Zero participants can also vote to name the new Astro Pi units!

I want to help a young person take part in Mission Zero

To participate, young people follow our step-by-step instructions to write their Mission Zero code. As an adult supporting a young person on Mission Zero, all you need to do is sign up as a mentor to get them a registration code for their Mission Zero entry. Once your young person’s code has run in space, we’ll send you a special certificate for them showing where the ISS, and the Astro Pi computers, were when their code ran.

Inspire a young person to learn about coding and space science today with Astro Pi Mission Zero!

The post Raspberry Pi computers are speeding to the International Space Station appeared first on Raspberry Pi.

Z9 CAR Autofocus 🏎️

2021-12-21 Matt Granger

Post Syndicated from Matt Granger original https://www.youtube.com/watch?v=2bXYTbnHk8U

Comic for 2021.12.21

2021-12-21 Explosm.net

Post Syndicated from Explosm.net original http://explosm.net/comics/6060/

New Cyanide and Happiness Comic

[$] Content blockers and Chrome’s Manifest V3

2021-12-21

Post Syndicated from original https://lwn.net/Articles/879063/rss

A clarion call from the Electronic Frontier Foundation (EFF) warning about upcoming changes to the Chrome
browser’s extension API was not the first such—from the EFF or from
others. The time of the switch to Manifest
V3, as the new API is known, is growing closer; privacy advocates are
concerned that it will preclude a number of techniques that browser
extensions use for features like ad and tracker blocking. Part of the
concern stems from the fact that Google is both the developer of a popular
web browser and the operator of an enormous advertising network so its
incentives seem, at least, plausibly misaligned.

Our Response to the Log4j Vulnerability

2021-12-20 Mark Potter

Post Syndicated from Mark Potter original https://www.backblaze.com/blog/our-response-to-the-log4j-vulnerability/

When the director of the Cybersecurity and Infrastructure Security Agency calls a vulnerability “one of the most serious I’ve seen in my entire career, if not the most serious,” ears perk up.

The director was referring to the Apache Log4j vulnerability that was discovered this month. Some more colorful phrases used to describe the Log4j incident include: “a grave threat,” “a design failure of catastrophic proportions,” something that will “haunt the internet for years.”

The vulnerability proceeded to set off five-alarm fires in IT, security, and operations departments around the world. Or should have, at least. Researchers estimate at least 840,000 attacks have been launched via the vulnerability since it was discovered. That is to say, if you’re using a software or cloud vendor that hasn’t made some kind of statement or taken corrective action, you should be asking them why not.

At Backblaze, we made the decision to take our servers temporarily offline in order to hunt down potential threats, apply the appropriate security patches, and test those patches to help prevent our systems from being compromised. This post explains why we made the decision, outlines the actions we took to meet our objective of securing customer data as well as our environment, and provides more insight into our process.

What Is the Log4j Vulnerability?

As reported by ArsTechnica, a zero-day vulnerability was discovered in the Apache Log4j logging library that enables attackers to take control of vulnerable servers. Though it may not be an immediately recognizable name, Log4j is widely used throughout the world by companies like Apple, Twitter, and Tesla as well as the game Minecraft. The library allows developers to easily log application events. The Cybersecurity & Infrastructure Security Agency (CISA) urged users to apply patches immediately to address the vulnerabilities.

Our Decision

Upon learning of the Log4j vulnerability, our team took swift action to investigate and assess available options to address the potential impacts since Log4j is leveraged widely in our environment. As part of our investigation, our internal team used a nondestructive form of the exploit to confirm our vulnerability. We also noted close to 80,000 unsuccessful Log4j exploit attempts on our sites in a 12-hour period. The level of activity, along with our success using the exploit (albeit with internal knowledge of our own systems), was very concerning to us.

Although we were not aware of any unauthorized access to our systems due to the Log4j vulnerability, out of an abundance of caution, we decided it was in our customers’ best interest to take systems offline until they could be patched. The decision to take our systems offline was not one we took lightly. However, our Incident Management Guidelines are quite clear. In a crisis where tradeoffs must be made, our descending list of priorities (all of which are very important to us) is as follows:

Health & Safety.
Data Integrity & Confidentiality.
Service Availability.
Service Performance.

Protecting customer data integrity is second only to health and safety and above service availability. That said, the decision to temporarily bring all services down was unprecedented in the 14-year history of Backblaze. This was an extraordinary case where we made a decision to take a necessary action to address an imminent risk of a vulnerability with a Common Vulnerability Scoring System (CVSS) score of 10.0—the highest possible score. We believe that we needed to take preventative steps to protect customer data by temporarily taking our services offline until the security patching process was complete.

What Actions Have We Taken?

A recap of recent actions is outlined below:

Upon learning of the Log4j vulnerability, our Security team took immediate action to investigate.
Based on our assessment of the potential threat, we decided to temporarily take our services offline to apply a security patch to prevent our systems from potentially being compromised.
We announced our systems had been taken offline at 5:20 p.m. PT on December 10, 2021.*
We announced our systems were back online and functioning normally at 3:01 a.m. PT on December 11, 2021.
Based on our investigation, we also determined that there was no evidence of our systems being compromised or unauthorized access to customer data or files due to the Log4j vulnerability.

*We decided not to announce downtime publicly until after our systems were offline to avoid any elevation of priority to those targeting our services. Accordingly, we did not make a public announcement until after the servers were disconnected.

Was Backblaze Compromised?

We have not found any evidence of system compromise or unauthorized access to customer data or files at this time.

Next Steps

As part of our incident response process, we always look for ways to do better and identify areas for improvement. In this case, our two top priorities moving forward are to apply security patches faster and to reduce downtime.

Thank you to our customers for your understanding as we navigated this challenging incident.

The post Our Response to the Log4j Vulnerability appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Beware The CopyLEFT Trolls (Techdirt)

2021-12-20

Post Syndicated from original https://lwn.net/Articles/879263/rss

Techdirt looks
at the problem of copyleft trolls, and those who target users of
Creative Commons materials in particular.

However, in the end, they are still licenses, and those licenses
are still backed by copyright — which means that if you don’t
abide by the specifics of the Creative Commons license, you could
very much be liable for copyright infringement. Enter the copyleft
trolls. They search for those using CC-licensed works, but not
following the exact terms of the license, and then resort to the
typical copyright troll shakedown game.

Build a modern data architecture on AWS with Amazon AppFlow, AWS Lake Formation, and Amazon Redshift.

2021-12-20 Dr. Yannick Misteli

Post Syndicated from Dr. Yannick Misteli original https://aws.amazon.com/blogs/big-data/build-a-modern-data-architecture-on-aws-with-amazon-appflow-aws-lake-formation-and-amazon-redshift/

This is a guest post written by Dr. Yannick Misteli, lead cloud platform and ML engineering in global product strategy (GPS) at Roche.

The Roche Data Insights (RDI) initiative was recently launched to achieve our vision of building shared, interoperable data and insights with federated governance through new ways of working and collaboration. It also aims to establish a simplified, integrated data landscape that empowers insights communities. One of the first domains to engage in this program is the Go-to-Market (GTM) area, which comprises sales, marketing, medical access, and market affairs at Roche. The GTM domain enables Roche to understand customers and ultimately to create and deliver valuable services that meet their needs. GTM as a domain extends beyond health care professionals (HCPs) to a larger healthcare ecosystem consisting of patients, communities, health authorities, payers, providers, academia, competitors, and others. Data and analytics are therefore key in supporting internal and external stakeholders in their decision-making processes through actionable insights.

Roche GTM built a modern data and machine learning (ML) platform on AWS while utilizing DevOps best practices. The mantra of everything as code (EaC) was key in building a fully automated, scalable data lake and data warehouse on AWS.

In this post, you learn about how Roche used AWS products and services such as Amazon AppFlow, AWS Lake Formation, and Amazon Redshift to provision and populate their data lake; how they sourced, transformed, and loaded data into the data warehouse; and how they realized best practices in security and access control.

In the following sections, you dive deep into the scalable, secure, and automated modern data platform that Roche has built. We demonstrate how to automate data ingestion, enforce security standards, and use DevOps best practices to ease management of your modern data platform on AWS.

Data platform architecture

The following diagram illustrates the data platform architecture.

The architecture contains the following components:

The core infrastructure is deployed from a GitLab Runner container using the AWS Cloud Development Kit (AWS CDK)
Amazon AppFlow flows gather objects from Salesforce and place them into an Amazon Simple Storage Service (Amazon S3) bucket
Processed data from the S3 bucket is also loaded back to Salesforce using Amazon AppFlow
AWS Glue jobs load data from external databases (for example Oracle) and APIs to the S3 bucket
DBT ELT jobs, invoked using Argo (an open source project) on Amazon Elastic Kubernetes Service (Amazon EKS), transform and cleanse data in external Amazon Redshift tables
An Amazon EKS cluster running Kubeflow interacts with an Amazon Redshift cluster for data science ML experimentation and use cases
Lake Formation secures the S3 buckets, AWS Glue databases, and tables aligned to specific AWS Identity and Access Management (IAM) roles
AWS Firewall Manager secures the entire AWS Cloud perimeter

Lake Formation security

We use Lake Formation to secure all data as it lands in the data lake. Separating each data lake layer into distinct S3 buckets and prefixes enables fine-grained access control policies that Lake Formation implements. This concept also extends to locking down access to specific rows and columns and applying policies to specific IAM roles and users. Governance and access to data lake resources is difficult to manage, but Lake Formation simplifies this process for administrators.

To secure access to the data lake using Lake Formation, the following steps are automated using the AWS CDK with customized constructs:

Register the S3 data buckets and prefixes, and corresponding AWS Glue databases with Lake Formation.
Add data lake administrators (GitLab runner IAM deployment role and administrator IAM role).
Grant the AWS Glue job IAM roles access to the specific AWS Glue databases.
Grant the AWS Lambda IAM role access to the Amazon AppFlow databases.
Grant the listed IAM roles access to the corresponding tables in the AWS Glue databases.
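The post automates these steps with custom AWS CDK constructs that are not shown. As a rough illustration of what a single grant amounts to, the following sketch builds request payloads in the shape accepted by the boto3 Lake Formation client's grant_permissions call. The role ARN, database name, and chosen permissions are hypothetical, and this is not the CDK code Roche uses.

```python
from typing import Dict, List


def build_database_grant(role_arn: str, database_name: str,
                         permissions: List[str]) -> Dict:
    """Build keyword arguments for a database-level Lake Formation grant."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"Database": {"Name": database_name}},
        "Permissions": permissions,
    }


def build_all_tables_grant(role_arn: str, database_name: str,
                           permissions: List[str]) -> Dict:
    """Build keyword arguments granting permissions on every table in a database."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"Table": {"DatabaseName": database_name, "TableWildcard": {}}},
        "Permissions": permissions,
    }


# Hypothetical role and database names, for illustration only.
GLUE_JOB_ROLE = "arn:aws:iam::123456789012:role/example-glue-job-role"
grants = [
    build_database_grant(GLUE_JOB_ROLE, "example_raw_db", ["DESCRIBE"]),
    build_all_tables_grant(GLUE_JOB_ROLE, "example_raw_db", ["SELECT", "DESCRIBE"]),
]

# With AWS credentials available, each payload could be applied like this:
#   import boto3
#   client = boto3.client("lakeformation")
#   for grant in grants:
#       client.grant_permissions(**grant)
```

Keeping the payload construction separate from the API call makes the grants easy to review or unit test before anything is applied to an account.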

AWS Glue Data Catalog

The AWS Glue Data Catalog is the centralized registration and access point for all databases and tables that are created in both the data lake and in Amazon Redshift. This provides centralized transparency to all resources along with their schemas and the location of all data that is referenced. This is a critical aspect for any data operations performed within the lake house platform.

Data sourcing and ingestion

Data is sourced and loaded into the data lake through the use of AWS Glue jobs and Amazon AppFlow. The ingested data is made available in the Amazon Redshift data warehouse through Amazon Redshift Spectrum using external schemas and tables. The process of creating the external schemas and linking them to the Data Catalog is outlined later in this post.

Amazon AppFlow Salesforce ingestion

Amazon AppFlow is a fully-managed integration service that allows you to pull data from sources such as Salesforce, SAP, and Zendesk. Roche integrates with Salesforce to load Salesforce objects securely into their data lake without needing to write any custom code. Roche also pushes ML results back to Salesforce using Amazon AppFlow to facilitate the process.

Salesforce objects are first fully loaded into Amazon S3 and then are flipped to a daily incremental load to capture deltas. The data lands in the raw zone bucket in Parquet format using the date as a partition. The Amazon AppFlow flows are created through the use of a YAML configuration file (see the following code). This configuration is consumed by the AWS CDK deployment to create the corresponding flows.

appflow:
  flow_classes:
    salesforce:
      source: salesforce
      destination: s3
      incremental_load: 1
      schedule_expression: "rate(1 day)"
      s3_prefix: na
      connector_profile: roche-salesforce-connector-profile1,roche-salesforce-connector-profile2
      description: appflow flow from Salesforce
      environment: all
  - name: Account
    incremental_load: 1
    bookmark_col: appflow_date_str
  - name: CustomSalesforceObject
    pii: 0
    bookmark_col: appflow_date_str
    upsert_field_list: upsertField
    s3_prefix: prefix
    source: s3
    destination: salesforce
    schedule_expression: na
    connector_profile: roche-salesforce-connector-profile

The YAML configuration makes it easy to select whether data should be loaded from an S3 bucket back to Salesforce or from Salesforce to an S3 bucket. This configuration is subsequently read by the AWS CDK app and corresponding stacks to translate into Amazon AppFlow flows.

The following options are specified in the preceding YAML configuration file:

source – The location to pull data from (Amazon S3, Salesforce)
destination – The destination to put data to (Amazon S3, Salesforce)
object_name – The name of the Salesforce object to interact with
incremental_load – A Boolean specifying if the load should be incremental or full (0 means full, 1 means incremental)
schedule_expression – The cron or rate expression to run the flow (na makes it on demand)
s3_prefix – The prefix to push or pull the data from in the S3 bucket
connector_profile – The Amazon AppFlow connector profile name to use when connecting to Salesforce (can be a CSV list)
environment – The environment to deploy this Amazon AppFlow flow to (all means deploy to dev and prod, dev means development environment, prod means production environment)
upsert_field_list – The set of Salesforce object fields (can be a CSV list) to use when performing an upsert operation back to Salesforce (only applicable when loading data from an S3 bucket back to Salesforce)
bookmark_col – The name of the column to use in the Data Catalog for registering the daily load date string partition
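How the AWS CDK app reads this configuration is not shown in the post. The following is a minimal sketch of one plausible approach, assuming that each per-object entry overrides the defaults of its flow class. The dictionaries mirror the YAML above, while the function names and the derived connector_profiles and on_demand keys are hypothetical.

```python
from typing import Dict, List

# Class-level defaults, mirroring the "salesforce" flow class in the YAML above.
FLOW_CLASS_DEFAULTS = {
    "source": "salesforce",
    "destination": "s3",
    "incremental_load": 1,
    "schedule_expression": "rate(1 day)",
    "s3_prefix": "na",
    "connector_profile": "roche-salesforce-connector-profile1,roche-salesforce-connector-profile2",
    "environment": "all",
}

# Per-object entries; any key set here overrides the class default (an assumption).
FLOW_ENTRIES = [
    {"name": "Account", "incremental_load": 1, "bookmark_col": "appflow_date_str"},
    {
        "name": "CustomSalesforceObject",
        "source": "s3",
        "destination": "salesforce",
        "schedule_expression": "na",
        "upsert_field_list": "upsertField",
        "connector_profile": "roche-salesforce-connector-profile",
    },
]


def resolve_flow(entry: Dict, defaults: Dict) -> Dict:
    """Merge one flow entry over the class defaults and normalize CSV fields."""
    resolved = {**defaults, **entry}
    # connector_profile may be a CSV list, so split it into individual profiles.
    resolved["connector_profiles"] = [
        p.strip() for p in str(resolved.pop("connector_profile")).split(",") if p.strip()
    ]
    # "na" as the schedule expression means the flow runs on demand.
    resolved["on_demand"] = resolved["schedule_expression"] == "na"
    return resolved


def deploys_to(resolved: Dict, environment: str) -> bool:
    """Return True if the flow should be deployed to the given environment."""
    return resolved["environment"] in ("all", environment)


flows: List[Dict] = [resolve_flow(e, FLOW_CLASS_DEFAULTS) for e in FLOW_ENTRIES]
for flow in flows:
    print(flow["name"], flow["source"], "->", flow["destination"], flow["connector_profiles"])
```

Resolving each entry into a flat dictionary up front means the code that creates the flows does not need to know which values came from the class and which from the object.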

Register Salesforce objects to the Data Catalog

Complete the following steps to register data loaded into the data lake with the Data Catalog and link it to Amazon Redshift:

Gather Salesforce object fields and corresponding data types.
Create a corresponding AWS Glue database in the Data Catalog.
Run a query against Amazon Redshift to create an external schema that links to the AWS Glue database.
Create tables in the AWS Glue database and partitions in those tables.

Data is accessible via the Data Catalog and the Amazon Redshift cluster.

Amazon AppFlow dynamic field gathering

To construct the schema of the loaded Salesforce object in the data lake, you invoke the following Python function. The code utilizes an Amazon AppFlow client from Boto3 to dynamically gather the Salesforce object fields to construct the Salesforce object’s schema.

import boto3

client = boto3.client('appflow')

def get_salesforce_object_fields(object_name: str, connector_profile: str):
    """
    Gathers the Salesforce object and its corresponding fields.

    Parameters:
        object_name (str) = the name of the Salesforce object to consume.
        connector_profile (str) = the name of AppFlow Connector Profile.

    Returns:
        object_fields (list) = a list of the object's fields and datatype (a list of dictionaries with 'field' and 'data_type' keys).
    """
    print("Gathering Object Fields")

    object_fields = []

    response = client.describe_connector_entity(
        connectorProfileName=connector_profile,
        connectorEntityName=object_name,
        connectorType='Salesforce'
    )

    for obj in response['connectorEntityFields']:
        object_fields.append(
            {'field': obj['identifier'], 'data_type': obj['supportedFieldTypeDetails']['v1']['fieldType']})

    return object_fields

We use the function for both the creation of the Amazon AppFlow flow via the AWS CDK deployment and for creating the corresponding table in the Data Catalog in the appropriate AWS Glue database.

Create an Amazon CloudWatch Events rule, AWS Glue table, and partition

To add new tables (one per Salesforce object loaded into Amazon S3) and partitions into the Data Catalog automatically, you create an Amazon CloudWatch Events rule. This automation makes the data queryable in both AWS Glue and Amazon Redshift.

After the Amazon AppFlow flow is complete, it invokes a CloudWatch Events rule and a corresponding Lambda function to either create a new table in AWS Glue or add a new partition with the corresponding date string for the current day. The CloudWatch Events rule looks like the following screenshot.
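As a text alternative to the screenshot, the following sketch expresses such a rule as an event pattern. It assumes that AppFlow publishes its end-of-run event with the source aws.appflow and the detail type "AppFlow End Flow Run Report"; check both values against the AppFlow documentation before relying on them. The flow name, the sample event, and the helper functions are hypothetical, and the matcher covers only a small subset of the real pattern language.

```python
from datetime import datetime
from typing import Dict, Tuple

# Event pattern for the rule. The flow name is a hypothetical example.
EVENT_PATTERN = {
    "source": ["aws.appflow"],
    "detail-type": ["AppFlow End Flow Run Report"],
    "detail": {"flow-name": ["example-salesforce-account-flow"]},
}


def matches(pattern: Dict, event: Dict) -> bool:
    """Simplified pattern match: every pattern key must match the event.

    Handles only exact-value lists and nested dictionaries, not the full
    event pattern language.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


def partition_from_event(event: Dict) -> Tuple[str, str, str]:
    """Derive (year, month, day) strings from the event envelope's UTC timestamp."""
    ts = datetime.strptime(event["time"], "%Y-%m-%dT%H:%M:%SZ")
    return ts.strftime("%Y"), ts.strftime("%m"), ts.strftime("%d")


# Hypothetical sample event, trimmed to the fields used here.
sample_event = {
    "source": "aws.appflow",
    "detail-type": "AppFlow End Flow Run Report",
    "time": "2021-12-20T03:00:00Z",
    "detail": {"flow-name": "example-salesforce-account-flow", "status": "Execution Successful"},
}

if matches(EVENT_PATTERN, sample_event):
    year, month, day = partition_from_event(sample_event)
    print(f"register partition {year}{month}{day}")
```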

The invoked Lambda function uses the AWS Data Wrangler Python package (awswrangler) to interact with the Data Catalog. Using the preceding function definition, the object fields and their data types are accessible to pass to the following function call:

import awswrangler as wr

def create_external_parquet_table(
    database_name: str, 
    table_name: str, 
    s3_path: str, 
    columns_map: list,
    partition_map: dict
):
    """
    Creates a new external table in Parquet format.

    Parameters:
        database_name (str) = the name of the database to create the table in.
        table_name (str) = the name of the table to create.
        s3_path (str) = the S3 path to the data set.
        columns_map (list) = a list of dictionaries with 'field' and 'data_type' keys, as returned by appflow_utility.get_salesforce_object_fields
        partition_map (dict) = a map of the partitions for the parquet table as {'column_name': 'column_type'}

    Returns:
        table_metadata (dict) = metadata about the table that was created.
    """

    column_type_map = {}

    # Each entry has the shape produced by get_salesforce_object_fields.
    for field in columns_map:
        column_type_map[field['field']] = field['data_type']

    return wr.catalog.create_parquet_table(
        database=database_name,
        table=table_name,
        path=s3_path,
        columns_types=column_type_map,
        partitions_types=partition_map,
        description=f"AppFlow ingestion table for {table_name} object"
    )

If the table already exists, the Lambda function creates a new partition to account for the date in which the flow completed (if it doesn’t already exist):

import awswrangler as wr
from botocore.exceptions import ClientError

def create_parquet_table_date_partition(
    database_name: str, 
    table_name: str, 
    s3_path: str, 
    year: str, 
    month: str, 
    day: str
):
    """
    Creates a new partition by the date (YYYYMMDD) on an existing parquet table.

    Parameters:
        database_name (str) = the name of the database that contains the table.
        table_name (str) = the name of the table to add the partition to.
        s3_path (str) = the S3 path to the data set.
        year(str) = the current year for the partition (YYYY format).
        month (str) = the current month for the partition (MM format).
        day (str) = the current day for the partition (DD format).
    
    Returns:
        table_metadata (dict) = metadata about the table that has a new partition
    """

    date_str = f"{year}{month}{day}"
    
    return wr.catalog.add_parquet_partitions(
        database=database_name,
        table=table_name,
        partitions_values={
            f"{s3_path}/{year}/{month}/{day}": [date_str]
        }
    )
    
def table_exists(
    database_name: str, 
    table_name: str
):
    """
    Checks if a table exists in the Glue catalog.

    Parameters:
        database_name (str) = the name of the Glue Database where the table should be.
        table_name (str) = the name of the table.
    
    Returns:
        exists (bool) = returns True if the table exists and False if it does not exist.
    """

    try:
        wr.catalog.table(database=database_name, table=table_name)
        return True
    except ClientError as e:
        return False
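The way the Lambda function chooses between these helpers is described in the text but not shown. The following sketch follows the either/or wording of the post: add a partition when the table exists, otherwise create the table. Whether a newly created table also needs its first partition added right away is not stated, so that step is left out. The helpers are passed in as callables, and the stand-ins below are hypothetical so that the sketch runs without AWS access.

```python
from typing import Callable, List


def register_flow_run(
    database_name: str,
    table_name: str,
    s3_path: str,
    year: str,
    month: str,
    day: str,
    table_exists: Callable[[str, str], bool],
    create_table: Callable[[str, str, str], None],
    add_partition: Callable[[str, str, str, str, str, str], None],
) -> str:
    """Add a date partition if the table exists, otherwise create the table."""
    if table_exists(database_name, table_name):
        add_partition(database_name, table_name, s3_path, year, month, day)
        return "partition_added"
    create_table(database_name, table_name, s3_path)
    return "table_created"


# Stand-in callables that record what would have been done.
calls: List[str] = []
existing_tables = {("example_db", "account")}


def fake_table_exists(database_name: str, table_name: str) -> bool:
    return (database_name, table_name) in existing_tables


def fake_create_table(database_name: str, table_name: str, s3_path: str) -> None:
    calls.append(f"create {database_name}.{table_name}")


def fake_add_partition(database_name, table_name, s3_path, year, month, day) -> None:
    calls.append(f"partition {database_name}.{table_name} {year}{month}{day}")


first = register_flow_run(
    "example_db", "account", "s3://example-bucket/account", "2021", "12", "20",
    fake_table_exists, fake_create_table, fake_add_partition,
)
second = register_flow_run(
    "example_db", "contact", "s3://example-bucket/contact", "2021", "12", "20",
    fake_table_exists, fake_create_table, fake_add_partition,
)
print(first, second)
```

In the Lambda function, the stand-ins would be replaced by table_exists, a wrapper around create_external_parquet_table, and create_parquet_table_date_partition.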

Amazon Redshift external schema query

An AWS Glue database is created for each Amazon AppFlow connector profile that is present in the preceding configuration. The objects that are loaded from Salesforce into Amazon S3 are registered as tables in the Data Catalog under the corresponding database. To link the database in the Data Catalog with an external Amazon Redshift schema, run the following query:

CREATE EXTERNAL SCHEMA ${connector_profile_name}_ext from data catalog
database '${appflow_connector_profile_name}'
iam_role 'arn:aws:iam::${AWS_ACCOUNT_ID}:role/RedshiftSpectrumRole'
region 'eu-west-1';

The specified iam_role value must be an IAM role created ahead of time and must have the appropriate access policies specified to query the Amazon S3 location.

Now, all the tables available in the Data Catalog can be queried using SQL locally in Amazon Redshift Spectrum.
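Because one external schema is needed per connector profile, the query lends itself to templating. The following sketch renders the statement with Python's string.Template, whose ${...} placeholder syntax matches the query above. The function name, the input validation, and the replacement of hyphens with underscores in the schema name are assumptions rather than part of the original deployment.

```python
import re
from string import Template

# Same statement as above, with ${...} placeholders left for substitution.
EXTERNAL_SCHEMA_SQL = Template(
    "CREATE EXTERNAL SCHEMA ${connector_profile_name}_ext from data catalog\n"
    "database '${appflow_connector_profile_name}'\n"
    "iam_role 'arn:aws:iam::${AWS_ACCOUNT_ID}:role/RedshiftSpectrumRole'\n"
    "region 'eu-west-1';"
)


def render_external_schema_sql(connector_profile: str, aws_account_id: str) -> str:
    """Render the CREATE EXTERNAL SCHEMA statement for one connector profile."""
    # Reject unexpected characters, since the values are interpolated into SQL.
    if not re.fullmatch(r"[A-Za-z0-9_-]+", connector_profile):
        raise ValueError("unexpected characters in connector profile name")
    if not re.fullmatch(r"\d{12}", aws_account_id):
        raise ValueError("AWS account IDs are 12 digits")
    return EXTERNAL_SCHEMA_SQL.substitute(
        # Assumption: hyphens are swapped for underscores in the schema name.
        connector_profile_name=connector_profile.replace("-", "_"),
        appflow_connector_profile_name=connector_profile,
        AWS_ACCOUNT_ID=aws_account_id,
    )


sql = render_external_schema_sql("roche-salesforce-connector-profile", "123456789012")
print(sql)
```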

Amazon AppFlow Salesforce destination

Roche trains and invokes ML models using data found in the Amazon Redshift data warehouse. After the ML models are complete, the results are pushed back into Salesforce. Through the use of Amazon AppFlow, we can achieve the data transfer without writing any custom code. The schema of the results must match the schema of the corresponding Salesforce object, and the format of the results must be written in either JSON lines or CSV format in order to be written back into Salesforce.
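The schema and format requirement can be checked before the results file is written. The following sketch serializes result records as JSON lines and rejects any record whose fields do not match the expected Salesforce object fields. The field names and the function are hypothetical, and uploading the output to the S3 prefix that the flow reads from is left out.

```python
import json
from typing import Dict, Iterable, List


def to_json_lines(records: Iterable[Dict], expected_fields: List[str]) -> str:
    """Serialize records as JSON lines, enforcing the expected field set."""
    lines = []
    for index, record in enumerate(records):
        if set(record) != set(expected_fields):
            raise ValueError(
                f"record {index} has fields {sorted(record)}, expected {sorted(expected_fields)}"
            )
        # Emit fields in a stable order, one JSON object per line.
        lines.append(json.dumps({name: record[name] for name in expected_fields}))
    return "".join(line + "\n" for line in lines)


# Hypothetical fields of the target Salesforce object.
EXPECTED_FIELDS = ["upsertField", "Score__c"]

results = [
    {"upsertField": "A-001", "Score__c": 0.91},
    {"upsertField": "A-002", "Score__c": 0.42},
]

payload = to_json_lines(results, EXPECTED_FIELDS)
print(payload, end="")
```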

AWS Glue Jobs

To source on-premises data feeds into the data lake, Roche has built a set of AWS Glue jobs in Python. There are various external sources including databases and APIs that are directly loaded into the raw zone S3 bucket. The AWS Glue jobs are run on a daily basis to load new data. The data that is loaded follows the partitioning scheme of YYYYMMDD format in order to more efficiently store and query datasets. The loaded data is then converted into Parquet format for more efficient querying and storage purposes.

Amazon EKS and Kubeflow

Roche uses Kubeflow on Amazon EKS to deploy ML models. The use of Amazon EKS as the backbone infrastructure makes it easy to build, train, test, and deploy ML models and interact with Amazon Redshift as a data source.

Firewall Manager

As an added layer of security, Roche takes extra precautions through the use of Firewall Manager. This allows Roche to explicitly deny or allow inbound and outbound traffic through the use of stateful and stateless rule sets. This also enables Roche to allow certain outbound access to external websites and deny websites that they don’t want resources inside of their Amazon VPC to have access to. This is critical especially when dealing with any sensitive datasets to ensure that data is secured and has no chance of being moved externally.

CI/CD

All the infrastructure outlined in the architecture diagram was automated and deployed to multiple AWS Regions using a continuous integration and continuous delivery (CI/CD) pipeline with GitLab Runners as the orchestrator. The GitFlow model was used for branching and invoking automated deployments to the Roche AWS accounts.

Infrastructure as code and AWS CDK

Infrastructure as code (IaC) best practices were used to facilitate the creation of all infrastructure. The Roche team uses the Python AWS CDK to deploy, version, and maintain any changes that occur to the infrastructure in their AWS account.

AWS CDK project structure

The top level of the project structure in GitLab includes the following folders (while not limited to just these folders) in order to keep infrastructure and code all in one location.

To facilitate the various resources that are created in the Roche account, the deployment was broken into the following AWS CDK apps, which encompass multiple stacks:

core
data_lake
data_warehouse

The core app contains all the stacks related to account setup and account bootstrapping, such as:

VPC creation
Initial IAM roles and policies
Security guardrails

The data_lake app contains all the stacks related to creating the AWS data lake, such as:

Lake Formation setup and registration
AWS Glue database creation
S3 bucket creation
Amazon AppFlow flow creation
AWS Glue job setup

The data_warehouse app contains all the stacks related to setting up the data warehouse infrastructure, such as:

Amazon Redshift cluster
Load balancer to Amazon Redshift cluster
Logging

The AWS CDK project structure described was chosen to keep the deployment flexible and to logically group together stacks that rely on each other. Because the provisioning is decoupled in this way, deployments can be broken out by function and run only when they are actually required.

AWS CDK project configuration

Project configurations are flexible and abstracted away as YAML configuration files. For example, Roche has simplified the process of creating a new Amazon AppFlow flow and can add or remove flows as needed simply by adding or removing an entry in their YAML configuration. The next time the GitLab runner deployment occurs, it picks up the changes on AWS CDK synthesis to generate a new change set with the new set of resources. This configuration and setup keeps things dynamic and flexible while decoupling configuration from code.
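To make the idea concrete, here is a minimal sketch of how a configuration entry can drive which flows exist. A plain dictionary stands in for the parsed YAML file, and the key names (appflow_flows, name, object) are hypothetical. In the real pipeline, AWS CloudFormation computes the change set; this code only illustrates the principle.

```python
# Stand-in for the parsed YAML configuration; all keys are hypothetical.
CONFIG = {
    "appflow_flows": [
        {"name": "sf_account", "object": "Account"},
        {"name": "sf_contact", "object": "Contact"},
    ]
}


def flows_from_config(config):
    """Index the configured flows by name."""
    return {entry["name"]: entry for entry in config.get("appflow_flows", [])}


def plan_flow_changes(config, deployed_names):
    """Compare configured flows with deployed ones and report the difference."""
    desired = set(flows_from_config(config))
    deployed = set(deployed_names)
    return {
        "create": sorted(desired - deployed),
        "delete": sorted(deployed - desired),
    }


if __name__ == "__main__":
    print(plan_flow_changes(CONFIG, ["sf_account", "sf_lead"]))
```

Adding a flow is then a one-entry change to the configuration, with no change to the code that builds the resources.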

Network architecture

The following diagram illustrates the network architecture.

We can break down the architecture into the following:

All AWS services are deployed in two Availability Zones (except Amazon Redshift)
Only private subnets have access to the on-premises Roche environment
Services are deployed in backend subnets
Perimeter protection using AWS Network Firewall
A Network Load Balancer publishes services to the on-premises environment

Network security configurations

Infrastructure, configuration, and security are defined as code in AWS CDK, and Roche uses a CI/CD pipeline to manage and deploy them. Roche has an AWS CDK application to deploy the core services of the project: VPC, VPN connectivity, and AWS security services (AWS Config, Amazon GuardDuty, and AWS Security Hub). The VPC contains four network layers deployed in two Availability Zones, and they have VPC endpoints to access AWS services like Amazon S3, Amazon DynamoDB, and Amazon Simple Queue Service (Amazon SQS). They limit internet access using AWS Network Firewall.

The infrastructure is defined as code and the configuration is segregated. Roche performed the VPC setup by running the CI/CD pipeline to deploy their infrastructure. The configuration lives in a dedicated external file; to change any value of the VPC, Roche simply modifies this file and runs the pipeline again, without writing any new lines of code. This makes changes simple to roll out to their environment, and it makes the configuration more transparent, easier to trace, and simpler to approve.

The following code is an example of the VPC configuration:

"test": {
        "vpc": {
            "name": "",
            "cidr_range": "192.168.40.0/21",
            "internet_gateway": True,
            "flow_log_bucket": shared_resources.BUCKET_LOGGING,
            "flow_log_prefix": "vpc-flow-logs/",
        },
        "subnets": {
            "private_subnets": {
                "private": ["192.168.41.0/25", "192.168.41.128/25"],
                "backend": ["192.168.42.0/23", "192.168.44.0/23"],
            },
            "public_subnets": {
                "public": {
                    "nat_gateway": True,
                    "publics_ip": True,
                    "cidr_range": ["192.168.47.64/26", "192.168.47.128/26"],
                }
            },
            "firewall_subnets": {"firewall": ["192.168.47.0/28", "192.168.47.16/28"]},
        },
        ...
        "vpc_endpoints": {
            "subnet_group": "backend",
            "services": [
                "ec2",
                "ssm",
                "ssmmessages",
                "sns",
                "ec2messages",
                "glue",
                "athena",
                "secretsmanager",
                "ecr.dkr",
                "redshift-data",
                "logs",
                "sts",
            ],
            "gateways": ["dynamodb", "s3"],
            "subnet_groups_allowed": ["backend", "private"],
        },
        "route_53_resolvers": {
            "subnet": "private",
        ...

The advantages of this approach are as follows:

No need to modify the AWS CDK constructor and build new code to change VPC configuration
Central point to manage VPC configuration
Traceability of changes and history of the configuration through Git
Redeploy all the infrastructure in a matter of minutes in other Regions or accounts
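Keeping the network layout in a configuration file also makes it easy to validate before deployment. The following sketch uses Python's standard ipaddress module to check that each subnet is a well-formed network, sits inside the VPC range, and doesn't overlap another subnet. The validate_subnets helper is an illustrative assumption, not part of the Roche code base.

```python
import ipaddress


def validate_subnets(vpc_cidr, subnet_cidrs):
    """Return a list of problems; an empty list means the layout is consistent."""
    problems = []
    vpc = ipaddress.ip_network(vpc_cidr)
    parsed = []
    for cidr in subnet_cidrs:
        try:
            # Strict parsing rejects addresses with host bits set.
            network = ipaddress.ip_network(cidr)
        except ValueError as error:
            problems.append(f"{cidr}: {error}")
            continue
        if not network.subnet_of(vpc):
            problems.append(f"{cidr}: outside VPC range {vpc_cidr}")
        parsed.append(network)
    # Pairwise overlap check across all well-formed subnets.
    for index, first in enumerate(parsed):
        for second in parsed[index + 1:]:
            if first.overlaps(second):
                problems.append(f"{first} overlaps {second}")
    return problems


if __name__ == "__main__":
    subnets = [
        "192.168.41.0/25", "192.168.41.128/25",   # private
        "192.168.42.0/23", "192.168.44.0/23",     # backend
        "192.168.47.64/26", "192.168.47.128/26",  # public
    ]
    print(validate_subnets("192.168.40.0/21", subnets))
```

A check like this can run as an early pipeline step, so a mistyped range fails fast instead of surfacing as a failed stack deployment.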

Operations and alerting

Roche has developed an automated alerting system that fires when any part of the end-to-end architecture encounters an issue, focusing on failures when loading data from AWS Glue or Amazon AppFlow. All logging is published to CloudWatch by default for debugging purposes.

The operational alerts have been built for the following workflow:

AWS Glue jobs and Amazon AppFlow flows ingest data.
If a job fails, it emits an event to a CloudWatch Events rule.
The rule is triggered and invokes a Lambda function to send failure details to an Amazon Simple Notification Service (Amazon SNS) topic.
The SNS topic has a Lambda subscriber that gets invoked:
1. The Lambda function reads out specific webhook URLs from AWS Secrets Manager.
2. The function fires off an alert to the specific external systems.
The external systems receive the message and the appropriate parties are notified of the issue with details.
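The subscriber side of this workflow can be sketched as follows. This is a simplified illustration under several assumptions: the SNS message body is assumed to be a JSON document with jobName, state, and message fields, and the secret lookup and HTTP call are passed in as functions (get_webhook_urls and post are hypothetical names) so the logic can be exercised without AWS access.

```python
import json


def build_alert(sns_record):
    """Turn one SNS record into a small alert payload.

    Assumes the message body is JSON with jobName, state, and message keys.
    """
    detail = json.loads(sns_record["Sns"]["Message"])
    return {
        "job": detail.get("jobName", "unknown"),
        "state": detail.get("state", "unknown"),
        "message": detail.get("message", ""),
    }


def handle(event, get_webhook_urls, post):
    """Send every alert in the event to every configured webhook.

    get_webhook_urls: callable returning a list of URLs (for example, read
    from a secret store). post: callable taking (url, body). Both are
    injected stand-ins, not real AWS SDK calls.
    """
    alerts = [build_alert(record) for record in event.get("Records", [])]
    sent = 0
    for url in get_webhook_urls():
        for alert in alerts:
            post(url, json.dumps(alert))
            sent += 1
    return sent


if __name__ == "__main__":
    sample_event = {
        "Records": [
            {"Sns": {"Message": json.dumps(
                {"jobName": "daily_load", "state": "FAILED", "message": "timeout"}
            )}}
        ]
    }
    handle(sample_event, lambda: ["https://example.invalid/hook"], print)
```

Keeping the webhook URLs out of the function code means that adding or rotating a target system is a secret update rather than a redeployment.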

The following architecture outlines the alerting mechanisms built for the lake house platform.

Conclusion

The GTM (Go-To-Market) domain has been successful in enabling its business stakeholders, data engineers, and data scientists by providing a platform that is extendable to the many use cases that Roche faces. It is a key enabler and an accelerator for the GTM organization in Roche. Through a modern data platform, Roche is now able to better understand customers and ultimately create and deliver valuable services that meet their needs. It extends beyond health care professionals (HCPs) to a larger healthcare ecosystem. The platform and infrastructure in this blog help to support and accelerate both internal and external stakeholders in their decision-making processes through actionable insights.

The steps in this post can help you plan to build a similar modern data strategy using AWS managed services to ingest data from sources like Salesforce, automatically create metadata catalogs and share data seamlessly between the data lake and data warehouse, and create alerts in the event of an orchestrated data workflow failure. In part 2 of this post, you learn about how the data warehouse was built using an agile data modeling pattern and how ELT jobs were quickly developed, orchestrated, and configured to perform automated data quality testing.

Special thanks go to the Roche team: Joao Antunes, Krzysztof Slowinski, Krzysztof Romanowski, Bartlomiej Zalewski, Wojciech Kostka, Patryk Szczesnowicz, Igor Tkaczyk, Kamil Piotrowski, Michalina Mastalerz, Jakub Lanski, Chun Wei Chan, Andrzej Dziabowski for their project delivery and support with this post.

About The Authors

Dr. Yannick Misteli, Roche – Dr. Yannick Misteli is leading cloud platform and ML engineering teams in global product strategy (GPS) at Roche. He is passionate about infrastructure and operationalizing data-driven solutions, and he has broad experience in driving business value creation through data analytics.

Simon Dimaline, AWS – Simon Dimaline has specialised in data warehousing and data modelling for more than 20 years. He currently works for the Data & Analytics team within AWS Professional Services, accelerating customers’ adoption of AWS analytics services.

Matt Noyce, AWS – Matt Noyce is a Senior Cloud Application Architect in Professional Services at Amazon Web Services. He works with customers to architect, design, automate, and build solutions on AWS for their business needs.

Chema Artal Banon, AWS – Chema Artal Banon is a Security Consultant at AWS Professional Services and he works with AWS’s customers to design, build, and optimize their security to drive business. He specializes in helping companies accelerate their journey to the AWS Cloud in the most secure manner possible by helping customers build the confidence and technical capability.

A special Thank You goes out to the following people whose expertise made this post possible from AWS:

Thiyagarajan Arumugam – Principal Analytics Specialist Solutions Architect
Taz Sayed – Analytics Tech Leader
Glenith Paletta – Enterprise Service Manager
Mike Murphy – Global Account Manager
Natacha Maheshe – Senior Product Marketing Manager
Derek Young – Senior Product Manager
Jamie Campbell – Amazon AppFlow Product Manager
Kamen Sharlandjiev – Senior Solutions Architect – Amazon AppFlow
Sunil Jethwani – Principal Customer Delivery Architect
Vinay Shukla – Amazon Redshift Principal Product Manager
Nausheen Sayed – Program Manager

AI Bullet Camera Announced!

2021-12-20 Crosstalk Solutions

Post Syndicated from Crosstalk Solutions original https://www.youtube.com/watch?v=-eFuIpAckOE

How Student Athletes Shattered the Amateurism Myth—The Experiment Podcast

2021-12-20 The Atlantic

Post Syndicated from The Atlantic original https://www.youtube.com/watch?v=nNEKtOCR7cE

Simplify setup of Amazon Detective with AWS Organizations

2021-12-20 Karthik Ram

Post Syndicated from Karthik Ram original https://aws.amazon.com/blogs/security/simplify-setup-of-amazon-detective-with-aws-organizations/

Amazon Detective makes it easy to analyze, investigate, and quickly identify the root cause of potential security issues or suspicious activities by collecting log data from your AWS resources. Amazon Detective simplifies the process of a deep dive into a security finding from other AWS security services, such as Amazon GuardDuty and AWS Security Hub. Detective uses machine learning, statistical analysis, and graph theory to build a linked set of data that enables customers to easily conduct faster and more efficient security investigations.

In this post you will learn about the new AWS Organizations integration with Amazon Detective, where new and existing Detective customers can delegate any account in their organization to be the delegated Detective administrator account, and can centrally manage the Detective behavior graph database for an organization with up to 1,200 accounts.

Customers tell us that they want to manage security findings and investigations across multiple AWS accounts. Depending on the customer, this can be hundreds or thousands of accounts. AWS Organizations integration with security services, including GuardDuty, Security Hub, and AWS IAM Access Analyzer, helps customers centralize management and governance of their environments as they scale and grow their AWS accounts and resources. Adding to the list, Detective is now integrated with AWS Organizations to simplify security posture management across all existing and future AWS accounts across an organization. Organizations integration is available in all AWS Regions that Detective supports.

Detective is aware of your existing delegated administrator accounts for other AWS security services such as GuardDuty or Security Hub. Using this awareness, Detective recommends that you choose the same account as the administrator account for Detective, as shown in Figure 1. For a more complete walkthrough of how to enable your accounts, visit the Amazon Detective documentation.

Figure 1. Setting delegated administrator

You can then use the same account to manage all of your security services. AWS recommends you align your Detective administrator account with your GuardDuty and Security Hub administrator accounts, to enable seamless integration between Detective and those services.

In GuardDuty or Security Hub, when viewing details for a GuardDuty finding, you can pivot from the finding details to the Detective finding profile.
In Detective, when investigating a GuardDuty finding, you can choose the option to archive that finding.

Once designated, the chosen account becomes the administrator account for the organization behavior graph. It can enable any organization account as a member account in the organization behavior graph, and can configure Detective to automatically enable organization accounts when they join the organization.

Figure 2. Auto-enabling Organization accounts

The Detective administrator account can also manually invite other accounts to join the organization behavior graph.

Figure 3. Inviting accounts to join the Organization behavior graph

From Detective, the administrator account can centrally conduct security investigations across the organization.

Considerations for AWS Organizations support

Some considerations and recommendations around Organizations support for Detective:

Detective allows up to 1,200 member accounts in each behavior graph.
The Detective administrator account becomes the administrator account for the organization’s behavior graph.
An account can be a member account of multiple behavior graphs in the same Region. An account can accept multiple invitations. An organization account can be enabled as a member account in the organization behavior graph, and can then also accept invitations to other behavior graphs.
An account can only be the administrator account of one behavior graph per Region, but can be an administrator account in different Regions.
Changes to an organization are not immediately reflected in Detective. For most changes, such as new and removed organization accounts, it can take up to an hour for Detective to be notified.
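When planning manual invitations, it can help to check the member limit up front and split the account list into batches. The sketch below is only a planning helper and doesn't call any AWS API. The 1,200 limit comes from the list above, and the batch size of 50 is an assumption that you should verify against the current Detective API limits.

```python
MAX_MEMBERS_PER_GRAPH = 1200  # limit stated in this post
BATCH_SIZE = 50               # assumed per-request batch size; verify before use


def plan_invitations(account_ids, existing_members=0, batch_size=BATCH_SIZE):
    """Deduplicate account IDs, enforce the graph limit, and split into batches."""
    unique_ids = list(dict.fromkeys(account_ids))  # keeps first-seen order
    if existing_members + len(unique_ids) > MAX_MEMBERS_PER_GRAPH:
        raise ValueError(
            f"behavior graph would exceed {MAX_MEMBERS_PER_GRAPH} member accounts"
        )
    return [
        unique_ids[start:start + batch_size]
        for start in range(0, len(unique_ids), batch_size)
    ]


if __name__ == "__main__":
    ids = [f"{number:012d}" for number in range(1, 121)]
    print([len(batch) for batch in plan_invitations(ids)])
```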

Other recent updates from Amazon Detective

Additional support for all GuardDuty findings

With the recent expansion of security investigation support for Amazon Simple Storage Service (S3) and DNS-related findings on Amazon GuardDuty, Amazon Detective now provides full coverage of all detections from GuardDuty. Security analysts can now investigate and analyze the root cause of all GuardDuty findings in Detective by choosing the Investigate in Detective option in GuardDuty and Security Hub.

New resource focused view

In addition to these integrations with AWS Organizations and GuardDuty, Detective now makes it even easier for a security analyst to investigate entities and behaviors using a revamped user experience as seen in Figure 4. Amazon Detective presents a unified view of user and resource interactions over time, with all the context and details in one place, to help you quickly analyze the root cause of a security finding.

Figure 4. New resource focused view

New finding overview

The new finding overview provides an expanded set of details for each finding, and provides links to the profiles for each involved entity as seen in the right panel in Figure 4. With this unified view, you can visualize all of the details and context in one place, while identifying the underlying reasons for the findings. This resource-focused view helps you understand the connections between resources affected by a security finding, and further helps you drill down into relevant historical activity to quickly determine the root cause.

Integration with Splunk

Amazon Detective, in coordination with the Splunk Trumpet project, has released the ability to pivot from an Amazon GuardDuty finding in Splunk directly to an Amazon Detective entity profile. Customers can now quickly identify the root cause of potential security issues or suspicious activities. This setting can be enabled on the Splunk Trumpet project installation page by selecting Detective GuardDuty URLs from the AWS CloudWatch Events dropdown.

Amazon Detective’s interactive visualizations make it easy to investigate and analyze issues more thoroughly and effectively, with minimal effort. Using these visualizations, you can easily filter large sets of event data into specific timelines, with all the details, context, and guidance needed to help you investigate quickly. For example, Amazon Detective enables you to view login attempts on a geolocation map, drill down into relevant historical activity, quickly determine a root cause, and if necessary, take action to resolve the issue.

Amazon Detective makes it easy to analyze, investigate, and quickly identify the root cause of potential security issues. To get started, enable a 30-day free trial of Amazon Detective with just a few clicks in the AWS Management Console. See the AWS Regions page for a list of all Regions where Detective is available. To learn more, visit the Amazon Detective product page.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security news? Follow us on Twitter.

New features from Apache Hudi 0.7.0 and 0.8.0 available on Amazon EMR

2021-12-20 Udit Mehrotra

Post Syndicated from Udit Mehrotra original https://aws.amazon.com/blogs/big-data/new-features-from-apache-hudi-0-7-0-and-0-8-0-available-on-amazon-emr/

Apache Hudi is an open-source transactional data lake framework that greatly simplifies incremental data processing and data pipeline development by providing record-level insert, update, and delete capabilities. This record-level capability is helpful if you’re building your data lakes on Amazon Simple Storage Service (Amazon S3) or Hadoop Distributed File System (HDFS). You can use it to comply with data privacy regulations and simplify data ingestion pipelines that deal with late-arriving or updated records from streaming data sources, or to ingest data using change data capture (CDC) from transactional systems. Apache Hudi is integrated with open-source big data analytics frameworks like Apache Spark, Apache Hive, Presto, and Trino. It allows you to maintain data in Amazon S3 or HDFS in open formats like Apache Parquet and Apache Avro.

Starting with release version 5.28.0, Amazon EMR installs the Hudi component by default when you install Spark, Hive, Presto, or Trino. Since the inclusion of Apache Hudi within Amazon EMR, several improvements and bug fixes have been added to Apache Hudi. Apache Hudi graduated as a top-level Apache project in June 2020.

In this post, we provide a summary of some of the key new features and capabilities introduced in Apache Hudi release versions 0.7.0 and 0.8.0. These new features and capabilities of Hudi are available starting with Amazon EMR releases 5.33.0 and 6.3.0:

Clustering
Metadata-based file listing
Amazon CloudWatch integration
Optimistic Concurrency Control
Amazon EMR configuration support and improvements
Apache Flink integration
Kafka commit callbacks
Other improvements

Clustering

We see more use cases that need high throughput ingestion to data lakes. However, faster data ingestion often leads to smaller data files, which adversely affects query performance, because a large number of small files increases the costly I/O operations required to return results. Another concern that we see is that the organization of data during ingestion is different from the organization that would be most efficient when querying the data. For example, it’s convenient to ingest ecommerce orders by OrderDate as they come in, but when queried, it’s better if orders for a single customer are stored together.

Apache Hudi version 0.7.0 introduces a new feature that allows you to cluster the Hudi tables. Clustering in Hudi is a framework that provides a pluggable strategy to change and reorganize the data layout while also optimizing the file sizes. With clustering, you can now optimize query performance without having to trade off data ingest throughput.

You can use clustering to rewrite the data using different methods as per the different use case requirements:

Improve query performance with data locality – This changes the data layout on disk by sorting the data on one or many user-specified columns. With this approach, we can improve query performance by using the Parquet file format’s ability to perform predicate push-down and skip the unwanted files and Parquet row groups. This strategy can also control the file size to avoid small files.
Improve data freshness – This requirement assumes that the data locality is not important or taken care of already at the time of ingestion. It’s ideal for use cases where fresh data is important, where data is ingested using several small files and stitched or merged later using the clustering framework.
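To see why sorting helps with file skipping, consider the following toy simulation. It is not Hudi code: it splits rows into fixed-size "files", records each file's minimum and maximum star_rating (similar in spirit to the column statistics that Parquet keeps), and counts how many files a query with the predicate star_rating > 3 can't skip, with and without sorting first. The row count and file size are arbitrary.

```python
import random


def write_files(rows, rows_per_file):
    """Split rows into fixed-size files and record min/max star_rating per file."""
    files = []
    for start in range(0, len(rows), rows_per_file):
        chunk = rows[start:start + rows_per_file]
        ratings = [row["star_rating"] for row in chunk]
        files.append({"min": min(ratings), "max": max(ratings), "rows": chunk})
    return files


def files_to_scan(files, min_rating):
    """Count files whose statistics cannot rule out star_rating > min_rating."""
    return sum(1 for file in files if file["max"] > min_rating)


if __name__ == "__main__":
    random.seed(0)
    rows = [{"star_rating": random.randint(1, 5)} for _ in range(10_000)]
    unclustered = write_files(rows, 100)
    clustered = write_files(sorted(rows, key=lambda row: row["star_rating"]), 100)
    print("unclustered files to scan:", files_to_scan(unclustered, 3))
    print("clustered files to scan:  ", files_to_scan(clustered, 3))
```

In the unsorted layout, almost every file contains at least one matching row, so almost nothing can be skipped. After sorting, matching rows are concentrated in a contiguous subset of files.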

You can run the clustering table service asynchronously or synchronously. It also introduces the new action type REPLACE, which identifies the clustering action in the Hudi metadata timeline.

In the following example, we create two Copy on Write (CoW) Hudi tables: amazon_reviews and amazon_reviews_clustered using Amazon EMR release version 6.3.0.

We use spark-shell to create the Hudi tables. Start the Spark shell by running the following on the Amazon EMR primary node:

spark-shell --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" --conf "spark.sql.hive.convertMetastoreParquet=false" --jars /usr/lib/hudi/hudi-spark-bundle.jar

We then create the Hudi table amazon_reviews using the BULK_INSERT operation and without clustering enabled:

import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.HoodieDataSourceHelpers
import org.apache.spark.sql.SaveMode

val srcPath = "s3://amazon-reviews-pds/parquet/"
val tableName = "amazon_reviews"
val tablePath = "s3://emr-hudi-test-data/hudi/hudi_080/" + tableName

val inputDF = spark.read.format("parquet").load(srcPath)

inputDF.write.format("hudi")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "review_id")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "product_category")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "review_date")
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, "hudi_test")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "product_category")
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Overwrite)
  .save(tablePath)

We then create the Hudi table amazon_reviews_clustered using BULK_INSERT operation and inline clustering enabled and sorted by columns star_rating and total_votes:

import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.config.HoodieClusteringConfig
import org.apache.hudi.HoodieDataSourceHelpers
import org.apache.spark.sql.SaveMode

val srcPath = "s3://amazon-reviews-pds/parquet/"
val tableName = "amazon_reviews_clustered"
val tablePath = "s3://emr-hudi-test-data/hudi/hudi_080/" + tableName

val inputDF = spark.read.format("parquet").load(srcPath)

inputDF.write
  .format("hudi")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "review_id")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "product_category")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "review_date")
  .option(HoodieClusteringConfig.INLINE_CLUSTERING_PROP, "true")
  .option(HoodieClusteringConfig.INLINE_CLUSTERING_MAX_COMMIT_PROP, "0")
  .option(HoodieClusteringConfig.CLUSTERING_TARGET_PARTITIONS, "43")
  .option(HoodieClusteringConfig.CLUSTERING_MAX_NUM_GROUPS, "100")
  .option(HoodieClusteringConfig.CLUSTERING_SORT_COLUMNS_PROPERTY, "star_rating,total_votes")
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, "hudi_test")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "product_category")
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Overwrite)
  .save(tablePath)

Let’s query these two tables and validate the performance difference. To do so, we use the Spark SQL CLI, a convenient tool to run the Hive metastore service in local mode and execute queries input from the command line. To start the Spark SQL CLI, we execute the following command:

spark-sql --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" --conf "spark.hadoop.mapreduce.input.pathFilter.class=org.apache.hudi.hadoop.HoodieROTablePathFilter" --jars /usr/lib/hudi/hudi-spark-bundle.jar

We restart the Spark SQL CLI (spark-sql) session between each run in order to avoid caching or warm executors, which may impact query performance.

Let’s run the query against the non-clustered Hudi table by running the following in the spark-sql interface:

spark-sql> USE hudi_test;
spark-sql> select review_id from amazon_reviews where star_rating > 3 and total_votes > 10;

Let’s also run the same query on our clustered table from the spark-sql interface:

spark-sql> USE hudi_test;
spark-sql> select review_id from amazon_reviews_clustered where star_rating > 3 and total_votes > 10;

Let’s compare the underlying file scan performance for the two different Hudi tables. The following screenshot is the output from the Spark UI, which shows the changes in the files scanned for the same number of output rows. First we see the files scanned for the unclustered Hudi table.

Next, we see the files scanned for the clustered Hudi table.

The number of files scanned by Spark dropped from 1,542 files for the unclustered Hudi dataset to 85 files for the clustered Hudi dataset for the exact same data. Also, the number of records scanned reduced from 160,796,570 to 78,845,795.

We compared the performance of the preceding query for the amazon_reviews (non-clustered) and amazon_reviews_clustered (clustered) Hudi dataset, across Spark SQL, Hive, and PrestoDB. The cluster configuration used was 1 leader (m5.4xlarge) and 2 cores (m5.4xlarge).

The following chart compares query performance across the different engines for the unclustered and the clustered Hudi tables.

We found that with clustering enabled for the Hudi table, the query performance increased for all three query engines, ranging from 28% to 63%. The following table provides the details for the query performance for the Hudi table, both with clustering enabled and disabled.

Query Engine	Non-clustered Table Time (seconds)	Clustered Table Time (seconds)	Query Runtime Improvement
Spark SQL	21.6	15.4	28.7%
Hive	96.3	47	51.3%
PrestoDB	11.7	4.3	63.25%

Metadata-based file listing

Hudi write operations like compaction, cleaning, and global index, as well as queries, perform a file system listing to get the current view of the partitions and files in the dataset. For small datasets, this shouldn’t impact the performance drastically. However, when working with large data, this listing operation can impact the performance negatively when reading the files. For example, with HDFS as the underlying data store, the list operation for a large number of files or partitions can overwhelm the HDFS NameNode and affect the stability of the job. In cases where Amazon S3 is used as the underlying data store, making O(N) calls for N partitions with a large number of files is time-consuming and can also result in throttling errors.

With Apache Hudi version 0.7.0, you can change this behavior by enabling metadata-based listing for Hudi tables. The list of partitions and files is stored in an internal metadata table, which is implemented as a Hudi Merge on Read (MoR) table. This metadata table gets all the advantages of a Hudi MoR table, including low-latency updates and the ability to atomically commit metadata updates and easily roll back if a write fails. It also makes it easy to keep metadata in sync with the Hudi table because both use a timeline for traceability. The file list index is stored using HFiles for the base files and the log file format for delta updates. The HFile format allows point lookups of specific records based on record key. The goal is to reduce O(N) list calls for N partitions to an O(1) get call to read the metadata.
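The difference in call counts can be illustrated with a small simulation. This is a conceptual sketch, not how Hudi implements its metadata table: FakeStore counts list calls against an in-memory stand-in for the data store, and MetadataIndex is a hypothetical class that keeps the partition-to-files mapping current as commits happen, so the full file view is available from a single read.

```python
class FakeStore:
    """In-memory stand-in for a data store that counts list calls."""

    def __init__(self, partitions):
        self.partitions = partitions  # partition name -> list of file names
        self.list_calls = 0

    def list_partition(self, partition):
        self.list_calls += 1
        return list(self.partitions[partition])


def files_by_listing(store):
    """One list call per partition: O(N) calls for N partitions."""
    return {name: store.list_partition(name) for name in store.partitions}


class MetadataIndex:
    """Tracks partition -> files as commits are applied."""

    def __init__(self):
        self.files = {}
        self.reads = 0

    def commit(self, partition, added=(), removed=()):
        current = self.files.setdefault(partition, [])
        current.extend(added)
        for name in removed:
            current.remove(name)

    def all_files(self):
        """A single read returns the whole view, regardless of partition count."""
        self.reads += 1
        return {name: list(files) for name, files in self.files.items()}


if __name__ == "__main__":
    layout = {f"2021-{day:03d}": [f"f{day}.parquet"] for day in range(365)}
    store = FakeStore(layout)
    index = MetadataIndex()
    for partition, names in layout.items():
        index.commit(partition, added=names)
    assert files_by_listing(store) == index.all_files()
    print("list calls:", store.list_calls, "metadata reads:", index.reads)
```

The trade-off is that every write must also update the index, which is why atomic commits and rollback on the metadata table matter.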

We compared query performance for a Hudi dataset with metadata listing enabled vs. not enabled. For this example, we used a larger dataset of 3 TB with Amazon EMR release version 6.3.0. We used the following code snippet to create the metadata enabled and not enabled dataset by setting the HoodieMetadataConfig.METADATA_ENABLE_PROP (hoodie.metadata.enable) config:

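import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.common.config.HoodieMetadataConfig
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

val srcPath = "s3://gbrahmi-demo/3-tb-data_store_sales-parquet/"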
val tableName = "tpcds_store_sales_3TB_hudi_080"
val tablePath = "s3://emr-hudi-test-data/hudi/hudi_080/" + tableName

val inputDF = spark.read.format("parquet").load(srcPath)

inputDF.write
  .format("hudi")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(HoodieMetadataConfig.METADATA_ENABLE_PROP, "true")
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "ss_item_sk,ss_ticket_number")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "ss_sold_date_sk")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "ss_ticket_number")
  .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.ComplexKeyGenerator")
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, "hudi_test")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "ss_sold_date_sk")
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Overwrite)
  .save(tablePath)

On the query engine side, we can enable it via the following methods:

Spark data source:

spark.read.format("hudi")
  .option(HoodieMetadataConfig.METADATA_ENABLE_PROP, "true")
  .load(tablePath + "/*")

Spark SQL CLI:

spark-sql --conf "spark.hadoop.hoodie.metadata.enable=true" \
  --jars /usr/lib/hudi/hudi-spark-bundle.jar

Hive:

hive> SET hoodie.metadata.enable = true;

PrestoDB:

presto:default> set session hive.prefer_metadata_to_list_hudi_files=true;

We used the following query to compare query performance via Hive and PrestoDB:

select count(*) from tpcds_store_sales_3TB_hudi_080 where ss_quantity > 50;

The following chart provides the query performance comparison.

We found that with metadata listing, query execution runtime decreased by around 25% for the Hive engine, and by around 32% for PrestoDB. The following table provides the details of query execution runtime with and without metadata listing.

Query Engine	Metadata Disabled: Time (in seconds)	Metadata Enabled: Time (in seconds)	Query Runtime Improvement
Hive	415.28533	310.02367	25.35%
Presto	72	48.6	32.50%
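
The improvement column can be reproduced from the two runtime columns. The following Python sketch shows the arithmetic; the function name runtime_improvement is ours and is not part of any Hudi or EMR tooling.

```python
def runtime_improvement(disabled_s: float, enabled_s: float) -> float:
    """Percentage reduction in runtime when metadata listing is enabled."""
    return (disabled_s - enabled_s) / disabled_s * 100.0


# Runtimes in seconds, taken from the table above.
hive = runtime_improvement(415.28533, 310.02367)
presto = runtime_improvement(72.0, 48.6)
print(f"Hive: {hive:.2f}%  Presto: {presto:.2f}%")
```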

Metadata listing considerations

With Hudi 0.7.0 and 0.8.0, you may not observe noticeable improvements from metadata listing for queries via Spark SQL, because Hudi relies on Spark’s InMemoryFileIndex to do the actual file listing and can’t use the metadata there. You may still observe some improvement, because HoodieROPathFilter uses the metadata for its filtering. However, with Hudi release 0.9.0, we’re introducing a custom FileIndex implementation that uses the metadata for file listing instead of relying on Spark. Therefore, from 0.9.0 onward, you should observe noticeable performance improvements for Spark SQL queries.

Amazon CloudWatch integration

Apache Hudi provides MetricsReporter implementations like JmxMetricsReporter, MetricsGraphiteReporter, and DatadogMetricsReporter, which you can use to publish metrics to user-specified sinks. Amazon EMR release 6.4.0, which includes Hudi 0.8.0, introduces CloudWatchMetricsReporter, which you can use to publish these metrics to Amazon CloudWatch. It publishes Hudi writer metrics such as commit duration, rollback duration, file-level metrics (number of files added or deleted per commit), record-level metrics (records inserted or updated per commit), and partition-level metrics (partitions inserted or updated per commit). These metrics are useful for debugging Hudi jobs and for making decisions about cluster scaling.

You can enable CloudWatch metrics via the following configurations:

hoodie.metrics.on = true
hoodie.metrics.reporter.type = CLOUDWATCH

The following table summarizes additional configurations that you can change if needed.

Configuration	Description	Value
hoodie.metrics.cloudwatch.report.period.seconds	Frequency (in seconds) at which to report metrics to CloudWatch	Default value is 60 seconds, which is fine for the default 1-minute resolution offered by CloudWatch
hoodie.metrics.cloudwatch.metric.prefix	Prefix to be added to each metric name	Default value is empty (no prefix)
hoodie.metrics.cloudwatch.namespace	CloudWatch namespace under which metrics are published	Default value is `Hudi`
hoodie.metrics.cloudwatch.maxDatumsPerRequest	Maximum number of datums to be included in one request to CloudWatch	Default value is 20, which is the same as the CloudWatch default

The following screenshot shows some of the metrics published for a particular Hudi table, including the type of metric and its name. These are Dropwizard metrics: a gauge represents the exact value at a point in time, and a counter represents a simple incrementing or decrementing integer.

The following graph of the gauge metric represents the total records written to a table over time.

The following graph of the counter metric represents the number of commits increasing over time.

Optimistic Concurrency Control

A major feature introduced in Hudi 0.8.0, and available since Amazon EMR release 6.4.0, is Optimistic Concurrency Control (OCC), which enables multiple writers to concurrently ingest data into the same Hudi table. This is file-level OCC, which means that when two commits (or writers) write to the same table at the same time, both are allowed to succeed as long as they don’t write to overlapping files. The feature requires acquiring locks, for which you can use either Zookeeper or HiveMetastore. For more information about the guarantees provided, see Concurrency Control.
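
The file-level rule can be pictured as a set-overlap check. The Python sketch below is a toy model of that rule under our own naming (has_conflict and the file paths are hypothetical); it is not Hudi’s conflict-resolution code and ignores locking and timeline details.

```python
from typing import Iterable


def has_conflict(files_a: Iterable[str], files_b: Iterable[str]) -> bool:
    """Two concurrent commits conflict only if they touch a common file."""
    return not set(files_a).isdisjoint(files_b)


# Hypothetical file groups written by three concurrent writers.
writer_1 = {"2021/05/01/file-001", "2021/05/01/file-002"}
writer_2 = {"2021/05/02/file-003"}
writer_3 = {"2021/05/01/file-002", "2021/05/03/file-004"}

print(has_conflict(writer_1, writer_2))  # no shared file: both may succeed
print(has_conflict(writer_1, writer_3))  # shared file: one writer must fail
```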

Amazon EMR clusters have Zookeeper installed, which you can use as a lock provider to perform concurrent writes from the same cluster. To make it easier to use, Amazon EMR preconfigures the lock provider in the newly introduced /etc/hudi/conf/hudi-defaults.conf file (see the next section) via the following properties:

hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
hoodie.write.lock.zookeeper.url=<EMR Zookeeper URL>
hoodie.write.lock.zookeeper.port=<EMR Zookeeper Port>
hoodie.write.lock.zookeeper.base_path=/hudi

Although the lock provider is preconfigured, you still need to enable OCC yourself, either via Hudi job options or at the cluster level via the Amazon EMR Configurations API:

hoodie.write.concurrency.mode = optimistic_concurrency_control
# Performs cleaning of failed writes lazily instead of inline with every write
hoodie.cleaner.policy.failed.writes = LAZY
# The table name is a good option for the lock key
hoodie.write.lock.zookeeper.lock_key = <Key to uniquely identify the Hudi table>

Amazon EMR configuration support and improvements

Amazon EMR release 6.4.0 has introduced the ability to configure and reconfigure Hudi via the configurations feature. Hudi configurations that are needed across jobs and tables can now be configured at cluster level via the hudi-defaults classification or /etc/hudi/conf/hudi-defaults.conf file, similar to other applications like Spark and Hive. The following code is an example of the hudi-defaults classification to enable metadata-based listing and CloudWatch metrics:

[{
  "Classification": "hudi-defaults",
  "Properties": {
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.insert.parallelism": "3000",
    "hoodie.metrics.on": "true",
    "hoodie.metrics.reporter.type": "CLOUDWATCH"
  }
}]
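
If you generate cluster configurations programmatically, the same classification can be built from a plain dictionary. The following Python sketch assumes a helper of our own, hudi_defaults_classification, and converts every value to a string because classification properties are string-valued; it only produces the JSON and does not call any AWS API.

```python
import json
from typing import Any, Dict, List


def hudi_defaults_classification(properties: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Build a hudi-defaults classification with string-valued properties."""

    def as_text(value: Any) -> str:
        # JSON-style booleans ("true"/"false") rather than Python's "True"/"False".
        if isinstance(value, bool):
            return "true" if value else "false"
        return str(value)

    return [{
        "Classification": "hudi-defaults",
        "Properties": {key: as_text(value) for key, value in properties.items()},
    }]


config = hudi_defaults_classification({
    "hoodie.metadata.enable": True,
    "hoodie.metadata.insert.parallelism": 3000,
    "hoodie.metrics.on": True,
    "hoodie.metrics.reporter.type": "CLOUDWATCH",
})
print(json.dumps(config, indent=2))
```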

Amazon EMR automatically configures suitable defaults for a few configs, so that you no longer need to pass them yourself:

HIVE_URL_OPT_KEY is configured to the cluster’s Hive server URL and no longer needs to be specified. This is particularly useful when running a job in Spark cluster mode, where users previously had to determine and specify the Amazon EMR primary IP themselves.
HBase-specific configurations, which are useful when using the HBase index with Hudi, are preconfigured.
Zookeeper lock provider configurations, as discussed under concurrency control, are preconfigured, which makes it easier to use OCC.

Additional changes have been introduced to reduce the number of configurations that users need to pass, and to infer automatically where possible:

The partitionBy API can now be used to specify the partition column.
When enabling Hive Sync, it’s no longer mandatory to pass HIVE_TABLE_OPT_KEY, HIVE_PARTITION_FIELDS_OPT_KEY, or HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY. These configs can be inferred from the Hudi table name and partition fields themselves.
KEYGENERATOR_CLASS_OPT_KEY is no longer mandatory if you’re using SimpleKeyGenerator or ComplexKeyGenerator; the key generator is inferred depending on whether there are single or multiple record key columns.
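
The key generator rule in the last item can be written down in a few lines. This Python sketch only mirrors the rule as described above; infer_key_generator is a hypothetical helper, not the code Amazon EMR or Hudi runs.

```python
def infer_key_generator(record_key_fields: str) -> str:
    """Pick a key generator class from a comma-separated record key setting."""
    fields = [name.strip() for name in record_key_fields.split(",") if name.strip()]
    if not fields:
        raise ValueError("at least one record key column is required")
    if len(fields) == 1:
        # A single record key column maps to the simple generator.
        return "org.apache.hudi.keygen.SimpleKeyGenerator"
    # Multiple record key columns map to the complex generator.
    return "org.apache.hudi.keygen.ComplexKeyGenerator"


print(infer_key_generator("uuid"))
print(infer_key_generator("ss_item_sk,ss_ticket_number"))
```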

Apache Flink integration

Apache Hudi started off with a very tight integration with Apache Spark. With release version 0.7.0, we now have integrations available to ingest data using Apache Flink. This required decoupling Spark from the internal table format, writers, and table services code so that they can be used by other evolving engines in the industry, like Flink.

Hudi 0.7.0 provides initial Flink support via HoodieFlinkStreamer, which you can use to write CoW tables by streaming data from a Kafka topic using Apache Flink. For example, you can use the following Flink command to start reading the topic ExampleTopic from the Kafka brokers broker-1, broker-2, and broker-3 running on port 9092:

./bin/flink run -c org.apache.hudi.HoodieFlinkStreamer \
  -m yarn-cluster -d -yjm 1024 -ytm 1024 -p 4 -ys 3 \
  -ynm hudi_on_flink_example \
  /usr/lib/hudi/hudi-flink-bundle.jar \
  --kafka-topic ExampleTopic \
  --kafka-group-id <kafka-group-id> \
  --kafka-bootstrap-servers broker-1:9092,broker-2:9092,broker-3:9092 \
  --table-type COPY_ON_WRITE \
  --target-table hudi_flink_table \
  --target-base-path s3://emr-hudi-test-data/hudi/hudi_070/hudi_flink_table \
  --props hdfs:///hudi/flink/config/hudi-jobConf.properties \
  --checkpoint-interval 6000 \
  --flink-checkpoint-path hdfs:///hudi/hudi-flink-checkpoint-dir

With Hudi 0.8.0, there have been major improvements in Flink integration performance and scalability, as well as the introduction of new features like SQL connector for both source and sink, writer for MoR, batch reader for CoW and MoR, streaming reader for MoR, and state-backed indexing with bootstrap support. For more information about Flink integration design, see Apache Hudi meets Apache Flink. To get started with Flink SQL, see Flink Guide.

Kafka commit callbacks

The previous version (0.6.0) of Apache Hudi introduced write commit callback functionality. With this functionality, Hudi can send a callback message every time a commit to the Hudi dataset succeeds. In the previous release, the write commit callback supported only the HTTP method. With Apache Hudi release version 0.7.0, Hudi supports write commit callbacks for Kafka as well. Sending a callback message to Kafka for every successful commit enables you to build asynchronous data pipelines or business processing logic that runs every time the Hudi dataset sees a new commit. For example, you can build incremental ETL pipelines that process new events as they arrive in the Hudi data lake.

The implementation of Kafka commit callback uses HoodieWriteCommitKafkaCallback as the hoodie.write.commit.callback.class. Besides setting the commit callback class, you can also set up additional parameters for the Kafka bootstrap servers and the topic configurations.

The following is a code snippet where commit callback messages are published to the Kafka topic ExampleTopic hosted on the Kafka brokers b-1.demo-hudi.xxxxxx.xxx.kafka.us-east-1.amazonaws.com, b-2.demo-hudi.xxxxxx.xxx.kafka.us-east-1.amazonaws.com, and b-3.demo-hudi.xxxxxx.xxx.kafka.us-east-1.amazonaws.com when writing to a Hudi dataset:

import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
 
val tableName = "trips_data_kafka_callback"
val tablePath = "s3://gbrahmi-sample-bucket/hudi-dataset/hudi_kafka_callback/" + tableName
 
val dataGen = new DataGenerator(Array("2021/05/01"))
val updates = convertToStringList(dataGen.generateInserts(10))
 
val df = spark.read.json(spark.sparkContext.parallelize(updates, 1))
 
df.write.format("hudi").
  option(TABLE_NAME, tableName).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option("hoodie.write.commit.callback.on", "true").
  option("hoodie.write.commit.callback.class", "org.apache.hudi.utilities.callback.kafka.HoodieWriteCommitKafkaCallback").
  option("hoodie.write.commit.callback.kafka.bootstrap.servers", "b-1.demo-hudi.xxxxxx.xxx.kafka.us-east-1.amazonaws.com:9092,b-2.demo-hudi.xxxxxx.xxx.kafka.us-east-1.amazonaws.com:9092,b-3.demo-hudi.xxxxxx.xxx.kafka.us-east-1.amazonaws.com:9092").
  option("hoodie.write.commit.callback.kafka.topic", "ExampleTopic").
  option("hoodie.write.commit.callback.kafka.acks", "all").
  option("hoodie.write.commit.callback.kafka.retries", 3).
  mode(Append).
  save(tablePath)

The following is how the messages appear in your Kafka topic:

{"commitTime":"20210508210230","tableName":"trips_data_kafka_callback","basePath":"s3://gbrahmi-sample-bucket/hudi-dataset/hudi_kafka_callback/trips_data_kafka_callback"}

A downstream pipeline can now easily query these events from Kafka and process the incremental data into derived Hudi tables.
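
As a rough sketch of what such a downstream consumer might do with each message, the following Python snippet parses the callback payload shown above. The function name parse_commit_callback is ours, and the snippet assumes that commitTime is a 14-digit yyyyMMddHHmmss instant, as in the sample message. Reading from Kafka itself is left out.

```python
import json
from datetime import datetime
from typing import Any, Dict

# Sample payload copied from the callback message shown above.
SAMPLE_MESSAGE = (
    '{"commitTime":"20210508210230",'
    '"tableName":"trips_data_kafka_callback",'
    '"basePath":"s3://gbrahmi-sample-bucket/hudi-dataset/'
    'hudi_kafka_callback/trips_data_kafka_callback"}'
)


def parse_commit_callback(raw: str) -> Dict[str, Any]:
    """Extract the fields a downstream incremental job would need."""
    message = json.loads(raw)
    return {
        "table": message["tableName"],
        "base_path": message["basePath"],
        # Raw instant string, usable as the starting point of an incremental pull.
        "commit_time": message["commitTime"],
        # Assumption: the instant uses the yyyyMMddHHmmss layout.
        "committed_at": datetime.strptime(message["commitTime"], "%Y%m%d%H%M%S"),
    }


event = parse_commit_callback(SAMPLE_MESSAGE)
print(event["table"], event["committed_at"].isoformat())
```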

Other improvements

Besides the aforementioned improvements, there have been some additional changes worth mentioning. On the writer side, there are the following improvements:

Support for Spark 3 – Support for writing and querying data using Apache Spark 3 is available from Apache Hudi 0.7.0 onwards. This works with the Scala 2.12 bundle for hudi-spark-bundle.
Insert overwrite and insert overwrite table write operations – Apache Hudi 0.7.0 introduces two new operations, insert_overwrite and insert_overwrite_table, to support batch ETL jobs where an entire table or partition is overwritten during each execution. You can use these operations instead of the upsert operation, and they’re much cheaper to run.
Delete partitions – A new API, available since 0.7.0, deletes an entire partition. This helps avoid the use of record-level deletes.
Java writer support – Hudi 0.7.0 introduced Java-based writing support via the HoodieJavaWriteClient class.

Similarly, on the query integration side, there have been the following improvements:

Structured streaming reads – Hudi 0.8.0 introduced a Spark structured streaming source implementation via the HoodieStreamSource class. You can use it to support streaming reads from Hudi tables.
Incremental query on MoR – Since Hudi 0.7.0, incremental queries are supported for MoR tables, which downstream applications can use to incrementally pull data.

Conclusion

The new features introduced in Apache Hudi enable you to build decoupled solutions on Amazon EMR by using features like Kafka commit callbacks and Flink integration. You can also improve the overall performance of your Hudi data lake by using the capabilities of clustering and metadata tables.

About the Authors

Udit Mehrotra is a software development engineer at Amazon Web Services and an Apache Hudi PMC member/committer. He works on cutting-edge features of Amazon EMR and is also involved in open-source projects such as Apache Hudi, Apache Spark, Apache Hadoop, and Apache Hive. In his spare time, he likes to play guitar, travel, binge watch, and hang out with friends.

Gagan Brahmi is a Specialist Solutions Architect focused on Big Data & Analytics at Amazon Web Services. Gagan has over 16 years of experience in information technology. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS.

Der Standard newspaper (Austria): „Без Виктория“ (“Without Viktoria”)

2021-12-20 Николай Марченко

Post Syndicated from Николай Марченко original https://bivol.bg/%D0%B1%D0%B5%D0%B7-%D0%B2%D0%B8%D0%BA%D1%82%D0%BE%D1%80%D0%B8%D1%8F-%D1%80%D0%B0%D0%B7%D0%BA%D0%B0%D0%B7%D1%8A%D1%82-%D0%BD%D0%B0-%D0%B4%D0%B8%D0%BC%D0%B8%D1%82%D1%8A%D1%80-%D0%B4.html

Monday, 20 December 2021

The investigative journalism site “Биволъ” (Bivol) has published a translation of an article* from Der Standard, one of the most authoritative Austrian and European newspapers, devoted to the forthcoming release of a photo book by Димитър Динев and Евгения…

Query cross-account AWS Glue Data Catalogs using Amazon Athena

2021-12-20 Louis Hourcade

Post Syndicated from Louis Hourcade original https://aws.amazon.com/blogs/big-data/query-cross-account-aws-glue-data-catalogs-using-amazon-athena/

Many AWS customers rely on a multi-account strategy to scale their organization and better manage their data lake across different projects or lines of business. The AWS Glue Data Catalog contains references to data used as sources and targets of your extract, transform, and load (ETL) jobs in AWS Glue. Using a centralized Data Catalog offers organizations a unified metadata repository and minimizes the administrative overhead related to sharing data across different accounts, thereby expanding access to the data lake.

Amazon Athena is one of the popular choices to run analytical queries in data lakes. This interactive query service makes it easy to analyze data in Amazon Simple Storage Service (Amazon S3) using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you’re charged based on the amount of data scanned by your queries.

In May 2021, Athena introduced the ability to query Data Catalogs across multiple AWS accounts, enabling you to access your data lake without the complexity of replicating catalog metadata in individual AWS accounts. This blog post details the procedure for using the feature.

Solution overview

The following diagram shows the necessary components used in two different accounts (consumer account and producer account, hosting a central Data Catalog) and the flow between the two for cross-account Data Catalog access using Athena.

Our use case showcases Data Catalog sharing between two accounts:

Producer account – The account that administrates the central Data Catalog
Consumer account – The account querying data from the producer’s Data Catalog (the central Data Catalog)

In this walkthrough, we use the following two tables, extracted from an ecommerce dataset:

The orders table logs the website’s orders and contains the following key attributes:
- Row ID – Unique entry identifier in the orders table
- Order ID – Unique order identifier
- Order date – Date the order was placed
- Profit – Profit value of the order
The returns table logs the returned items and contains the following attributes:
- Returned – If the order has been returned (Yes/No)
- Order ID – Unique order identifier
- Market – Region market

We walk you through the following high-level steps to use this solution:

Set up the producer account.
Set up the consumer account.
Set up permissions.
Register the producer account’s Data Catalog.
Query your data.

You use Athena in the consumer account to perform different operations using the producer account’s Data Catalog.

First, you use the consumer account to query the orders table in the producer account’s Data Catalog.

Next, you use the consumer account to join the two tables and retrieve information about lost profit from returned items. The returns table is in the consumer’s Data Catalog, and the orders table is in the producer’s.

Prerequisites

The following are the prerequisites for this walkthrough:

Two AWS accounts.
An AWS Identity and Access Management (IAM) principal with access to AWS resources used in this solution.
An Athena workgroup that runs Athena engine version 2. Querying Data Catalogs across accounts only works with this engine version. To check whether your Athena workgroup is running on this engine, select the Workgroup tab on the left of the Athena console.

This lists all your Athena workgroups. Make sure that the one you use runs on Athena engine version 2.

If all your workgroups are using Athena engine version 1, you need to update the engine version of an existing workgroup or create a new workgroup with the appropriate version.

Set up the producer account

In the producer account, complete the following steps:

Create an S3 bucket for your producer’s data. For information about how to secure your S3 bucket, see Security Best Practices for Amazon S3.
In this bucket, create a prefix named orders.
Download the orders table in CSV format and upload it to the orders prefix.
Run the following Athena query to create the producer’s database:

CREATE DATABASE producer_database
  COMMENT 'Producer data'

Run the following Athena query to create the orders table in the producer’s database. Make sure to replace <your-producer-s3-bucket-name> with the name of the bucket you created.

CREATE EXTERNAL TABLE producer_database.orders(
  `row id` bigint, 
  `order id` string, 
  `order date` string, 
  `ship date` string, 
  `ship mode` string, 
  `customer id` string, 
  `customer name` string, 
  `segment` string, 
  `city` string, 
  `state` string, 
  `country` string, 
  `postal code` bigint, 
  `market` string, 
  `region` string, 
  `product id` string, 
  `category` string, 
  `sub-category` string, 
  `product name` string, 
  `sales` string, 
  `quantity` bigint, 
  `discount` string, 
  `profit` string, 
  `shipping cost` string, 
  `order priority` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\;'
LOCATION
  's3://<your-producer-s3-bucket-name>/orders/'
TBLPROPERTIES (
  'skip.header.line.count'='1'
)

Set up the consumer account

In the consumer account, complete the following steps:

Create an S3 bucket for your consumer’s data.
In this bucket, create a prefix named returns.
Download the returns table in CSV format and upload it to the returns prefix.
Run the following Athena query to create the consumer’s database:

CREATE DATABASE consumer_database
COMMENT 'Consumer data'

Run the following Athena query to create the returns table in the consumer’s database. Make sure to replace <your-consumer-s3-bucket-name> with the name of the bucket you created.

CREATE EXTERNAL TABLE consumer_database.returns(
  `returned` string, 
  `order id` string, 
  `market` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\;' 
LOCATION
  's3://<your-consumer-s3-bucket-name>/returns/'
TBLPROPERTIES (
  'skip.header.line.count'='1'
)

Set up permissions

For the consumer account to query data in the producer account, we need to set up permissions.

First, we give the consumer account permission to access the producer account’s AWS Glue resources.

In the producer account’s Data Catalog settings, add the following AWS Glue resource policy, which grants the consumer account access to the Data Catalog:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<Consumer-account-id>:role/<role-in-consumer-account>"
            },
            "Action": [
                "glue:GetDatabases",
                "glue:GetTables"
            ],
            "Resource": [
                "arn:aws:glue:<Region>:<Producer-account-id>:catalog",
                "arn:aws:glue:<Region>:<Producer-account-id>:database/producer_database",
                "arn:aws:glue:<Region>:<Producer-account-id>:table/producer_database/orders"
            ]
        }
    ]
}

Next, we give the consumer account permission to list and get data from the S3 bucket in the producer account.

In the producer account, add the following S3 bucket policy to the bucket <Producer-bucket>, which stores the data:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<Consumer-account-id>:role/<role-in-consumer-account>"
            },
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<Producer-bucket>",
                "arn:aws:s3:::<Producer-bucket>/orders/*"
            ]
        }
    ]
}
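
If you manage several producer buckets, you may prefer to render this policy from a template instead of editing placeholders by hand. The following Python sketch builds the same document; producer_bucket_policy, the account ID 111122223333, the role name, and the bucket name are all hypothetical placeholders. The sketch only produces JSON and does not attach the policy.

```python
import json
from typing import Any, Dict


def producer_bucket_policy(consumer_account_id: str, consumer_role: str,
                           bucket: str, prefix: str = "orders") -> Dict[str, Any]:
    """Render the cross-account read policy for the producer's bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "AWS": f"arn:aws:iam::{consumer_account_id}:role/{consumer_role}"
            },
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{bucket}",              # target of s3:ListBucket
                f"arn:aws:s3:::{bucket}/{prefix}/*",   # target of s3:GetObject
            ],
        }],
    }


# Placeholder values for illustration only.
policy = producer_bucket_policy("111122223333", "athena-consumer-role",
                                "example-producer-bucket")
print(json.dumps(policy, indent=4))
```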

Register the producer account’s Data Catalog

At this stage, you have set up the required permissions to access the central Data Catalog in the producer account from the consumer account. You now need to register the central Data Catalog as a data source in Athena.

In the consumer account, go to the Athena console and choose Connect data source.
Select S3 – AWS Glue Data Catalog as the data source.
Select AWS Glue Data Catalog in another account.

You then need to provide some information regarding the central Data Catalog you want to register.

For Data source name, enter a name for the catalog (for example, Central_Data_Catalog). This serves as an alias in the consumer account, pointing to the central Data Catalog in the producer account.

For Catalog ID, enter the producer account ID.
Choose Register to complete the process.

Query your data

You have now registered the central Data Catalog as a data source in the consumer account. In the Athena query editor, you can then choose Central_Data_Catalog as a data source. Under Database, you can see all the databases for which you were granted access in the producer account’s AWS Glue resource policy. The same applies for the tables. After completing the steps in the earlier sections, you should see the orders table from producer_database located in the producer account.

You can start querying the Data Catalog of the producer account directly from Athena in the consumer account. You can test this by running the following SQL query in Athena:

SELECT * FROM "Central_Data_Catalog"."producer_database"."orders" limit 10;

This SQL query extracts the first 10 rows of the orders table located in the producer account.

You just queried a Data Catalog located in another AWS account, which enables you to easily access your central Data Catalog and scale your data lake strategy.

Now, let’s see how we can join two tables that are in different AWS accounts. In our scenario, the returns table is in the consumer account and the orders table is in the producer account. Suppose you want to join the two tables and see the total quantity of items returned in each market. The Athena built-in support for cross-account Data Catalogs makes this operation easy. In the Athena query editor, run the following SQL query:

SELECT
  returns_tb.Market as Market,
  sum(orders_tb.quantity) as Total_Quantity
FROM "Central_Data_Catalog"."producer_database"."orders" as orders_tb
JOIN "AwsDataCatalog"."consumer_database"."returns" as returns_tb
  ON orders_tb."order id" = returns_tb."order id"
GROUP BY returns_tb.Market;

In this SQL query, you use both the consumer’s Data Catalog AwsDataCatalog and the producer’s Data Catalog Central_Data_Catalog to join tables and get insights from your data.
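
To see what the join computes without touching AWS, you can replay its logic locally. The following Python sketch uses SQLite as a stand-in for Athena, with two attached in-memory databases playing the role of the two Data Catalogs. The rows are invented toy data and the tables are cut down to the columns the query needs, so this illustrates the query semantics only, not Athena behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two separate in-memory databases stand in for the two catalogs.
conn.execute("ATTACH DATABASE ':memory:' AS producer_database")
conn.execute("ATTACH DATABASE ':memory:' AS consumer_database")

conn.execute('CREATE TABLE producer_database.orders ("order id" TEXT, quantity INTEGER)')
conn.execute('CREATE TABLE consumer_database.returns (returned TEXT, "order id" TEXT, market TEXT)')

# Toy rows, invented for this example.
conn.executemany(
    "INSERT INTO producer_database.orders VALUES (?, ?)",
    [("O-1", 3), ("O-1", 2), ("O-2", 4), ("O-3", 1)],
)
conn.executemany(
    "INSERT INTO consumer_database.returns VALUES (?, ?, ?)",
    [("Yes", "O-1", "EU"), ("Yes", "O-2", "US")],
)

rows = conn.execute(
    """
    SELECT returns_tb.market AS market,
           SUM(orders_tb.quantity) AS total_quantity
    FROM producer_database.orders AS orders_tb
    JOIN consumer_database.returns AS returns_tb
      ON orders_tb."order id" = returns_tb."order id"
    GROUP BY returns_tb.market
    ORDER BY market
    """
).fetchall()
print(rows)
```

Order O-3 has no matching row in returns, so the inner join drops it and only returned orders contribute to the totals.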

Limitations and considerations

The following are some limitations that you should take into consideration before using Athena built-in support for cross-account Data Catalogs:

This Athena feature is available only in Regions where Athena engine version 2 is supported. For a list of Regions that support Athena engine version 2, see Athena engine version 2. To upgrade a workgroup to engine version 2, see Changing Athena Engine Versions.
As of this writing, CREATE VIEW statements that include a cross-account Data Catalog are not supported.
Cross-Region Data Catalog queries are not supported.

Clean up

After you query and analyze the data, you should clean up the resources used in this tutorial to prevent any recurring AWS costs.

To clean up the resources, navigate to the Amazon S3 console in both the producer and consumer accounts, and empty the S3 buckets. Also, navigate to the AWS Glue console and delete the databases.

Conclusion

In this post, you learned how to query data from multiple accounts using Athena, which allows your organization to access a centralized Data Catalog. We hope that this post helps you build and explore your data lake across multiple accounts.

To learn more about AWS tools to manage access to your data, check out AWS Lake Formation. This service facilitates setting up a centralized data lake and allows you to grant users and ETL jobs cross-account access to Data Catalog metadata and underlying data.

About the Authors

Louis Hourcade is a Data Scientist in the AWS Professional Services team. He works with AWS customers across various industries to accelerate their business outcomes with innovative technologies. In his spare time he enjoys running, climbing big rocks, and surfing (not so big) waves.

Sara Kazdagli is a Professional Services consultant specialized in data analytics and machine learning. She helps customers across different industries build innovative solutions and make data-driven decisions. Sara holds an MSc in software engineering and an MSc in data science. In her spare time, she likes to go on hikes and walks with her Australian Shepherd dog Kiba.

Jahed Zaïdi is an AI/ML & Big Data specialist at AWS Professional Services. He is a builder and a trusted advisor to companies across industries, helping them innovate faster and on a larger scale. As a lifelong explorer, Jahed enjoys discovering new places, cultures, and outdoor activities.

The Windsor Castle: 40 Sovereigns, 25 Ghosts and 1 Big Fire

2021-12-20 Geographics

Post Syndicated from Geographics original https://www.youtube.com/watch?v=jMQdhIZNja8

More on NSO Group and Cytrox: Two Cyberweapons Arms Manufacturers

2021-12-20 Bruce Schneier

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2021/12/more-on-nso-group-and-cytrox-two-cyberweapons-arms-manufacturers.html

Citizen Lab published another report on the spyware used against two Egyptian nationals. One was hacked by NSO Group’s Pegasus spyware. The other was hacked both by Pegasus and by the spyware from another cyberweapons arms manufacturer: Cytrox.

We haven’t heard a lot about Cytrox and its Predator spyware. According to Citizen Lab:

We conducted Internet scanning for Predator spyware servers and found likely Predator customers in Armenia, Egypt, Greece, Indonesia, Madagascar, Oman, Saudi Arabia, and Serbia.

Cytrox was reported to be part of Intellexa, the so-called “Star Alliance of spyware,” which was formed to compete with NSO Group, and which describes itself as “EU-based and regulated, with six sites and R&D labs throughout Europe.”

In related news, Google’s Project Zero has published a detailed analysis of NSO Group’s zero-click iMessage exploit: FORCED ENTRY.

Based on our research and findings, we assess this to be one of the most technically sophisticated exploits we’ve ever seen, further demonstrating that the capabilities NSO provides rival those previously thought to be accessible to only a handful of nation states.

By the way, this vulnerability was patched on 13 Sep 2021 in iOS 14.8.

In 2021, the Internet went for TikTok, space and beyond

2021-12-20 João Tomé

Post Syndicated from João Tomé original https://blog.cloudflare.com/popular-domains-year-in-review-2021/

The years come and go, Internet traffic continues to grow (at least so far, and with some ‘help’ from the pandemic), and Internet applications, be they websites, IoT devices, or mobile apps, continue to evolve throughout the year, depending on whether they attract people.

We’ll have a broader Internet traffic-related Year in Review 2021 in the next few days (you can check the 2020 one here), but for now, let’s focus on the most popular domains this year according to our data on Cloudflare Radar and those domains’ changes in our popularity ranking. With Alexa.com going away, if you need a domain ranking, you can get it from Cloudflare.

We’ll focus on space (NASA and SpaceX flew higher), e-commerce (Amazon and Taobao rule), and social media (TikTok ‘danced’ to take the crown from Facebook). We’ll also take a little ‘bite’ out of the video streaming wars: Netflix is a Squid Game of its own, and it was at its highest in our ranking in January 2021, probably because of lockdowns and the pandemic.

Chat domains (WhatsApp, what else) will also be present and, of course, the less established metaverse domains of sorts (Roblox took the lead from Fortnite late in the game). Come with us, let’s travel through 2021.

What follows shows how Cloudflare saw Internet traffic to specific domains (some of which aggregate many websites) and their highs and lows in our global popularity ranking.

Top Sites: Google dethroned by the young ‘padawan’ TikTok

Let’s start with our Top Domains Ranking. 2021 brought us a very interesting duel for the #1 spot in our global ranking. Google.com (which includes Maps, Translate, Photos, Flights, Books, and News, among others) ended 2020 as the undefeated leader in our ranking: from September to December of last year, it was always on top. Back then, TikTok.com was only ranked #7 or #8.

Top 10 — Most popular domains (late) 2021

1 TikTok.com
2 Google.com
3 Facebook.com
4 Microsoft.com
5 Apple.com
6 Amazon.com
7 Netflix.com
8 YouTube.com
9 Twitter.com
10 WhatsApp.com

Top 10 — Most popular domains (late) 2020

1 Google.com
2 Facebook.com
3 Microsoft.com
4 Apple.com
5 Netflix.com*
6 Amazon.com
7 TikTok.com
8 YouTube.com
9 Instagram.com*
10 Twitter.com

* Amazon was #5 in November, but Netflix surpassed it in December 2020 (on some days it was even higher than Apple, at #4); Instagram and Twitter constantly swapped positions throughout November and December.

2021 told a different story. On February 17, 2021, TikTok got the top spot for a day. It got a few more days at the top in March and in May, but it was after August 10, 2021, that TikTok took the lead on most days. There were some days when Google was #1, but October and November were mostly TikTok’s, including Thanksgiving (November 25) and Black Friday (November 26).

There are other trends we can see by comparing both years (for 2020 we only show data from the end of the year, after September, when Cloudflare Radar was launched). For example, Facebook.com was steadily #2 across 2020, but with TikTok.com going up, Facebook is now a solid #3, followed by Microsoft.com (Office365 and Teams numbers are included there) and by Apple.com (App Store and Apple TV+ numbers are included), the same order as in 2020.

Amazon.com is the juggernaut that follows, but it is interesting to see that from January 2021 the e-commerce website (we will talk more about that category in a few paragraphs) jumped in front of Apple.com. Apple got back in front after September, with some exceptions such as November 28, 2021 (the day before Cyber Monday) and December 1 and 6.

Christmas time, Netflix time

Another trend is that Netflix surpassed Amazon in December 2020, especially around Christmas week. On some days around Christmas 2020, Netflix was even higher than Apple, at #4: that was the case on December 23 and 25, and from December 29 to January 2, 2021.

February 2, 2021: The day YouTube (and an aerobics instructor) ruled the world

In our global popularity ranking we also saw another trend: YouTube, usually ranked #6 or #7, got to the top spot of our list on February 2, 2021 — and only on that day.

Why? One can only guess, but although it was the week of the Super Bowl (some commercials, like the one from Doritos with Matthew McConaughey, came out that day), there was another big newsworthy event: the Myanmar coup d’état on February 1, 2021. How can a coup in a Southeast Asian country have an impact on YouTube? A video of a fitness instructor who unwittingly filmed her workout as the takeover unfolded behind her took the Internet by storm and went viral as the memes started to pour in.

That February day was also the one where Donald Trump announced his new legal team for the impeachment trial after the previous one quit, and Jeff Bezos announced he would step down as Amazon’s CEO. That was also the week prior to a record in YouTube’s history. On February 11, 2021, the video “Baby Shark Dance” from Korean education brand Pinkfong was the new most-viewed YouTube video of all time, surpassing the former record holder “Despacito” by Luis Fonsi.

Google Trends also shows that the week of February 2 was the week in 2021 when “YouTube” was searched most on Google.

Social media: There’s a new kid in town

In what was the second year of the pandemic, social media domains stayed high in our ranking. Nine of the ten main social media applications were in our top 100 list of most popular global domains; the only one outside it is Quora.com (during 2021 it ranged between #687 and #242).

TikTok (which also surpassed Google for the global #1 spot, as explained above) took the crown of most popular social media domain in our ranking from Facebook. That suggests TikTok got more Internet traffic from our standpoint. Our ranking is derived from our public DNS resolver 1.1.1.1, so it’s not related to the number of unique users or visitors a site gets per month; Facebook is, by far, the platform with the most users worldwide.
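To make the distinction between resolver-based popularity and user counts concrete, here is a minimal sketch of how a domain ranking could be built from DNS query logs. It is not Cloudflare's actual method, which the post does not describe. The function names, the sample hostnames, and the rule of treating the last two labels as the domain are all assumptions made for illustration.

```python
from collections import Counter


def registrable_domain(hostname: str) -> str:
    """Collapse a hostname to its domain.

    Naive assumption: the last two labels form the domain. A real system
    would consult the Public Suffix List (for example for co.jp names).
    """
    labels = hostname.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def rank_domains(queried_hostnames):
    """Return (rank, domain, query_count) tuples, most queried first."""
    counts = Counter(registrable_domain(h) for h in queried_hostnames)
    # Sort by descending query count; break ties alphabetically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, domain, n) for rank, (domain, n) in enumerate(ordered, start=1)]


# Hypothetical sample of hostnames seen by a resolver.
sample = [
    "www.tiktok.com",
    "v16.tiktok.com",
    "maps.google.com",
    "tiktok.com",
    "www.google.com",
    "facebook.com",
]
ranking = rank_domains(sample)
for rank, domain, n in ranking:
    print(rank, domain, n)
```

Note that one user can trigger many queries and many users can share one cached answer, which is why a ranking like this reflects resolver traffic rather than audience size.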

Top 10 — Most popular social media domains (late) 2021

1 TikTok.com
2 Facebook.com
3 YouTube.com
4 Twitter.com
5 Instagram.com
6 Snapchat.com
7 Reddit.com
8 Pinterest.com
9 LinkedIn.com
10 Quora.com

Top 10 — Most popular social media domains (late) 2020

1 Facebook.com
2 TikTok.com
3 YouTube.com
4 Instagram.com
5 Twitter.com
6 Snapchat.com
7 Reddit.com
8 Pinterest.com
9 LinkedIn.com
10 Quora.com

The Facebook outage on October 4, 2021 (which we explained extensively from our standpoint) also had an impact on Facebook’s position in our ranking: Facebook.com lost its #3 position (it dropped to #4) for seven days in a row that week. Facebook had never spent that many consecutive days at #4 since our data began in September 2020.

Looking at the top 10 list of social media domains alone, it’s also clear that YouTube comes third and that Twitter got a bump up and beat Instagram in 2021, getting the #5 place (barely, in what was a very close race). Back in late 2020 Twitter was behind Instagram in our ranking.

LinkedIn is the ninth most popular social media domain in our ranking and is still in our top 100. Throughout 2021 it rose in our list, especially in February and March. The social network for professionals then started to drop in June and July (the Northern Hemisphere’s summer), began climbing again in late August, and by November reached #52, its highest position of the year in our global ranking (in January it was ~#78). In a year when terms like The Great Resignation and the reset of people’s and organizations’ mindsets were widely discussed, it makes sense to see this social media platform growing.

Streaming: The (Squid) Netflix Game rules

The so-called video streaming wars got another important round in 2021 with new players appearing and old ones having amazing numbers — not only in subscribers, revenue, and content budgets but also in… Internet traffic. In our ranking, Netflix is still the undefeated hero.

We added YouTube.com (its most important service is free) to the list to compare with the big numbers from Netflix, and still, the Squid Game phenomenon platform won our ranking for most of the year. Amazon Prime is not included because the streaming service mainly uses Amazon.com (ranked #5 or #6 most of the year) as a domain.

The days of the year when Netflix was most popular? January was a great month, with Netflix reaching the #4 spot in our global ranking on the first two days of the year (and on all the weekends of January, Fridays included), then holding #5 through February. For the rest of 2021, the platform was mostly #7. Yes, on weekends Netflix seems to perform better in our ranking.

Roku.com seems to be the next video streaming platform after those two traffic giants, getting around the #80 position in our ranking through 2021. In late 2020 Hulu.com was the next one, but HBOMax.com surpassed Hulu in July 2021 and entered our top 100 list. In 2021, Disneyplus.com also rose in our ranking and surpassed the app-based TV service Sling.com later in the year. Our top 10 chart also includes Iq.com (iQiyi), the Chinese online video platform.

Top 10 — Most popular video streaming domains (late) 2021

1 Netflix.com
2 YouTube.com
3 Roku.com
4 HBOMax.com
5 Hulu.com
6 Peacocktv.com
7 Disneyplus.com
8 ParamountPlus.com
9 Sling.com
10 Iq.com

Top 10 — Most popular video streaming domains (late) 2020

1 Netflix.com
2 YouTube.com
3 Roku.com
4 Hulu.com
5 HBOMax.com
6 Peacocktv.com
7 Sling.com
8 Disneyplus.com
9 Iq.com
10 Wetv.vip

Netflix vs YouTube

E-commerce: Podium to Amazon, Taobao and eBay

Since the pandemic started, e-commerce has continued to thrive and grow at an even faster pace than before. The top four e-commerce domains (Amazon, Taobao, eBay and Walmart) in our global ranking are all in the top 100, and they stayed there steadily throughout the year.

The fifth in the e-commerce list, the Chinese giant Jd.com, also entered the top 100 for a few periods, mainly in May and especially June. On the day of the 618 shopping event, June 18, 2021, it reached #68 on our list, beating Walmart.com and almost catching Ebay.com.

In the following list it is easy to see that Jd.com surpassed Shopify.com in 2021, occupying the #5 place, and also Bestbuy.com and Target.com rose from one year to another.

Top 10 — Most popular e-commerce domains (late) 2021

1 Amazon.com
2 Taobao.com
3 Ebay.com
4 Walmart.com
5 Jd.com
6 Shopify.com
7 Bestbuy.com
8 Target.com
9 Rakuten.co.jp
10 Homedepot.com

Top 10 — Most popular e-commerce domains (late) 2020

1 Amazon.com
2 Taobao.com
3 Ebay.com
4 Walmart.com
5 Shopify.com
6 Jd.com
7 Olx.com.br
8 Rakuten.co.jp
9 Target.com
10 Bestbuy.com*

* Shein.com went ahead of Bestbuy.com and Target.com from December 19 to 24, 2020.

Here are other trends:

Amazon.com is a domain, as we already explained, with more than e-commerce services (that’s why globally it ranks between #4 and #6). In 2021, it had some good days in January and in late April 2021, reaching #4, but by the end of the year it got its best days in our ranking, especially on the day before Cyber Monday, November 28, and on December 1 and 6 — it reached #5.
Taobao.com had its best day of the year in our global ranking on August 20 — #15 — and by the popular Chinese shopping day, Singles’ Day, November 11, it was #17.
Ebay.com had a solid year and a good late August (#29 on August 31) and grew more after Cyber Monday, peaking on December 1, reaching #27.
Shopify had a great August (reaching #100 on August 18), the same with Etsy.com that peaked at #128 on August 21. Walmart had a great June (#66) and also end of November (it reached #70).
Ikea.com had a big increase in importance throughout the year and got very near to Homedepot.com’s position in September (peaked in the #695 position in our global ranking), staying up through November.
Best Buy peaked on October 6 and grew strongly throughout November, also matching Shopify in December.

Shein.com, the global Chinese online fast-fashion retailer, went high in our ranking for the Christmas of 2020 — it went ahead of Bestbuy.com and Target.com from December 19 to 24, 2020, reaching the #253 position. In March, it had another peak, and it got the best position in 2021 in our ranking after Cyber Monday — it reached #301 on December 1, 2021.

2021: A Space Odyssey (for NASA, SpaceX, Blue Origin and Virgin Galactic)

This year was also a big year for space travel, with several achievements. Spacecraft from three Mars exploration programs, from the United Arab Emirates, China, and the United States, arrived at Mars in February. NASA’s Perseverance rover landed on February 18, 2021, and after that the Ingenuity drone made history with the first powered aircraft flight on another planet. And there is another big space event just around the corner: the James Webb Space Telescope launch.

Virgin Galactic (July 11), Blue Origin (July 20) and SpaceX (September 16 — but with several other events before that regarding satellites and reuse of space capsules) also stormed the Internet with space tourism achievements with different scopes. Only SpaceX offered an orbital ride.

In terms of domains, NASA.gov was way ahead of the others, but Elon Musk’s SpaceX.com was definitely second in our global ranking, followed by Blueorigin.com. Virgingalactic.com appears in our top 100k ranking only once, on July 17 and 18 (a few days after Richard Branson’s spaceflight).

NASA.gov has been high in our global ranking since last year, within the top 1,000 domains of our list, but after the Perseverance rover landed on Mars on February 18 it entered our top 700; its highest day of that month was February 25, when it reached #657. In the summer it went down in our ranking, but it picked up in late September and on October 13, 2021, reached its highest position of the year (#637). That was the day of the press conference about NASA’s Lucy mission, the agency’s first to Jupiter’s Trojan asteroids (the launch was on October 16).

SpaceX.com had a great start to February, entering our top 8,000 in a month that saw a launch of 60 new Starlink internet satellites into orbit, a missed rocket landing, and a fresh $850 million of new investment. It climbed in our ranking again after September 16, 2021, with Inspiration4, the first orbital launch of an all-private crew.

Blue Origin had a strong start to the year, reaching #32,000 on January 10 (a few days before New Shepard 4’s first test flight), and went up again between July 20 and 27 after its first crewed flight, with Jeff Bezos on board. It also went up in our ranking a few days after October 13, 2021 (the day William Shatner flew aboard a Blue Origin suborbital capsule).

Messaging or chat: WhatsApp, what else?

There aren’t as many messaging or chat platforms as there are popular social media sites, video streaming, or e-commerce platforms. So, this ranking is slim, and even slimmer because Messenger (which uses Facebook.com) and iMessage (which uses Apple.com) aren’t included. Snapchat is both a social media platform and a messaging app, as is Instagram, and we put them in the social media ranking. If they were here they would be higher than WeChat but behind WhatsApp. Instagram actually started 2021 in front of WhatsApp (it got to #8) and stayed there until February, later going as low as #13, and Snapchat ranged between #29 and #16.

Top — Most popular chat domains (late) 2021

1 WhatsApp.com
2 WeChat.com
3 Signal.org
4 Telegram.com

Top — Most popular chat domains (late) 2020

1 WhatsApp.com
2 Signal.org
3 WeChat.com
4 Telegram.com

From our standpoint, WhatsApp is the undisputed leader of the messaging apps ranging from as low as #13 in our global ranking to as high as #8. Its best parts of the year were late March, late April, late October and then late November going through December 2021 as #8 in our ranking.

How Signal skyrocketed in January (and WeChat in February)

All the others are far away in our ranking, but 2021 brought three trends we should highlight:

Signal.org had an incredible month of January: on January 3 it was #1815 in our ranking and by January 20 it had risen to #766, a climb of more than 1,000 positions in just 17 days. Why? WhatsApp’s new privacy policy was in the headlines in the second week of January.

WeChat.com also had an amazing jump in our ranking, but more in February and by April it surpassed Signal.org — it went from #3142 at the start of February to #979 by April 25 and by October both of the messaging apps were almost tied at ~#370 and had a significantly higher place in our ranking than in late 2020.
Telegram.com on the other hand had a decrease in ranking throughout the year and ended up in the top 38,000.

“You can’t just materialize anywhere in the Metaverse, like Captain Kirk beaming down from on high. This would be confusing and irritating to the people around you. It would break the metaphor. Materializing out of nowhere (or vanishing back into Reality) is considered to be a private function best done in the confines of your own House.”
― Neal Stephenson, Snow Crash (1992)

Metaverse: Don’t mess with Roblox

Back in November, we heard in the halls of Web Summit (the global tech event in Lisbon, with 42,000 in-person attendees) that in a way the metaverse is already here (Roblox’s Global Head of Music had some thoughts on virtual concerts). But we’re still far from the promise of almost living in the virtual world that books like Neal Stephenson’s Snow Crash or Ernest Cline’s Ready Player One showed us.

Oculus shipped a lot of headsets and there are immersive experiences out there that are metaverse-like (a step further than the now-usual-for-most spending all day working, learning, and communicating through a screen). We focused on those: Fortnite, Roblox, Second Life (the oldest, from 2003), Minecraft and Oculus. But Oculus.com doesn’t have enough direct traffic (playing games on an Oculus headset could direct the traffic elsewhere) to be in our top 100k domains ranking, and the same happens with Minecraft.

Top — Most popular metaverse domains (late) 2021

1 Roblox.com
2 Epicgames.com (Fortnite)
3 Secondlife.com

Oculus.com and Minecraft.net are not in our 100,000 ranking

Top — Most popular metaverse domains (late) 2020

1 Epicgames.com (Fortnite)
2 Roblox.com
3 Secondlife.com

Oculus.com and Minecraft.net are not in our 100,000 ranking

The (short) list from 2020 and 2021 shows us that Roblox.com surpassed Epicgames.com (the home of the popular Fortnite) for the first time in July reaching back then #27 in our list. But it was after late September that it was consistently in front of the rival game platform, ending the year on a good note reaching #20 in our ranking.

Epicgames.com (Fortnite) started the year a lot better, reaching #14 on January 5, 2021, but it started to lose importance in February and that deepened after May, but mostly in July and August. It never truly recovered and ended the year between #26 and #47, depending on the day.

Conclusion: Human (online) trends

The Internet is not a quiet place, the same way humans on Earth (especially during a pandemic) aren’t quiet or passive but active and reactive. Although on the top of our domain ranking there don’t seem to be drastic ups and downs throughout the year (TikTok, and YouTube, were the exceptions), we saw how an event like the Myanmar coup and the subsequent viral video may have brought YouTube to #1 on our ranking. We also saw how e-commerce was affected throughout the year, how space-related websites had a big (online) year with important events, and how Netflix rose around Christmas time.

And remember: you can keep an eye on Cloudflare Radar to monitor how we see Internet traffic globally and in every country.

Security updates for Monday

2021-12-20

Post Syndicated from original https://lwn.net/Articles/879228/rss

Security updates have been issued by Debian (apache-log4j2, firefox-esr, libssh2, modsecurity-apache, and tang), Fedora (lapack, log4j, rust-libsqlite3-sys, rust-rusqlite, xorg-x11-server, and xorg-x11-server-Xwayland), Mageia (bind, botan2, chromium-browser-stable, dovecot, hiredis, keepalived, log4j, matio, mediawiki, olm, openssh, pjproject, privoxy, vim, and watchdog), openSUSE (barrier, nim, and python-pip), Oracle (ipa and samba), Scientific Linux (ipa and samba), SUSE (log4j), and Ubuntu (apache-log4j2, htmldoc, python3.6, python3.7, python3.8, and python3.8, python3.9).
