How Edmunds GovTech unifies data and analytics data for municipalities with Amazon QuickSight

Post Syndicated from Edmunds GovTech original https://aws.amazon.com/blogs/big-data/how-edmunds-govtech-unifies-data-and-analytics-data-for-municipalities-with-amazon-quicksight/

This is a guest post from an Amazon QuickSight customer, Edmunds GovTech.

Over the past 30 years, Edmunds GovTech has grown to provide enterprise resource planning (ERP) solutions to thousands of East Coast municipalities. We also serve cities and towns in 25 other states. In this blog, I’ll talk about how we used Amazon QuickSight embedded business intelligence (BI) to quickly bring powerful dashboards to our on-premises and cloud-based customers.

Unifying insights

Our customers rely on our suite of solutions to manage finances, personnel, revenue, and municipal management activities such as permits, land management, business licensing, and fleet maintenance. They can access a wide variety of reports and data analysis tools tailored to the needs of users in finance, operations, and other departments. Recent acquisitions have also added new capabilities to our offering, each with its own set of reporting tools.

These reports serve specialist users well. However, we wanted to add the ability to aggregate and visualize information in one easy-to-consume service. Time-starved executives, boards, and decision-makers needed a better way to gain key insights into spending trends and implement better cost and cash management strategies. They strive to better achieve the financial goals of their municipalities without having to spend time running reports in different areas of the solution.

Production-ready in record time

With this vision in place, our primary directive was speed to market, with the aim of releasing a production-ready solution in just 4 months. We carefully evaluated our priorities and functional requirements, and ultimately concluded that outsourcing infrastructure management was key. QuickSight, a fully managed, cloud-native BI service from AWS, was the only option that allowed us to deliver so quickly.

Just as importantly, our professional services team saves a significant amount of time when implementing the service and training customers. That means immediate value for the customer and more time for our professional services team to spend on other activities, increasing profitability. We sell the embedded dashboard service as a subscription-based add-on, so customers can easily purchase and use it.

Flexible and future-proof

Although many of our customers use traditional client/server configurations in their own data centers, our cloud-hosted solution is becoming increasingly popular, especially with increasing numbers of remote workers. We’re also developing a software as a service (SaaS) version of our suite and continue to acquire other vendors to add functionality. All these factors mean our QuickSight dashboard service needs to be platform-agnostic. It must work with any source application, whether in AWS or on premises.

We accomplished this using Amazon Simple Queue Service (Amazon SQS) and Amazon Simple Storage Service (Amazon S3). The source application emits events about finance accounts, vendors, and yearly budgets using Amazon SQS, with Amazon S3 available to ingest large messages that exceed the limits of Amazon SQS. We rely on AWS Lambda serverless functions to handle the ingestion and routing of the messages. Each customer has an individual reporting store, separate from the database of the source system.
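The inline-versus-S3 routing described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code Edmunds GovTech runs: the function name `build_envelope`, the envelope fields (`type`, `body`, `s3_key`), the key prefix, and the 256 KiB threshold are all hypothetical, so check the current Amazon SQS message size quota before relying on a number. The `upload` callable stands in for an S3 put operation so the sketch runs without AWS access.

```python
import json
import uuid
from typing import Callable

# Assumed threshold for illustration; treat the real SQS quota as configurable.
MAX_INLINE_BYTES = 256 * 1024


def build_envelope(event: dict, upload: Callable[[str, bytes], None],
                   max_inline_bytes: int = MAX_INLINE_BYTES) -> str:
    """Return the JSON string to place on the queue for one event."""
    inline = json.dumps({"type": "inline", "body": event})
    if len(inline.encode("utf-8")) <= max_inline_bytes:
        return inline
    # Too large for the queue: store the payload and send only a pointer.
    key = f"large-messages/{uuid.uuid4()}.json"  # hypothetical key layout
    upload(key, json.dumps(event).encode("utf-8"))
    return json.dumps({"type": "s3_reference", "s3_key": key})


# A dict stands in for the S3 bucket in this demo.
stored: dict = {}
small = build_envelope({"account": "101", "budget": 5000}, stored.__setitem__)
large = build_envelope({"notes": "x" * 300_000}, stored.__setitem__)
```

Passing the upload step in as a callable keeps the size decision separate from the storage client, which makes the routing easy to test on its own.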

This system transforms data from the customer’s system into a format that is normalized for QuickSight reporting. By pointing QuickSight at these schemas, we enable it to report on that data. The customer dashboard is embedded in the ERP application, so the customer doesn’t need to go to the QuickSight website to access it.

Any source application that can adhere to the messaging format can be reported on. The source system is responsible for the number-crunching, so any customizations the customer has applied are reflected in the reports.

The following diagram illustrates this architecture.

The high-level architecture is as follows:

  1. The application sends a JSON message to an SQS queue. If the message is too large, it is added to an S3 bucket and the queue message contains a reference to the stored object. The source can be any application, as long as it produces messages that adhere to the predefined JSON schema.
  2. A Lambda consumer ingests a batch of messages, validates the payloads, and transforms them from JSON into rows in the tenant’s Amazon Aurora MySQL reporting database, which uses a star schema. The consumer can ingest small messages directly from the SQS event or retrieve large messages from Amazon S3.
  3. The QuickSight namespace for the tenant contains a dashboard created from a master template that points to the appropriate reporting database.
  4. The source application requests the dashboard on the user’s behalf. The dashboard is embedded within the source application UI.
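The consumer step might look something like the sketch below. It is an illustration only: the required fields (`tenant_id`, `entity`, `data`) and the envelope shape (`type`, `body`, `s3_key`) are hypothetical, since the post does not publish its JSON schema. The `Records`, `messageId`, and `body` keys follow the standard SQS event that Lambda receives, and `batchItemFailures` is the partial-batch response format Lambda accepts when that feature is enabled on the event source mapping. A real handler takes `(event, context)`; here a `fetch` callable replaces the S3 client so the code runs locally, and writing to the reporting database is left out.

```python
import json
from typing import Callable

REQUIRED_FIELDS = {"tenant_id", "entity", "data"}  # hypothetical schema


def resolve_payload(record: dict, fetch: Callable[[str], bytes]) -> dict:
    """Return the event payload, following an S3 reference if present."""
    envelope = json.loads(record["body"])
    if envelope.get("type") == "s3_reference":
        return json.loads(fetch(envelope["s3_key"]))
    return envelope["body"]


def validate(payload: dict) -> None:
    missing = REQUIRED_FIELDS - set(payload)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")


def handler(event: dict, fetch: Callable[[str], bytes]) -> dict:
    """Group valid payloads by tenant and report the messages that failed."""
    by_tenant: dict = {}
    failures = []
    for record in event["Records"]:
        try:
            payload = resolve_payload(record, fetch)
            validate(payload)
        except (ValueError, KeyError):
            # Invalid JSON, a missing key, or a schema violation.
            failures.append({"itemIdentifier": record["messageId"]})
            continue
        by_tenant.setdefault(payload["tenant_id"], []).append(payload)
    return {"rows": by_tenant, "batchItemFailures": failures}


# Demo: one inline message, one S3 reference, one malformed message.
store = {"large/1.json": json.dumps(
    {"tenant_id": "t2", "entity": "budget", "data": {}}).encode()}
event = {"Records": [
    {"messageId": "m1", "body": json.dumps({"type": "inline", "body": {
        "tenant_id": "t1", "entity": "vendor", "data": {}}})},
    {"messageId": "m2", "body": json.dumps(
        {"type": "s3_reference", "s3_key": "large/1.json"})},
    {"messageId": "m3", "body": "not json"},
]}
result = handler(event, store.__getitem__)
```

Grouping by tenant before writing reflects the one-reporting-store-per-customer design described earlier.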

Because the system relies on Lambda functions, it’s a modern, decoupled architecture that is inherently future-proof and scalable. We don’t have to manage cloud or on-premises servers, and we only pay for what clients actually use.

Additionally, we were able to build a user interface that makes it easy to deploy new customers with just a few clicks. We use the installer to create the infrastructure for new clients using the AWS Command Line Interface (AWS CLI). The customer simply pushes a button from the source system to push data to the dashboard. They can be using the dashboard in less than an hour from start to finish.

Continuously increasing customer value

QuickSight has rolled out a lot of new features just in the short time we’ve been using it, and we’re already taking advantage of them. For example, QuickSight now remembers filter selections by user, so that the choices a user makes are remembered from session to session. That saves our customers time and effort and helps them get the information they need faster.

Embedded authoring is another significant feature that we’re looking forward to rolling out soon. As of this writing, we manage and maintain reporting templates for customers and push them out to clients using the AWS CLI. With the new embedded authoring capability of QuickSight, customers will be able to explore data in a self-service manner, perform ad hoc analysis, and create new dashboards. This will greatly increase the utility of the service while maintaining ease of use for customers and simplicity of management for our team. We’re also adopting the new namespace functionality to help customers maintain data separation from others in our multi-tenant solution.

Together today and tomorrow

Working with AWS has been a great experience. Our account representative has always been available for questions and feedback, which helped us succeed especially on such an accelerated timeframe. In addition to bringing QuickSight to our customers, we value the relationship we’ve developed with AWS and look forward to building on it as we move forward with our cloud solutions. Partnering with AWS has led to many benefits across our entire organization.

Marketing and sales teams in our organization are leading client demos with the QuickSight dashboard because it looks great and works seamlessly, and it’s something a lot of customers have been asking for. For department heads, executives, and other leaders, the ability to quickly visualize current and historical budget information is huge. They can also show their boards the information they need in a very easy-to-consume way. By giving customers one place to go for a high-level strategic view across their municipality, we’re helping them make better decisions and ultimately serve their constituents more effectively.


About the Author

Thomas Mancini is the VP, Concept Development at Edmunds GovTech.

Data monetization and customer experience optimization using telco data assets: Part 2

Post Syndicated from Vikas Omer original https://aws.amazon.com/blogs/big-data/part-2-data-monetization-and-customer-experience-optimization-using-telco-data-assets/

Part 1 of this series explains the importance of building and implementing a customer experience (CX) management and data monetization strategy for telecom service providers (TSPs), and the major challenges driving these initiatives. It also includes an AWS CloudFormation template to set up a demonstration of the solution using AWS services. It covers transforming and enriching multiple datasets, and offers information about data standardization, baselining an analytics data model to marry different datasets like deep packet inspection (DPI) engine embedded Packet Switch (PS) probe, CRM, subscriptions, media, carrier, device, and network configuration management in the data warehouse with AWS Glue, AWS Lambda, and Amazon Redshift.

In this post, I demonstrate how you can enable data analysts, scientists, and advanced business users to query data from Amazon Redshift or Amazon Simple Storage Service (Amazon S3) directly. I also demonstrate configuring a simple drag-and-drop interface for self-service analytics so you can prepare and publish insights based on enriched data stored in Amazon Redshift or Amazon S3 through Amazon QuickSight.

Solution overview

The following diagram illustrates the workflow of the solution.

In part 1 of this series, we discuss the overall workflow. In this post, we focus on the following steps:

  1. Catalog the processed raw, aggregate, and dimension data in the AWS Glue Data Catalog using the DPI processed data crawler.
  2. Interactively query data directly from Amazon S3 using Amazon Athena and visualize in QuickSight.
  3. Enable self-service analytics using QuickSight to prepare and publish insights based on data residing in the Amazon Redshift cluster.

Querying data using Amazon Redshift

After creating your Amazon Redshift cluster, you can immediately run queries by using the query editor on the Amazon Redshift console. Complete the following steps:

  1. On the Amazon Redshift console, in the navigation pane, choose Clusters.

A cluster with the identifier <redshift database name>-<cloudformation stack> should be present. For this example, the cluster is cemdm-telco.

  2. Choose Editor.
  3. Enter the required credentials to connect to the Amazon Redshift query editor. (Database name, Database user, and Database password are the ones you entered while creating the CloudFormation stack.)

  4. Choose Connect to database.

Upon successful authentication, you’re directed to the query editor.

  5. Run a few queries to check if data is in the tables.

In the following code, <table-name> is the Amazon Redshift table name:

select count(1) from cemdm.<table-name>;

The following query extracts the unique subscriber count by age group for subscribers with Apple devices who browse retail domain websites or apps in or around shopping malls. You can also extract the list of subscribers and micro-segment them by consumption (total data volume) or by adding KPIs like recency and frequency.

select 
  dcd.age_range, 
  count(distinct f.customer_id) as "Unique Subs Count"
from 
  cemdm.f_daily_dpi f
inner join cemdm.d_customer_demographics dcd on f.customer_id = dcd.customer_id
inner join cemdm.d_tac dt on f.tac_code = dt.tac_sid
inner join cemdm.d_device dd on dt.device_sid = dd.device_sid
inner join cemdm.d_dpi_dictionary ddd on f.protocol_id = ddd.app_id
inner join cemdm.d_location dl on f.location_id = dl.location_id
where 
  dd.device_manufacturer = 'Apple' 
and ddd.media_category = 'Retail' 
and dl.location_tier_4 ilike '%mall%'
group by 1 
order by 2 desc;

The following screenshot shows the output.
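To experiment with the shape of this query without a cluster, the sketch below runs an equivalent statement against an in-memory SQLite database. Table and column names follow the query above, but the rows are invented, the `cemdm` schema prefix is dropped, and SQLite's `like` (case-insensitive for ASCII by default) stands in for `ilike`, which SQLite does not support. Treat it as a way to reason about the joins and filters, not as a substitute for running the query in Amazon Redshift.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table f_daily_dpi (customer_id, tac_code, protocol_id, location_id);
create table d_customer_demographics (customer_id, age_range);
create table d_tac (tac_sid, device_sid);
create table d_device (device_sid, device_manufacturer);
create table d_dpi_dictionary (app_id, media_category);
create table d_location (location_id, location_tier_4);

insert into d_customer_demographics values (1, '18-25'), (2, '18-25'), (3, '26-35');
insert into d_tac values (10, 100), (11, 101);
insert into d_device values (100, 'Apple'), (101, 'Samsung');
insert into d_dpi_dictionary values (500, 'Retail'), (501, 'Music');
insert into d_location values (900, 'City Mall'), (901, 'Airport');
insert into f_daily_dpi values
  (1, 10, 500, 900),  -- Apple, Retail, mall: counted
  (1, 10, 500, 900),  -- same subscriber again: counted once
  (2, 10, 500, 900),  -- Apple, Retail, mall: counted
  (3, 11, 500, 900),  -- Samsung device: excluded
  (3, 10, 501, 900),  -- Music category: excluded
  (3, 10, 500, 901);  -- not a mall: excluded
""")

rows = conn.execute("""
select dcd.age_range, count(distinct f.customer_id) as unique_subs
from f_daily_dpi f
inner join d_customer_demographics dcd on f.customer_id = dcd.customer_id
inner join d_tac dt on f.tac_code = dt.tac_sid
inner join d_device dd on dt.device_sid = dd.device_sid
inner join d_dpi_dictionary ddd on f.protocol_id = ddd.app_id
inner join d_location dl on f.location_id = dl.location_id
where dd.device_manufacturer = 'Apple'
  and ddd.media_category = 'Retail'
  and dl.location_tier_4 like '%mall%'
group by 1
order by 2 desc
""").fetchall()

print(rows)
```

With this dummy data, only subscribers 1 and 2 satisfy all three filters, and both fall in the 18-25 age range.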

Unloading processed and enriched data from Amazon Redshift to Amazon S3

Amazon Redshift also includes Amazon Redshift Spectrum, which allows you to directly run SQL queries against exabytes of unstructured data in Amazon S3 data lakes. No loading or transformation is required, and you can use open data formats, including Avro, CSV, Ion, JSON, ORC, and Parquet. Amazon Redshift Spectrum automatically scales query compute capacity based on the data being retrieved, so queries against Amazon S3 run quickly, regardless of dataset size.

Amazon Redshift Spectrum gives you the freedom to store your data where you want, in the format you want, and have it available for processing when you need it. This is particularly helpful if you need to offload cold or historical data on Amazon Redshift to Amazon S3 in open data format. You can still access this data through Amazon Redshift via Amazon Redshift Spectrum plus any other application.

TSP data assets also include a lot of unstructured event data. This data is transient, and only valuable for a short amount of time. Therefore, you can leave it on Amazon S3 and access it from Amazon Redshift directly through Amazon Redshift Spectrum. You can use a lake house architecture approach, where hot, mostly static, and corporate data is in the warehouse, and the events data is in the data lake.

Alternatively, you can analyze data on Amazon S3 using Athena.

  1. Use the queries in the following table (in the Unload Statement column) in the Amazon Redshift query editor to unload data from Amazon Redshift to Amazon S3. For instructions, see Unloading data to Amazon S3. Provide the following information:
    • <aws-stack-name> – The name of the CloudFormation stack
    • <aws-region> – The Region in which you deployed the stack (for example, us-east-1)
    • <s3-bucket-name> – The bucket that you created while deploying the stack
    • <aws-account-id> – The AWS account ID in which you deployed the stack
    • <table-name> – The name of the Amazon Redshift table
Amazon Redshift Table: f_raw_dpi, f_hourly_dpi
Unload Statement:

unload ('select * from cemdm.<table-name>')
       to 's3://<s3-bucket-name>/dpi/processed/<table-name>/'
       iam_role 'arn:aws:iam::<aws-account-id>:role/RedshiftBasicCustom-<aws-region>-<aws-stack-name>'
       ALLOWOVERWRITE
       PARQUET
       PARTITION BY (date_id, hour_id);

Amazon Redshift Table: f_daily_dpi
Unload Statement:

unload ('select * from cemdm.<table-name>')
       to 's3://<s3-bucket-name>/dpi/processed/f_daily_dpi/'
       iam_role 'arn:aws:iam::<aws-account-id>:role/RedshiftBasicCustom-<aws-region>-<aws-stack-name>'
       ALLOWOVERWRITE
       PARQUET
       PARTITION BY (date_id);

Amazon Redshift Table: d_customer_demographics, d_device, d_dpi_dictionary, d_location, d_operator_plmn, d_tac, d_tariff_plan, d_tariff_plan_desc
Unload Statement:

unload ('select * from cemdm.<table-name>')
       to 's3://<s3-bucket-name>/dpi/processed/<table-name>/'
       iam_role 'arn:aws:iam::<aws-account-id>:role/RedshiftBasicCustom-<aws-region>-<aws-stack-name>'
       ALLOWOVERWRITE
       PARQUET;

Alternatively, you can copy the Amazon Redshift AWS Identity and Access Management (IAM) role ARN to unload data to Amazon S3 from the console under the cluster’s properties.
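Because the 11 unload statements differ only in the table name and the partition columns, you could generate them rather than edit each by hand. The helper below is a hypothetical convenience, not part of the CloudFormation stack: `build_unload` and `PARTITIONS` are names introduced here, and the bucket, account ID, Region, and stack values in the demo are placeholders. The statement text mirrors the unload statements listed above.

```python
# Partition columns per fact table; dimension tables are unloaded unpartitioned.
PARTITIONS = {
    "f_raw_dpi": ("date_id", "hour_id"),
    "f_hourly_dpi": ("date_id", "hour_id"),
    "f_daily_dpi": ("date_id",),
}


def build_unload(table, bucket, account_id, region, stack):
    """Render the UNLOAD statement for one cemdm table."""
    role_arn = (f"arn:aws:iam::{account_id}:role/"
                f"RedshiftBasicCustom-{region}-{stack}")
    lines = [
        f"unload ('select * from cemdm.{table}')",
        f"to 's3://{bucket}/dpi/processed/{table}/'",
        f"iam_role '{role_arn}'",
        "ALLOWOVERWRITE",
        "PARQUET",
    ]
    partition_by = PARTITIONS.get(table, ())
    if partition_by:
        lines.append(f"PARTITION BY ({', '.join(partition_by)})")
    return "\n".join(lines) + ";"


# Placeholder values for illustration only.
daily = build_unload("f_daily_dpi", "my-bucket", "123456789012",
                     "us-east-1", "telco")
dim = build_unload("d_tac", "my-bucket", "123456789012",
                   "us-east-1", "telco")
print(daily)
```

You would still paste the generated text into the query editor; the helper only removes the repetitive substitution.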

  2. Verify that the data has been unloaded to Amazon S3 under <s3-bucket-name>/dpi/processed/.
  3. On the AWS Glue console, in the navigation pane, choose Crawlers.
  4. Select DPIProcessedDataCrawler.
  5. Choose Run crawler.

  6. Wait for the crawler to show the status Stopping.

The number of tables added by the DPIProcessedDataCrawler crawler should be 11.

  7. Under Databases, choose Tables.
  8. Verify the following 11 tables are created under the cemdm database:
    • processed_f_raw_dpi
    • processed_f_hourly_dpi
    • processed_f_daily_dpi
    • processed_d_customer_demographics
    • processed_d_device
    • processed_d_dpi_dictionary
    • processed_d_location
    • processed_d_operator_plmn
    • processed_d_tac
    • processed_d_tariff_plan
    • processed_d_tariff_plan_desc

Visualizing data using QuickSight

QuickSight is a business analytics service you can use to build visualizations, perform one-time analysis, and get business insights from your data. For more information, see What Is Amazon QuickSight?

To connect QuickSight to Amazon Redshift as your data source, complete the following steps:

  1. Create a private connection from Amazon QuickSight to an Amazon Redshift cluster.

These steps involve creating a new private subnet, which the CloudFormation stack has already created for you. Use the private subnet that isn’t used by the Amazon Redshift cluster for your QuickSight connection.

QuickSight provides out-of-the-box integration with Amazon Redshift, making it simple to query and visualize your Redshift data. For more information, see Creating a Dataset from an Autodiscovered Amazon Redshift Cluster or Amazon RDS Instance.

  2. For Schema, choose cemdm.
  3. For Tables, select f_daily_dpi.
  4. Choose Edit/Preview data.

  5. Add data and prepare the following table relationships in Data Prep. Use the information provided to create the relationships between the tables:

Table A Name   Table A Attribute  Join Type  Table B Name             Table B Attribute
f_daily_dpi    customer_id        LEFT       d_tariff_plan            customer_id
f_daily_dpi    tac_code           INNER      d_tac                    tac_sid
f_daily_dpi    sgsn_plmn_sid      INNER      d_operator_plmn          plmn_sid
f_daily_dpi    location_id        LEFT       d_location               location_id
f_daily_dpi    protocol_id        INNER      d_dpi_dictionary         app_id
f_daily_dpi    customer_id        LEFT       d_customer_demographics  customer_id
d_tariff_plan  tariff_plan_id     INNER      d_tariff_plan_desc       tariff_plan_id
d_tac          device_sid         INNER      d_device                 device_sid

You can join d_operator_plmn with sgsn_plmn_sid and home_plmn_sid, but because the sample data only contains home subscriber data, a second join of f_raw_dpi data with d_operator_plmn on home_plmn_sid and plmn_sid is not present in the given relationship of tables.

The following screenshot shows the table relationships.
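If you prefer to express the same relationships as custom SQL instead of building them in the Data Prep interface, the join list can be turned into a FROM clause. The sketch below is illustrative: `JOINS` transcribes the relationship table above, while `from_clause` is a hypothetical helper introduced here. For Athena, the database and table names would differ (for example, the `processed_` prefixed tables), so adjust the inputs accordingly.

```python
# (table A, attribute A, join type, table B, attribute B), as listed above.
JOINS = [
    ("f_daily_dpi", "customer_id", "LEFT", "d_tariff_plan", "customer_id"),
    ("f_daily_dpi", "tac_code", "INNER", "d_tac", "tac_sid"),
    ("f_daily_dpi", "sgsn_plmn_sid", "INNER", "d_operator_plmn", "plmn_sid"),
    ("f_daily_dpi", "location_id", "LEFT", "d_location", "location_id"),
    ("f_daily_dpi", "protocol_id", "INNER", "d_dpi_dictionary", "app_id"),
    ("f_daily_dpi", "customer_id", "LEFT", "d_customer_demographics", "customer_id"),
    ("d_tariff_plan", "tariff_plan_id", "INNER", "d_tariff_plan_desc", "tariff_plan_id"),
    ("d_tac", "device_sid", "INNER", "d_device", "device_sid"),
]


def from_clause(base, joins, schema="cemdm"):
    """Render a FROM clause; each joined table must appear after its parent."""
    seen = {base}
    parts = [f"from {schema}.{base}"]
    for a, a_col, kind, b, b_col in joins:
        if a not in seen:
            raise ValueError(f"{a} is joined before it is introduced")
        parts.append(
            f"{kind.lower()} join {schema}.{b} on {a}.{a_col} = {b}.{b_col}")
        seen.add(b)
    return "\n".join(parts)


clause = from_clause("f_daily_dpi", JOINS)
print(clause)
```

The ordering check matters because d_tariff_plan_desc and d_device hang off dimension tables rather than the fact table.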

  6. Name your analysis CEMDM.
  7. Choose Save & visualize.

The following screenshots demonstrate a few QuickSight analyses created from the dataset we created. For more information about creating analyses in QuickSight, see Working with Analyses. You can divide all analyses across all the available attributes. We use the use case from part 1 of this series.

The following screenshot shows visualizations of user demographics on the Demographics tab.

The following screenshot shows visualizations of user interest on the Interest Analysis tab.

The following screenshot shows visualizations of user locations on the Location tab.

The following screenshot shows visualizations of device information on the Device tab.

The following screenshot shows visualizations of subscription information on the Subscriptions tab.

The following screenshot shows visualizations of roaming users on the Roaming tab.

The following screenshot shows visualizations on the Sub Details tab. You can drill down to subscriber-level details from any dashboard across any dimension or apply global-level filters to narrow down the desired segment.

You can also build these reports using Athena as a data connector. QuickSight provides out-of-the-box integration with Athena, which lets you run SQL queries on top of the metadata in your AWS Glue Data Catalog. For more information, see Creating a Dataset Using Amazon Athena Data.

You can also use Amazon Redshift metadata as a business glossary and visualize it using QuickSight with the following custom SQL:

SELECT * FROM (
  select 
    n.nspname as "Schema",c1.relname as "Table Name", c.attname as "Column Name", 'Attribute' as "Type",
    c.attnum as "Ordinal Position",typnotnull as "Is Not Null",typdefault as "Default Value", t.typname as "Data Type",
    split_part(d.description,'|',1) as "Category", 
    split_part(d.description,'|',2) as "Source",
    split_part(d.description,'|',3) as "Transient/Derived",
    split_part(d.description,'|',4) as "Is PII",
    split_part(d.description,'|',5) as "Is Business Sensitive",
    split_part(d.description,'|',6) as "Description"  
  from pg_catalog.pg_attribute c
  inner join pg_class c1 on c.attrelid=c1.oid
  inner JOIN pg_type t on t.oid=c.atttypid
  inner join pg_catalog.pg_namespace n on c1.relnamespace=n.oid
  inner join pg_catalog.pg_description d on d.objoid=c1.oid AND c.attnum = d.objsubid
  where n.nspname='cemdm' and c.attnum > 0
  UNION ALL
  select 
    pn.nspname as "Schema",pc.relname "Table Name",null as "Column Name", 'Table' as "Type", 
    null as "Ordinal Position",null as "Is Not Null",null as "Default Value",null as "Data Type",
    split_part(pd.description,'|',1) as "Category", 
    split_part(pd.description,'|',2) as "Source",
    split_part(pd.description,'|',3) as "Transient/Derived",
    split_part(pd.description,'|',4) as "Is PII",
    split_part(pd.description,'|',5) as "Is Business Sensitive",
    split_part(pd.description,'|',6) as "Description"
  from pg_catalog.pg_description pd 
  inner join pg_class pc on pd.objoid = pc.oid
  inner join pg_catalog.pg_namespace pn on pc.relnamespace = pn.oid
  where pn.nspname = 'cemdm' and pd.objsubid = 0
) x
order by "Table Name", nvl("Ordinal Position",0);

The following screenshot shows a sample visualization which you can build on QuickSight.

For more information about running custom Amazon Redshift SQL using Amazon QuickSight, see Using the Query Editor.
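The glossary query relies on each table and column comment being a pipe-delimited string with six positions: category, source, transient/derived, PII flag, business-sensitivity flag, and description. The sketch below mimics that parsing in Python so you can check a comment string before attaching it to a table. The `split_part` function here is a local imitation of the SQL function (1-based, returning an empty string when the part is absent), and the sample comments are invented.

```python
FIELDS = ("Category", "Source", "Transient/Derived",
          "Is PII", "Is Business Sensitive", "Description")


def split_part(text, delimiter, position):
    """Imitate SQL split_part: 1-based, empty string if the part is missing."""
    parts = text.split(delimiter)
    return parts[position - 1] if 1 <= position <= len(parts) else ""


def parse_comment(description):
    """Map a pipe-delimited comment onto the glossary column names."""
    return {name: split_part(description, "|", i)
            for i, name in enumerate(FIELDS, start=1)}


# Invented examples: one complete comment and one that is too short.
full = parse_comment("Subscriber|CRM|Transient|Yes|No|Unique customer identifier")
short = parse_comment("Usage|DPI")
print(full)
```

A comment with fewer than six parts still parses, but the trailing glossary columns come back empty, which is an easy way to spot incomplete metadata.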

QuickSight lets you create a template from an existing analysis. You can use the resulting template to create a dashboard. For more information, see Evolve your analytics with Amazon QuickSight’s new APIs and theming capabilities. You can also embed QuickSight dashboards into your own apps, websites, and wikis without the need to provision and manage users (readers) in QuickSight. For more information, see New in Amazon QuickSight – session capacity pricing for large scale deployments, embedding in public websites, and developer portal for embedded analytics.

Cleaning up

To avoid incurring future charges, delete the resources you created. Manually delete anything created outside of the CloudFormation stack and then the stack itself.

Conclusion

In this post, I demonstrated how data analysts, data scientists, and advanced business users can easily query multiple data sources and generate actionable insights including user interest profiles, segments, and micro-segments. Downstream systems like campaign management systems, customer care portals, and customer-facing applications; internal teams like retention, marketing, CX, and network; and workloads like machine learning can greatly benefit from the insights generated from this solution. You can automate these insights and integrate them with northbound systems, and trigger them based on a schedule or an event.

I also demonstrated how business users are empowered with self-service analytics to help them perform data exploration and publish ready-made insights in the form of dashboards. You can also create stories to drive data-heavy conversations based on enriched data stored in Amazon Redshift or Amazon S3.

Perceiving customer behavior across multiple touchpoints is the key for any business to thrive. And the essence of this solution is to capitalize on data and drive CX and monetization initiatives holistically across your organization. This framework allows you to accelerate your journey towards improving CX and generating new revenue streams by using existing data assets.

You can progressively augment this solution by adding additional data sources to evolve into a customer data platform hosting 360° profiles of individual subscribers correlated from multiple data sources. This solution can further support new and existing marketing, partnerships, loyalty, retention, network planning, and network optimization initiatives to drive revenue growth and improve profitability while keeping subscribers happy and loyal. It also helps you define an organization-wide standard for data visualization, self-service analytics, metadata discovery, and data marketplace.

For more ways to expand this solution, consider the following services:

  • AWS Data Exchange makes it easy to find, subscribe to, and use third-party data in the cloud. You can merge it with in-house data assets to span existing insights across multiple domains.
  • Amazon Pinpoint is a flexible and scalable outbound and inbound marketing communications service. You can connect with customers over channels like email, SMS, push, or voice. You can segment and micro-segment your campaign audience for the right customer and personalize your messages with the right content.

As always, AWS welcomes feedback. This is a wide-open space to explore, so reach out to us if you want to dive deep into understanding how you can build this solution and more on AWS. Please submit comments or questions in the comments section.


About the Author

Vikas Omer is an analytics specialist solutions architect at Amazon Web Services. Vikas has a strong background in analytics, customer experience management (CEM) and data monetization, with over 11 years of experience in the telecommunications industry globally. With six AWS Certifications, including Analytics Specialty, he is a trusted analytics advocate to AWS customers and partners. He loves traveling, meeting customers, and helping them become successful in what they do.

Data monetization and customer experience optimization using telco data assets: Part 1

Post Syndicated from Vikas Omer original https://aws.amazon.com/blogs/big-data/part-1-data-monetization-and-customer-experience-optimization-using-telco-data-assets/

The landscape of the telecommunications industry is changing rapidly. For telecom service providers (TSPs), revenue from core voice and data services continues to shrink due to regulatory pressure and emerging OTT players that offer an attractive alternative. Despite increasing demand from customers for bandwidth, speed, and efficiency, TSPs are finding that the ROI from implementing new access technologies like 5G is insubstantial.

To overcome the risk of being relegated to a utility or dumb pipe, TSPs today are looking to diversify, adopting alternative business models to generate new revenue streams.

In recent times, adopting customer experience (CX) and data monetization initiatives has been a key theme across all industries. Although many Tier-1 TSPs are leading this transformation by using new technologies to improve CX and improve profitability, many TSPs have yet to embark on this challenging but rewarding journey.

Building and implementing a CX management and data monetization strategy

Data monetization is often misunderstood as making dollars by selling data, but what it really means is driving revenue by improving the top line or the bottom line through the use of data assets. The benefit can be tangible or intangible, internal or external.

According to Gartner, most data and analytics leaders are looking to increase investments in business intelligence (BI) and analytics (see the following study results).

The preceding visualization is from “The 2019 CIO Agenda: Securing a New Foundation for Digital Business”, published October 15, 2018.

Although external monetization opportunities are limited due to strict regulations, a plethora of opportunities exist for TSPs to monetize data both internally (regulated, but much less so than external use) and externally via a marketplace (highly regulated). If TSPs can shift their mindsets from selling data to using data insights for monetization and improving CX, they can adopt a significant number of use cases to realize an immediate positive impact.

Tapping and utilizing insights around customer behavior acts like a Swiss Army Knife for businesses. You can use these insights to drive CX, hyper-personalization and localization, micro-segmentation, subscriber retention, loyalty and rewards programs, network planning and optimization, internal and external data monetization, and more. The following are some use cases that can be driven using CX and data monetization strategies:

  • Segmentation/micro-segmentation (cross-sell, up-sell, targeted advertising, enhanced market locator); for example:
    • Identify targets for consuming baby products or up-selling a kids-related TV channel
    • Identify females in the age range of 18-35 to target for high-end beauty products or apparel

You can build hundreds of such segments.

  • Personalized loyalty and reward programs (incentivize customers with what they like). For example, movie tickets or discounts for a movie lover, or food coupons and deals for a food lover.
  • CX-driven network optimization (allocate more resources to streaming hotspots with high-value customers).
  • Identifying potential partners for joint promotions. For example, bundling device offers with a music app subscription.
  • Hyper-personalization. For example, personalized recommendations for on-portal apps and websites.
  • Next best action and next best offer. For example, intelligent bundling and packaging of offerings.

Challenges with driving CX and data monetization

In this digital era, TSPs consider data analytics a strategic pillar in their quest to evolve into a true data-driven organization. Although many TSPs are harnessing the power of data to drive and improve CX, there are technological gaps and challenges to baseline and formulate internal and external data monetization strategies. Some of these challenges include:

  • Non-overlapping technology investments for CX and data monetization due to misaligned business and IT initiatives
  • Huge CAPEX requirements to process massive volumes of data
  • Inability to unearth hidden insights due to siloed data initiatives
  • Inability to marry various datasets together due to missing pieces around data standardization techniques
  • Lack of user-friendly tools and techniques to discover, ingest, process, correlate, analyze, and consume the data
  • Inability to experiment and innovate with agility and low cost

In this two-part series, I demonstrate a working solution with an AWS CloudFormation template for how a TSP can use existing data assets to generate new revenue streams and improve and personalize CX using AWS services. I also include key pieces of information around data standardization, baselining an analytics data model to marry different datasets in the data warehouse, self-service analytics, metadata search, and media dictionary framework.

In this post, you deploy the stack using a CloudFormation template and follow simple steps to transform, enrich, and bring multiple datasets together so that they can be correlated and queried.

In part 2, you learn how advanced business users can query enriched data and derive meaningful insights using Amazon Redshift and Amazon Redshift Spectrum or Amazon Athena, enable self-service analytics for business users, and publish ready-made dashboards via Amazon QuickSight.

Solution overview

The main ingredient of this solution is Packet Switch (PS) probe data embedded with a deep packet inspection (DPI) engine, which can reveal a lot of information about user interests and usage behavior. This data is transformed and enriched with DPI media and device dictionaries, along with other standard telco transformations to deduce insights, profile and micro-segment subscribers. Enriched data is made available along with other transformed dimensional attributes (CRM, subscriptions, media, carrier, device and network configuration management) for rich slicing and dicing.

For example, the following QuickSight visualizations depict a use case to identify music lovers ages 18-55 with Apple devices. You can also generate micro-segments by capturing the top X subscribers by consumption or adding KPIs like recency and frequency.
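As a rough illustration of what "top X by consumption" and KPIs like recency and frequency mean in practice, the sketch below computes them from a handful of invented usage rows. The row layout (subscriber ID, activity date, bytes) and the function names are assumptions made for this example; in the solution itself these values would come from the aggregated DPI tables.

```python
from collections import defaultdict
from datetime import date


def summarize(rows, as_of):
    """Per subscriber: total volume, active days, and days since last use."""
    days = defaultdict(set)
    volume = defaultdict(int)
    for customer_id, day, total_bytes in rows:
        days[customer_id].add(day)
        volume[customer_id] += total_bytes
    return {
        cid: {
            "volume": volume[cid],
            "frequency": len(active),                 # distinct active days
            "recency_days": (as_of - max(active)).days,
        }
        for cid, active in days.items()
    }


def top_x(summary, x):
    """Return the x subscriber IDs with the highest total volume."""
    return sorted(summary, key=lambda cid: summary[cid]["volume"],
                  reverse=True)[:x]


# Invented usage rows: (subscriber, activity date, bytes).
rows = [
    ("A", date(2020, 6, 1), 500),
    ("A", date(2020, 6, 3), 700),
    ("B", date(2020, 6, 2), 2000),
    ("C", date(2020, 6, 3), 100),
]
summary = summarize(rows, as_of=date(2020, 6, 4))
segment = top_x(summary, 2)
print(segment, summary["A"])
```

Combining the three measures lets you separate, for instance, heavy but lapsed users from light but frequent ones within the same interest segment.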

The following diagram illustrates the workflow of the solution.

For this post, AWS CloudFormation sets up the required folder structure in Amazon Simple Storage Service (Amazon S3) and provides sample data and dictionary files. Most of the data included with the CloudFormation template is dummy data, and covers the following:

  • CRM
  • Subscription and subscription mapping
  • Network 3G & 4G configuration management
  • Operator PLMN
  • DPI and device dictionary
  • PS probe data

Descriptions of all the input datasets and attributes are available with AWS Glue Data Catalog tables and as part of Amazon Redshift metadata for all tables in Amazon Redshift.

The workflow for this post includes the following steps:

  1. Catalog all the files in the AWS Glue Data Catalog using the following AWS Glue data crawlers:
    1. DPI data crawler (to crawl incoming PS probe DPI data)
    2. Dimension data crawler (to crawl all dimension data)
  2. Update attribute descriptions in the Data Catalog (this step is optional).
  3. Create Amazon Redshift schema, tables, procedures, and metadata using an AWS Lambda function.
  4. Process each data source file using separate AWS Glue Spark jobs. These jobs enrich, transform, and apply business filtering rules before ingesting data into an Amazon Redshift cluster.
  5. Trigger Amazon Redshift hourly and daily aggregation procedures using Lambda functions to aggregate data from the raw table into hourly and daily tables.

Part 2 includes the following steps:

  1. Catalog the processed raw, aggregate, and dimension data in the Data Catalog using the DPI processed data crawler.
  2. Interactively query data directly from Amazon S3 using Amazon Athena.
  3. Enable self-service analytics using QuickSight to prepare and publish insights based on data residing in the Amazon Redshift cluster.

The workflow can change depending on the complexity of the environment and your use case, but the fundamental idea remains the same. For example, your use case could be processing PS probe DPI data in real time rather than in batch mode, keeping hot data in Amazon Redshift, storing cold and historical data on Amazon S3, or archiving data in Amazon S3 Glacier for regulatory compliance. Amazon S3 offers several storage classes designed for different use cases. You can move the data among these different classes based on Amazon S3 lifecycle properties. For more information, see Amazon S3 Storage Classes.
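
As a rough illustration of the tiering idea above, the following Python sketch builds a lifecycle configuration in the shape that the S3 `put_bucket_lifecycle_configuration` API accepts. The prefix `processed/dpi/` and the 30-day and 365-day thresholds are assumptions made for this example; they are not values used by the CloudFormation stack.

```python
def build_lifecycle_rules(prefix, days_to_ia=30, days_to_glacier=365):
    """Build an S3 lifecycle configuration that tiers objects under `prefix`.

    The thresholds are illustrative defaults, not recommendations.
    """
    rule_id = "tier-" + prefix.strip("/").replace("/", "-")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    # Warm data: infrequent access after `days_to_ia` days.
                    {"Days": days_to_ia, "StorageClass": "STANDARD_IA"},
                    # Cold data: archive after `days_to_glacier` days.
                    {"Days": days_to_glacier, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


# "processed/dpi/" is a hypothetical prefix chosen for this sketch.
config = build_lifecycle_rules("processed/dpi/")
print(config["Rules"][0]["ID"])
```

To apply it, you would pass the dictionary as `LifecycleConfiguration` to `put_bucket_lifecycle_configuration` on a boto3 S3 client, together with your own bucket name.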

Prerequisites

For this walkthrough, you should have the following prerequisites:

For more information about AWS Regions and where AWS services are available, see Region Table.

Creating your resources with AWS CloudFormation

To get started, create your resources with the following CloudFormation stack.

  1. Click the Launch Stack button below:
  2. Leave the parameters at their default, with the following exceptions:
    1. Enter RedshiftPassword and S3BucketNameParameter parameters, which aren’t populated by default.
    2. An Amazon S3 bucket name is globally unique, so enter a unique bucket name for S3BucketNameParameter.

The following screenshot shows the parameters for our use case.

  3. Choose Next.
  4. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  5. Choose Create stack.

It takes approximately 10 minutes to deploy the stack. For more information about the key resources deployed through the stack, see Data Monetization and Customer Experience (CX) Optimization using telco data assets: Amazon CloudFormation stack details. You can view all the resources on the AWS CloudFormation console. For instructions, see Viewing AWS CloudFormation stack data and resources on the AWS Management Console.

The CloudFormation stack we provide in this post serves as a baseline and is not a production-grade solution.

Building a Data Catalog using AWS Glue

You start by discovering sample data stored on Amazon S3 through an AWS Glue crawler. For more information, see Populating the AWS Glue Data Catalog. To catalog data, complete the following steps:

  1. On the AWS Glue console, in the navigation pane, choose Crawlers.
  2. Select DPIRawDataCrawler and choose Run crawler.
  3. Select DimensionDataCrawler and choose Run crawler.
  4. Wait for the crawlers to show the status Stopping.

The tables added against the DimensionDataCrawler and DPIRawDataCrawler crawlers should show 9 and 1, respectively.

  5. In the navigation pane, choose Tables.
  6. Verify the following 10 tables are created under the cemdm database:
    • d_crm_demographics
    • d_device
    • d_dpi_dictionary
    • d_network_cm_3g
    • d_network_cm_4g
    • d_operator_plmn
    • d_tac
    • d_tariff_plan
    • d_tariff_plan_desc
    • raw_dpi_incoming
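
If you prefer to script the crawler steps rather than use the console, a minimal sketch might look like the following. It assumes `glue` is a boto3 Glue client (for example `boto3.client("glue")`) and that the crawlers are deployed under exactly the names shown on the console; the polling interval and timeout are arbitrary choices for this example.

```python
import time


def run_crawlers(glue, names, poll_seconds=15, timeout_seconds=900):
    """Start each crawler, then wait until all of them return to READY."""
    for name in names:
        glue.start_crawler(Name=name)

    deadline = time.monotonic() + timeout_seconds
    pending = set(names)
    while pending:
        for name in sorted(pending):
            # A crawler reports RUNNING, then STOPPING, then READY.
            state = glue.get_crawler(Name=name)["Crawler"]["State"]
            if state == "READY":
                pending.discard(name)
        if pending:
            if time.monotonic() > deadline:
                raise TimeoutError(f"crawlers still running: {sorted(pending)}")
            time.sleep(poll_seconds)
```

Calling `run_crawlers(glue, ["DPIRawDataCrawler", "DimensionDataCrawler"])` would start both crawlers and block until they finish, after which the tables can be checked on the console as described above.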

Updating attribute descriptions in the Data Catalog

The AWS Glue Data Catalog has a comment field to store the metadata under each table in the AWS Glue database. Anybody who has access to this database can easily understand attributes coming from different data sources through metadata provided in the comment field. The CloudFormation stack includes a CSV file that contains a description of all the attributes from the source files. This file is used to update the comment field for all the Data Catalog tables this stack deployed. This step is not mandatory to proceed with the workflow. However, if you want to update the comment field against each table, complete the following steps:

  1. On the Lambda console, in the navigation pane, choose Functions.
  2. Choose the GlueCatalogUpdate function.
  3. Configure a test event by choosing Configure test events.
  4. For Event name, enter Test.
  5. Choose Create.
  6. Choose Test.

You should see a message that the test succeeded, which means that the Data Catalog attribute description update is complete.

Attributes of the table under the Data Catalog database should now have descriptions in the Comment column. For example, the following screenshot shows the d_operator_plmn table.

Creating Amazon Redshift schema, tables, procedures, and metadata

To create schema, tables, procedures, and metadata in Amazon Redshift, complete the following steps:

  1. On the Lambda console, in the navigation pane, choose Functions.
  2. Choose the RedshiftDDLCreation function.
  3. Choose Configure test events.
  4. For Event name, enter Test.
  5. Choose Create.
  6. Choose Test.

You should see a message that the test succeeded, which means that the schema, tables, procedures, and metadata generation is complete.

Running AWS Glue ETL jobs

AWS Glue provides the serverless, scalable, and distributed processing capability to transform and enrich your datasets. To run AWS Glue extract, transform, and load (ETL) jobs, complete the following steps:

  1. On the AWS Glue console, in the navigation pane, choose Jobs.
  2. Select the following jobs (one at a time) and choose Run job on the Action menu:
    • d_customer_demographics
    • d_device
    • d_dpi_dictionary
    • d_location
    • d_operator_plmn
    • d_tac
    • d_tariff_plan
    • d_tariff_plan_desc
    • f_dpi_enrichment

You can run all these jobs in parallel.

All dimension data jobs should finish successfully within 3 minutes, and the fact data enrichment job should finish within 5 minutes.

  3. Verify the jobs are complete by selecting each job and checking Run status on the History tab.
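
Because the jobs can run in parallel, the same steps can be scripted. The sketch below assumes `glue` is a boto3 Glue client and that the job names match the list above; the polling interval is an arbitrary choice, and the function simply reports each job's final state rather than retrying failures.

```python
import time

# Job run states after which a run will not change any further.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}


def run_jobs(glue, job_names, poll_seconds=30):
    """Start all jobs at once and return {job_name: final JobRunState}."""
    run_ids = {
        name: glue.start_job_run(JobName=name)["JobRunId"] for name in job_names
    }
    final = {}
    while len(final) < len(run_ids):
        for name, run_id in run_ids.items():
            if name in final:
                continue
            run = glue.get_job_run(JobName=name, RunId=run_id)["JobRun"]
            if run["JobRunState"] in TERMINAL_STATES:
                final[name] = run["JobRunState"]
        if len(final) < len(run_ids):
            time.sleep(poll_seconds)
    return final
```

Any job whose final state is not `SUCCEEDED` should be inspected on the History tab before you continue.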

Aggregating hourly and daily DPI data in Amazon Redshift

To aggregate hourly and daily sample data in Amazon Redshift using Lambda functions, complete the following steps:

  1. On the Lambda console, in the navigation pane, choose Functions.
  2. Choose the RedshiftDPIHourlyAgg function.
  3. Choose Configure test events.
  4. For Event name, enter Test.
  5. Choose Create.
  6. Choose Test.

You should see a message that the test succeeded, which means that hourly aggregation is complete.

  7. In the navigation pane, choose Functions.
  8. Choose the RedshiftDPIDailyAgg function.
  9. Choose Configure test events.
  10. For Event name, enter Test.
  11. Choose Create.
  12. Choose Test.

You should see a message that the test succeeded, which means that daily aggregation is complete.
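
The console Test button performs a synchronous invocation, so the same check can be sketched in Python. This example assumes `lambda_client` is a boto3 Lambda client and that the functions accept an empty event, as the console test does; the deployed function names may differ from the short names used in this post, so treat the names as placeholders.

```python
import json


def invoke_and_check(lambda_client, function_name, event=None):
    """Invoke a Lambda function synchronously and raise if it reports an error."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",  # wait for the result, like the Test button
        Payload=json.dumps(event or {}).encode("utf-8"),
    )
    body = response["Payload"].read().decode("utf-8")
    # FunctionError is only present when the function itself failed.
    if response.get("FunctionError"):
        raise RuntimeError(f"{function_name} failed: {body}")
    return json.loads(body) if body else None
```

Running the hourly function before the daily one mirrors the order of the console steps.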

Both the hourly and daily Lambda functions are hardcoded with the date and hour of the sample data to aggregate. To make them generic, uncomment the few commented-out lines in each function and comment out the hardcoded ones. Both functions also take offset parameters that determine how far back in time to run the aggregations. However, this isn’t required for this walkthrough.
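
The offset idea can be illustrated with a short sketch. This is not the code shipped in the Lambda functions; the helper names and the default offsets of one hour and one day are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def hour_to_aggregate(now, offset_hours=1):
    """Return the start of the hour that lies `offset_hours` before `now`."""
    target = now - timedelta(hours=offset_hours)
    return target.replace(minute=0, second=0, microsecond=0)


def day_to_aggregate(now, offset_days=1):
    """Return the calendar date that lies `offset_days` before `now`."""
    return (now - timedelta(days=offset_days)).date()


now = datetime.now(timezone.utc)
print(hour_to_aggregate(now), day_to_aggregate(now))
```

With an offset of 1, a run shortly after midnight aggregates the last hour of the previous day, which gives late-arriving data some time to land before it is summarized.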

You can schedule these functions with CloudWatch. However, this is not required for this walkthrough.

So far, we have completed the following:

  1. Deployed the CloudFormation stack.
  2. Cataloged sample raw data by running DimensionDataCrawler and DPIRawDataCrawler AWS Glue crawlers.
  3. Updated attribute descriptions in the AWS Glue Data Catalog by running the GlueCatalogUpdate Lambda function.
  4. Created Amazon Redshift schema, tables, stored procedures, and metadata through the RedshiftDDLCreation Lambda function.
  5. Ran all AWS Glue ETL jobs to transform raw data and load it into the respective Amazon Redshift tables.
  6. Aggregated hourly and daily data from enriched raw data into hourly and daily Amazon Redshift tables by running the RedshiftDPIHourlyAgg and RedshiftDPIDailyAgg Lambda functions.

Cleaning up

If you don’t plan to proceed to part 2 of this series and want to avoid incurring future charges, delete the resources you created by deleting the CloudFormation stack.

Conclusion

In this post, I demonstrated how you can easily transform, enrich, and bring multiple telco datasets together in an Amazon Redshift data warehouse cluster. You can correlate these datasets to produce multi-dimensional insights from several angles, like subscriber, network, device, subscription, roaming, and more.

In part 2 of this series, I demonstrate how you can enable data analysts, scientists, and advanced business users to query data from Amazon Redshift or Amazon S3 directly.

As always, AWS welcomes feedback. This is a wide space to explore, so reach out to us if you want a deep dive into building this solution and more on AWS. Please submit comments or questions in the comments section.


About the Author

Vikas Omer is an analytics specialist solutions architect at Amazon Web Services. Vikas has a strong background in analytics, customer experience management (CEM), and data monetization, with over 11 years of experience in the telecommunications industry globally. With six AWS Certifications, including Analytics Specialty, he is a trusted analytics advocate to AWS customers and partners. He loves traveling, meeting customers, and helping them become successful in what they do.

Security updates for Tuesday

Post Syndicated from original https://lwn.net/Articles/844054/rss

Security updates have been issued by CentOS (dnsmasq, net-snmp, and xstream), Debian (mutt), Gentoo (cfitsio, f2fs-tools, freeradius, libvirt, mutt, ncurses, openjpeg, PEAR-Archive_Tar, and qtwebengine), openSUSE (chromium, mutt, stunnel, and virtualbox), Red Hat (cryptsetup, gnome-settings-daemon, and net-snmp), Scientific Linux (xstream), SUSE (postgresql, postgresql12, postgresql13 and rubygem-nokogiri), and Ubuntu (mutt).

Backblaze Hard Drive Stats for 2020

Post Syndicated from original https://www.backblaze.com/blog/backblaze-hard-drive-stats-for-2020/

In 2020, Backblaze added 39,792 hard drives and as of December 31, 2020 we had 165,530 drives under management. Of that number, there were 3,000 boot drives and 162,530 data drives. We will discuss the boot drives later in this report, but first we’ll focus on the hard drive failure rates for the data drive models in operation in our data centers as of the end of December. In addition, we’ll welcome back Western Digital to the farm and get a look at our nascent 16TB and 18TB drives. Along the way, we’ll share observations and insights on the data presented and as always, we look forward to you doing the same in the comments.

2020 Hard Drive Failure Rates

At the end of 2020, Backblaze was monitoring 162,530 hard drives used to store data. For our evaluation, we remove from consideration 231 drives which were used for testing purposes and those drive models for which we did not have at least 60 drives. This leaves us with 162,299 hard drives in 2020, as listed below.

Observations

The 231 drives not included in the list above were either used for testing or did not have at least 60 drives of the same model at any time during the year. The data for all drives, data drives, boot drives, etc., is available for download on the Hard Drive Test Data webpage.

For drive models with fewer than 250,000 drive days, any conclusions about drive failure rates are not justified. There is not enough data over the year-long period to reach any conclusions. We present the models with fewer than 250,000 drive days for completeness only.

For drive models with over 250,000 drive days over the course of 2020, the Seagate 6TB drive (model: ST6000DX000) leads the way with a 0.23% annualized failure rate (AFR). This model was also the oldest, in average age, of all the drives listed. The 6TB Seagate model was followed closely by the perennial contenders from HGST: the 4TB drive (model: HMS5C4040ALE640) at 0.27%, the 4TB drive (model: HMS5C4040BLE640) at 0.27%, the 8TB drive (model: HUH728080ALE600) at 0.29%, and the 12TB drive (model: HUH721212ALE600) at 0.31%.

The AFR for 2020 for all drive models was 0.93%, which was less than half the AFR for 2019. We’ll discuss that later in this report.

What’s New for 2020

We had a goal at the beginning of 2020 to diversify the number of drive models we qualified for use in our data centers. To that end, we qualified nine new drive models during the year, as shown below.

Actually, there were two additional hard drive models which were new to our farm in 2020: the 16TB Seagate drive (model: ST16000NM005G) with 26 drives, and the 16TB Toshiba drive (model: MG08ACA16TA) with 40 drives. Each fell below our 60-drive threshold and was not listed.

Drive Diversity

The goal of qualifying additional drive models proved to be prophetic in 2020, as the effects of Covid-19 began to creep into the world economy in March 2020. By that time we were well on our way towards our goal and while being less of a creative solution than drive farming, drive model diversification was one of the tactics we used to manage our supply chain through the manufacturing and shipping delays prevalent in the first several months of the pandemic.

Western Digital Returns

The last time a Western Digital (WDC) drive model was listed in our report was Q2 2019. There are still three 6TB WDC drives in service and 261 WDC boot drives, but neither group is listed in our reports, so there have been no WDC drives to report on until now. In Q4, a total of 6,002 WDC 14TB drives (model: WUH721414ALE6L4) were installed and were operational as of December 31st.

These drives obviously share their lineage with the HGST drives, but they report their manufacturer as WDC versus HGST. The model numbers are similar, with the first three characters changing from HUH to WUH and the last three characters changing from 604, for example, to 6L4. We don’t know the significance of that change; perhaps it is the factory location, a firmware version, or some other designation. If you know, let everyone know in the comments. As with all of the major drive manufacturers, the model number carries patterned information relating to each drive model and is not randomly generated, so the 6L4 string would appear to mean something useful.

WDC is back with a splash, as the AFR for this drive model is just 0.16%—that’s with 6,002 drives installed, but only for 1.7 months on average. Still, with only one failure during that time, they are off to a great start. We are looking forward to seeing how they perform over the coming months.

New Models From Seagate

There are six Seagate drive models that were new to our farm in 2020. Five of these models are listed in the table above and one model had only 26 drives, so it was not listed. These drives ranged in size from 12TB to 18TB and were used for both migration replacements as well as new storage. As a group, they totaled 13,596 drives and amassed 1,783,166 drive days with just 46 failures for an AFR of 0.94%.
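
The figures in the paragraph above can be checked with the usual drive-days calculation, in which AFR is the number of failures per drive year of operation, expressed as a percentage. The function name below is our own; the inputs are the numbers quoted in the text.

```python
def annualized_failure_rate(failures, drive_days):
    """Return the AFR as a percentage: failures per drive year of operation."""
    drive_years = drive_days / 365
    return failures / drive_years * 100


# New Seagate models in 2020: 46 failures over 1,783,166 drive days.
seagate_new = annualized_failure_rate(failures=46, drive_days=1_783_166)
print(f"{seagate_new:.2f}%")  # 0.94%
```

Because the denominator is drive days rather than drive count, a model that was installed late in the year contributes only the days it was actually in service.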

Toshiba Delivers More Zeros

The new Toshiba 14TB drive (model: MG07ACA14TA) and the new Toshiba 16TB (model: MG08ACA16TEY) were introduced to our data centers in 2020 and they are putting up zeros, as in zero failures. While each drive model has only been installed for about two months, they are off to a great start.

Comparing Hard Drive Stats for 2018, 2019, and 2020

The chart below compares the AFR for each of the last three years. The data for each year is inclusive of that year only and for the drive models present at the end of each year.

The Annualized Failure Rate for 2020 Is Way Down

The AFR for 2020 dropped below 1%, down to 0.93%. In 2019, it stood at 1.89%. That’s over a 50% drop year over year. So why was the 2020 AFR so low? The answer: It was a group effort. To start, the older drives (4TB, 6TB, 8TB, and 10TB) as a group were significantly better in 2020, decreasing from a 1.35% AFR in 2019 to a 0.96% AFR in 2020. At the other end of the size spectrum, we added over 30,000 larger drives (14TB, 16TB, and 18TB), which as a group recorded an AFR of 0.89% for 2020. Finally, the 12TB drives as a group had a 2020 AFR of 0.98%. In other words, whether a drive was old or new, or big or small, it performed well in our environment in 2020.

Lifetime Hard Drive Stats

The chart below shows the lifetime annualized failure rates of all of the drive models in production as of December 31, 2020.

AFR and Confidence Intervals

Confidence intervals give you a sense of the usefulness of the corresponding AFR value. A narrow confidence interval range is better than a wider range, with a very wide range meaning the corresponding AFR value is not statistically useful. For example, the confidence interval for the 18TB Seagate drives (model: ST18000NM000J) ranges from 1.5% to 45.8%. This is very wide and one should conclude that the corresponding 12.54% AFR is not a true measure of the failure rate of this drive model. More data is needed. On the other hand, when we look at the 14TB Toshiba drive (model: MG07ACA14TA), the range is from 0.7% to 1.1% which is fairly narrow, and our confidence in the 0.9% AFR is much more reasonable.
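
The post does not say how these intervals are calculated. As one common possibility, the sketch below computes an exact Poisson interval for the failure count and scales it by drive years; this is an assumption for illustration, not necessarily the method used for the chart. It relies only on the standard library and is intended for small failure counts, which is where the interval width matters most.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam); fine for small k and lam."""
    term = total = math.exp(-lam)
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total


def _solve_decreasing(f, lo, hi, iterations=200):
    """Bisection for the root of a function that decreases on [lo, hi]."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid  # still above zero, so the root lies to the right
        else:
            hi = mid
    return (lo + hi) / 2


def failure_count_interval(failures, confidence=0.95):
    """Exact two-sided Poisson interval for the expected number of failures."""
    alpha = 1 - confidence
    bracket = failures + 10 * math.sqrt(failures + 1) + 20  # generous upper bracket
    upper = _solve_decreasing(
        lambda lam: poisson_cdf(failures, lam) - alpha / 2, 0.0, bracket
    )
    if failures == 0:
        return 0.0, upper
    lower = _solve_decreasing(
        lambda lam: poisson_cdf(failures - 1, lam) - (1 - alpha / 2), 0.0, bracket
    )
    return lower, upper


def afr_interval(failures, drive_days, confidence=0.95):
    """Scale the failure-count interval to an AFR interval in percent."""
    lower, upper = failure_count_interval(failures, confidence)
    drive_years = drive_days / 365
    return lower / drive_years * 100, upper / drive_years * 100
```

With only one or two failures, the upper bound is several times the point estimate, which is why a model with few drive days shows such a wide range.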

3,000 Boot Drives

We always exclude boot drives from our reports as their function is very different from a data drive. While it may not seem obvious, having 3,000 boot drives is a bit of a milestone. It means we have 3,000 Backblaze Storage Pods in operation as of December 31st. All of these Storage Pods are organized into Backblaze Vaults of 20 Storage Pods each, for a total of 150 Backblaze Vaults.

Over the last year or so, we moved from using hard drives to SSDs as boot drives. We have a little over 1,200 SSDs acting as boot drives today. We are validating the SMART and failure data we are collecting on these SSD boot drives. We’ll keep you posted if we have anything worth publishing.

Are you interested in learning more about the trends in the 2020 drive stats? Join our upcoming webinar: “Backblaze Hard Drive Report: 2020 Year in Review Q&A” with drive stats author, Andy Klein, on February 3.

The Hard Drive Stats Data

The complete data set used to create the information used in this review is available on our Hard Drive Test Data page. You can download and use this data for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone; it is free.

If you just want the summarized data used to create the tables and charts in this blog post you can download the ZIP file containing the CSV files for each chart.

Good luck and let us know if you find anything interesting.

The post Backblaze Hard Drive Stats for 2020 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

2021-01-26 work environment

Post Syndicated from original https://vasil.ludost.net/blog/?p=3446

I have written before about my work environment, and since I am the one who onboards new people to the operations team at our company, I end up explaining how to work more efficiently (we even ran an internal training where we showed how each of us works). Since I never wrote down my explanations, I'll write them up once, in as much detail as I can; it may be useful to someone.

In general, my setup is the same whether I work at home or in the office, although that is a bit awkward logistically (and I may have to buy another square monitor at some point).

Hardware

I'll start with the hardware side.

For a chair I have been using a Markus from IKEA for about 10 years, and at my last two companies I convinced people that this is the comfortable chair. It lasts a long time, it is comfortable, and it is not insanely expensive (it is not cheap either; it was around 300 lv last time I checked). I use it without armrests (because that way it slides further under the desk when I want it to).

I try to have a wide enough desk, at least a meter and a half. With the two monitors and the two laptops (more on them below) I need the space, and I also have a habit of piling things on it; there are known cases where I have taken over neighboring desks for part of the equipment I work with. In principle it is good for the desk to have a thick, sturdy top so it doesn't wobble much (which sometimes happens when I type), but my sturdier desk is currently being used by Elena (and I couldn't mount my stands on it anyway), so I have come up with extra bracing for the monitor.

I picked up the idea of monitor stands at StorPool; I use two Arctic Z1 Basic. They let me arrange my monitors exactly the way I want, and they free up space on the desk for other things to end up in (e.g. power strips).

A relatively recent photo (with the wrong keyboard)

For many years I had a normal desktop computer, then I switched to a laptop, and I am currently on the “compromise” of having a laptop that I use practically as a workstation (I haven't opened it in 3 months, it just sits there and creaks). At the moment it is a T70 motherboard (made by some Chinese enthusiasts from 51nb.com) in a T60p laptop case, which took a day of angle-grinder therapy to fit. I got it for the 4:3 display and the chance to have a decently fast laptop (I have already put 32GB of memory in it, because 16 wasn't enough) with a tolerable keyboard; after the Tx20 series, Lenovo started fitting even nastier keyboards, and staying on the T420 was getting slow and painful (and the supply of laptops was running out).
If I set out to get a new one, I will probably ignore these details and mainly look for good battery life and a screen that isn't too small (and for something that doesn't break easily; there are kids running around, after all).
Around that time I also switched mostly to wired networking, since there is no point in burdening my wifi for something I don't carry around.

Around the pandemic and working from home, I took Elena's old laptop and now use it for all video/audio conferences. That way I can work on something and still see who I am talking to, without loading my work laptop with nonsense.

For the calls themselves I have two audio devices: a Jabra Speak for most of the time and a Jabra wireless headset for when I need to be quieter (or to be in the next room while the call is going on 🙂 ). The Jabra Speak in particular is a truly great device, with very good sound and microphone quality and almost unbelievable echo cancellation; in the end I have bought at least 5-6 of them and given them to various people who needed one. I strongly recommend it: it is convenient, it is easy to hear, and it doesn't torture your ears.

Monitors are a very interesting topic. As I said above, I chose a laptop specifically with a 4:3 screen, because 16:9 and 16:10 don't give me enough vertical space, and eventually a few friends got fed up with me and bought me a square monitor (Eizo EV2730Q). It took me a few days to get used to the idea, and since then I can't imagine how I lived without it: everything fits, I have room to lay out many windows, and when I decide to write a text like this one, I can open it full screen in a larger font and write away. As far as I can tell, a larger monitor than this would be hard for me to use, but this one I can take in at a glance and a half and comfortably watch 2-3 logs scroll while I work on some system.
For the laptop I use for video conferences I have a 24″ monitor, which is nothing special (I got it cheap on eBay); its purpose is to let me see who I am talking to.

The keyboard is an almost religious topic. After all sorts of keyboards, some time with an IBM Model M, and then long years of Lenovo T-series laptop keyboards, I took up the Model M again when I switched to the square monitor. In the office, however, they revolted against me ("we can't hear ourselves think while you type"), and as a compromise I got a Das Keyboard with brown switches, which isn't bad, but ... it's not the same thing. With the pandemic I was able to go back to the Model M (since it can't be heard through the wall and doesn't wake the kids), and for New Year I even got one of the new-production ones from Unicomp (their site is pckeyboard.com). Price-wise, with shipping from the US it came to about the same money as the Das Keyboard.
In terms of feel, for me there is no better keyboard, and I even catch myself copy-pasting less, because I make fewer mistakes while typing. The keyboard's only problem is the noise, but the new version sounds a bit bassier and is more tolerable for those around me (still, if we return to an office someday and I have to share a room with someone, either they will have to be deaf or I will have to change keyboards again).

A photo of the old and the new keyboards

As a pointing device I use a wireless Logitech trackball: I don't have to move my hand around the desk much, and for me at least it is comfortable to operate with my thumb. It needs cleaning from time to time (which is not much of an issue with modern optical mice), but I am used to that from the old days, and the kids really enjoy the big ball…

Software

I have been on Debian for about 22-23 years, and I haven't found a reason to change. I have had periods of using packages from Ubuntu, but overall I like it as a distribution; I use it for workstations, servers, and whatever else comes up, and I am still very happy with it.

On top of that there is standard X, and on it XFCE with Compiz as the window manager. The reason I use Compiz is that it is the fastest window manager I have seen, especially for the things I do (for example, switching workspaces doesn't flicker in any way). It offers many ways to configure it (which I like, because I can fit it to my needs).
The main things I configure are:
– focus follows mouse, i.e. I don't have to click on a window to give it focus and work in it; this speeds things up a lot and makes it possible to type in a window that isn't on top;
– 11 workspaces (accessible with ctrl-alt-1..-). At first there were 10, but they weren't enough (they aren't enough now either, but ctrl-alt-= seems to be taken and is very close to ctrl-alt-backspace :) );
– I use the Desktop cube plugin for many workspaces, and I have set the speeds to maximum, so in practice I switch between two workspaces instantly.

For most of my communication I use pidgin with a bucketload of plugins for the various messengers. I have an IRC or two, two jabbers, two slacks, skype, telegram, and probably a few other things. Pidgin is one of the programs for which I compile and debug some things myself (I use it from a package; once upon a time I also compiled it myself). Having most of my communication in one place is quite convenient, because I don't have to switch between 10 things.
Since its support for some services isn't very good, I also use their web versions, for example the company Slack, because that way notifications show up properly, and from time to time Telegram's (for example, when someone sends me a contact).

I also use claws-mail as a mail client, because I have a lot of mail and can't stand web interfaces for mail. I have used evolution before and tried thunderbird a little, but I find them slow and inconvenient, while claws copes very well with tens or hundreds of thousands of messages, pgp, and so on.

I also use two browsers, one chrome and one firefox: google docs and similar things work much better in chrome, and my electronic signature only works in firefox, so I have somehow divided what I use where. The two browsers are the reason for the 32 GB of memory; they eat a lot (and often spin up the fans).

To track what I do I use gtimelog, a very convenient application for quickly noting what you have been doing. That way I keep rough track of what I did during the day, have some idea of when I should get up from the desk, and can say what I have been doing if someone asks (or if I wonder myself).

And the application I practically live in: xfce4-terminal 🙂 I have tried various terminals, but this one is fast enough and supports the right kind of transparency. I have given it a large scrollback and set the Go Mono font (which is a serif font, but for some reason I find it more comfortable than the rest). Something I use a lot and don't recommend to anyone is using the transparency to watch the terminal underneath, mainly to see whether something moves (which drives most people who look at my screen crazy).

For a text editor I use vim, to which I haven't done anything special, except that for some directories I have it run a git commit every time it saves a file; this is very handy for files with notes and the like.

Work process

All of this is tied together in my slightly strange way of working…

I use the workspaces to organize my different tasks:
– The first is for the messenger, gtimelog, and a vim with notes;
– The second is for mail: claws-mail and some number of open emails that I need to reply to (if I don't open them, there is no way I will remember them);
– the third through eighth are for terminals. Roughly one workspace per task, sometimes two;
– the ninth is for chrome, the tenth for firefox;
– the last is for things I hardly look at, for example the VPN or the music control at home.

The main principle I try to follow is that everything I need for a given task should be one key or key combination away. For example, I keep two terminals on one workspace and switch between them when needed. If I need more, I put them on the neighboring workspace. If it is something more unusual, I start tmux and split it in two (with my monitor, both horizontal and vertical splits work very well).

Checking my messenger when something lights up is an easy combination, and so is checking my mail or going to the more important tasks (which I try to keep on the first workspaces); all of it is done directly with my left hand. Going to the browsers is a bit slower, but those usually need a context switch in my head anyway.

Something else that helps me a lot, since I spend much of my time writing to people, is that I switch between Cyrillic and Latin with caps lock: one key, in an easy place. The shift+alt idea has always seemed terrible to me, and it very often triggers context menus somewhere. In addition, I use kbdd to remember which language I was using in each window, so I don't have to switch constantly when moving between communication and the terminals.

Miscellaneous

There are various things I have stopped using because they got on my nerves. Perhaps the main one is Gnome and most of its components: the terminal is slow, so is the window manager, and the main work of its developers was removing the settings I used. After it once took me half a day to convince my cursor not to blink (or some similar nonsense), I switched to Xfce.

In the same way I dropped evolution, because it ate gigantic amounts of memory and kept getting slower and slower. Claws-mail does the same job for me (and does it better) with several times fewer resources.

Firefox 85 released

Post Syndicated from original https://lwn.net/Articles/844015/rss

Version 85 of
the Firefox browser
has been released. The headline change appears to
be the isolation of internal caches to defeat the use of “supercookies” to
track users; see this
blog entry
for details. “In fact, there are many different
caches trackers can abuse to build supercookies. Firefox 85 partitions all
of the following caches by the top-level site being visited: HTTP cache,
image cache, favicon cache, HSTS cache, OCSP cache, style sheet cache, font
cache, DNS cache, HTTP Authentication cache, Alt-Svc cache, and TLS
certificate cache.”
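
To make the idea of partitioning concrete, here is a minimal Python sketch of a cache keyed by the top-level site as well as the resource URL. It is only an illustration of the concept described in the quote, not Firefox's implementation; the class name `PartitionedCache` and the example hostnames are assumptions.

```python
class PartitionedCache:
    """Toy cache whose entries are keyed by (top-level site, resource URL)."""

    def __init__(self):
        self._entries = {}

    def put(self, top_level_site, resource_url, value):
        # The same resource URL gets a separate entry per top-level site.
        self._entries[(top_level_site, resource_url)] = value

    def get(self, top_level_site, resource_url):
        # A value stored while visiting one site is invisible from another.
        return self._entries.get((top_level_site, resource_url))
```

With this keying, an identifier a tracker hides in a cached resource while the user visits one site cannot be read back when the same resource is requested from a different site.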

New book: Get Started with MicroPython on Raspberry Pi Pico

Post Syndicated from Phil King original https://www.raspberrypi.org/blog/new-book-get-started-with-micropython-on-raspberry-pi-pico/

So, you’ve got a brand new Raspberry Pi Pico and want to know how to get started with this tiny but powerful microcontroller? We’ve got just the book for you.


Beginner-friendly

In Get Started with MicroPython on Raspberry Pi Pico, you’ll learn how to use the beginner-friendly language MicroPython to write programs and connect hardware to make your Raspberry Pi Pico interact with the world around it. Using these skills, you can create your own electro-mechanical projects, whether for fun or to make your life easier.


After taking you on a guided tour of Pico, the book shows you how to get it up and running with a step-by-step illustrated guide to soldering pin headers to the board and installing the MicroPython firmware via a computer.

Programming basics


Next, we take you through the basics of programming in MicroPython, a Python-based programming language developed specifically for microcontrollers such as Pico. From there, we explore the wonderful world of physical computing and connect a variety of electronic components to Pico using a breadboard. Controlling LEDs and reading input from push buttons, you’ll start by creating a pedestrian crossing simulation, before moving on to projects such as a reaction game, burglar alarm, temperature gauge, and data logger.


Raspberry Pi Pico also supports the I2C and SPI protocols for communicating with devices, which we explore by connecting it up to an LCD display. You can even use MicroPython to take advantage of one of Pico’s most powerful features, Programmable I/O (PIO), which we explore by controlling NeoPixel LED strips.

Get your copy today!

You can buy Get Started with MicroPython on Raspberry Pi Pico now from the Raspberry Pi Press online store. If you don’t need the lovely new book, with its new-book smell, in your hands in real life, you can download a PDF version for free (or a small voluntary contribution).

STOP PRESS: we’ve spotted an error in the first print run of the book, affecting the code examples in Chapters 4 to 7. We’re sorry! Fortunately it’s easy for readers to correct in their own code; see here for everything you need to know. We’ve already corrected this in the PDF version.

The post New book: Get Started with MicroPython on Raspberry Pi Pico appeared first on Raspberry Pi.

State-Sponsored Threat Actors Target Security Researchers

Post Syndicated from boB Rudis original https://blog.rapid7.com/2021/01/26/state-sponsored-threat-actors-target-security-researchers/


This blog was co-authored by Caitlin Condon, VRM Security Research Manager, and Bob Rudis, Senior Director and Chief Security Data Scientist.

On Monday, Jan. 25, 2021, Google’s Threat Analysis Group (TAG) published a blog on a widespread social engineering campaign that targeted security researchers working on vulnerability research and development. The campaign, which Google attributed to North Korean (DPRK) state-sponsored actors, has been active for several months and sought to compromise researchers using several methods.

Rapid7 is aware that many security researchers were targeted in this campaign, and information is still developing. While we currently have no evidence that we were compromised, we are continuing to investigate logs and examine our systems for any of the IOCs listed in Google’s analysis. We will update this post with further information as it becomes available.

Organizations should take note that this was a highly sophisticated attack, one important enough to its orchestrators that they were willing to burn an as-yet unknown exploit path on it. This event is the latest in a chain of attacks (e.g., those targeting SonicWall, VMware, Mimecast, Malwarebytes, Microsoft, CrowdStrike, and SolarWinds) that demonstrates a significant increase in threat activity targeting cybersecurity firms with legitimately sophisticated campaigns. Scenarios like these should become standard components of tabletop exercises and active defense plans.

North Korean-attributed social engineering campaign

Google discovered that the DPRK threat actors had built credibility by establishing a vulnerability research blog and several Twitter profiles to interact with potential targets. They published videos of their alleged exploits, including a YouTube video of a fake proof-of-concept (PoC) exploit for CVE-2021-1647—a high-profile Windows Defender zero-day vulnerability that garnered attention from both security researchers and the media. The DPRK actors also published “guest” research (likely plagiarized from other researchers) on their blog to further build their reputation.

The malicious actors then used two methods to social engineer targets into accepting malware or visiting a malicious website. According to Google:

  • After establishing initial communications, the actors would ask the targeted researcher if they wanted to collaborate on vulnerability research together, and then provide the researcher with a Visual Studio Project. Within the Visual Studio Project would be source code for exploiting the vulnerability, as well as an additional pre-compiled library (DLL) that would be executed through Visual Studio Build Events. The DLL is custom malware that would immediately begin communicating with actor-controlled command and control (C2) domains.

Visual Studio Build Events command executed when building the provided VS Project files. Image provided by Google.

  • In addition to targeting users via social engineering, Google also observed several cases where researchers have been compromised after visiting the actors’ blog. In each of these cases, the researchers followed a link on Twitter to a write-up hosted on blog[.]br0vvnn[.]io, and shortly thereafter, a malicious service was installed on the researcher’s system and an in-memory backdoor would begin beaconing to an actor-owned command and control server. At the time of these visits, the victim systems were running fully patched and up-to-date Windows 10 and Chrome browser versions. As of Jan. 26, 2021, Google was unable to confirm the mechanism of compromise.

The blog the DPRK threat actors used to execute this zero-day drive-by attack was posted on Reddit as long as three months ago. The actors also used a range of social media and communications platforms to interact with targets—including Telegram, Keybase, Twitter, LinkedIn, and Discord. As of Jan. 26, 2021, many of these profiles have been suspended or deactivated.

Rapid7 customers

Google’s threat intelligence includes information on IOCs, command-and-control domains, actor-controlled social media accounts, and compromised domains used as part of the campaign. Rapid7’s MDR team is deploying IOCs and behavior-based detections. These detections will also be available to InsightIDR customers later today. We will update this blog post with further information as it becomes available.

Defender guidance

TAG noted in their blog post that they have so far only seen actors targeting Windows systems. As of the evening of Jan. 25, 2021, researchers across many companies confirmed on Twitter that they had interacted with the DPRK actors and/or visited the malicious blog. Organizations that believe their researchers or other employees may have been targeted should conduct internal investigations to determine whether indicators of compromise are present on their networks.

At a minimum, responders should:

  • Ensure members of all security teams are aware of this campaign and encourage individuals to report if they believe they were targeted by these actors.
  • Search web traffic, firewall, and DNS logs for evidence of contacts to the domains and URLs provided by Google in their post.
  • According to Rapid7 Labs’ forward DNS archive, the br0vvnn[.]io apex domain has had two discovered fully qualified domain names (FQDNs)—api[.]br0vvnn[.]io and blog[.]br0vvnn[.]io—over the past four months with IP addresses 192[.]169[.]6[.]31 and 192[.]52[.]167[.]169, respectively. Contacts to those IPs should also be investigated in historical access records.
  • Check for evidence of the provided hashes on all systems, starting with those operated and accessed by members of security teams.
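
As a rough illustration of the log search described in the list above, the following Python sketch scans log lines for the br0vvnn[.]io domain (including subdomains) and the two IP addresses named there. The function name `find_iocs` and the plain-text, one-entry-per-line log format are assumptions made for the example; real firewall, proxy, or DNS logs need their own parsing, and Google's post lists further indicators that are not included here.

```python
import re

# Indicators from the list above, written in refanged form for matching.
IOC_DOMAIN = "br0vvnn.io"
IOC_IPS = {"192.169.6.31", "192.52.167.169"}

# Matches the apex domain or any subdomain of it (api., blog., ...).
_DOMAIN_RE = re.compile(
    r"(?<![\w.-])(?:[\w-]+\.)*" + re.escape(IOC_DOMAIN) + r"(?![\w-])(?!\.\w)",
    re.IGNORECASE,
)
# Matches dotted-quad strings; candidates are then compared against IOC_IPS.
_IP_RE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?!\d)(?!\.\d)")


def find_iocs(lines):
    """Return (line_number, indicator) pairs for every indicator hit."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for match in _DOMAIN_RE.finditer(line):
            hits.append((number, match.group(0).lower()))
        for match in _IP_RE.finditer(line):
            if match.group(0) in IOC_IPS:
                hits.append((number, match.group(0)))
    return hits
```

Passing an open log file (for example `find_iocs(open("proxy.log"))`, where the file name is hypothetical) returns the line numbers worth a closer look; an empty result only means these particular indicators were not seen in that file.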

Moving forward, organizations and individuals should heed Google’s advice that “if you are concerned that you are being targeted, we recommend that you compartmentalize your research activities using separate physical or virtual machines for general web browsing, interacting with others in the research community, accepting files from third parties and your own security research.”


Getting your notifications via Signal

Post Syndicated from Brian van Baekel original https://blog.zabbix.com/getting-your-notifications-via-signal/13286/

Recently, WhatsApp pushed out a new privacy policy announcing that it would share more data with Facebook, causing an exodus to other platforms. Signal is one of the more popular ones, alongside Telegram. Both are great alternatives, but I prefer Signal because it is open source, offers end-to-end encryption, and, last but not least, because of its business model (living on donations instead of selling your data).

Typically, Zabbix sends notifications to whatever medium you’ve chosen when a problem is detected. We all know the email messages, the various webhook integrations with Slack, MS Teams, Jira, etc., and perhaps even some text message integrations. Now, if we’re migrating to Signal, we suddenly have access to the Signal API and can use it to receive Zabbix notifications. Nice!

There is only one drawback. You need a separate phone number to register against Signal. Don’t use your own phone number – unless you want to lose the ability to use Signal ;(

There are various ways to get a phone number for this purpose:

  • Use the phone number of your current SMS gateway
  • Use the company phone number (a lot of cloud PBX are providing the option to receive the verification email)
  • Purchase a prepaid phone number.
  • Use a service like Twilio

You only need to receive one text message; the rest of the communication goes via the internet.

Time to get rid of WhatsApp and move to Signal! But… how do you use Signal to get your notifications?

Signal-cli

Although we could build everything from scratch, talking to the Signal API directly, there is a nice implementation available that lets us talk to Signal within a few minutes: Signal-cli

The GitHub page is very comprehensive when it comes to getting Signal-cli installed, but of course it does not cover anything related to Zabbix.

Configuration tasks

For this guide, we’re using:

  • Centos 8
  • Zabbix 5.2

signal-cli installation

First, let’s install the Signal-cli utility. To do so, we need to resolve its Java dependency by installing the OpenJDK package:

dnf -y install java-11-openjdk-devel.x86_64

After this installation, we should be good to continue with the installation of signal-cli. According to their installation guide, this should be sufficient:

export VERSION="0.7.3"
wget https://github.com/AsamK/signal-cli/releases/download/v"${VERSION}"/signal-cli-"${VERSION}".tar.gz
sudo tar xf signal-cli-"${VERSION}".tar.gz -C /opt
sudo ln -sf /opt/signal-cli-"${VERSION}"/bin/signal-cli /usr/local/bin/

At the time of writing, the most recent version is 0.7.3, and that’s what we’re installing here. If in the future a new version is released, of course you should install that!

If everything went as expected, we should be able to register ourselves with Signal.

signal-cli registration

Since we want to execute these commands by Zabbix, we must make sure the registration is done with the correct user on the Zabbix server, otherwise you will get the following error message:

Unregistered user error

(ERROR App – User +19293771253 is not registered.)

To prevent this error, let’s do the registration against Signal as the zabbix user:

Important: The USERNAME (your phone number) must include the country calling code, i.e. the number must start with a “+” sign. Replace everything between the < > in the following examples with your own values.

runuser -l zabbix -c 'signal-cli -u <NUMBER> register'

Now, check for incoming text messages on this phone number. Within seconds you should receive a 6-digit code in the following format: xxx-xxx

Once you’ve received the text, it’s time to complete the registration:

runuser -l zabbix -c 'signal-cli -u <NUMBER> verify <CODE>'

Since we’re running these commands as a different user, we won’t see the output of them. Let’s just test!

Sending messages from the command line is straightforward:

runuser -l zabbix -c 'signal-cli -u <NUMBER> send -m <MESSAGE> <RECEIVER NUMBER>'

You will see the message id as output. Simply ignore it, since it’s not relevant at this point.

Within seconds:

It works! Great.

So now we’ve got this part covered, time to get the AlertScript set up, before heading to the frontend.

Zabbix AlertScript setup

OK, so now that we’ve got the registration done, we need to make sure Zabbix can utilise it. To do so, we use a very old method. Although it would have made more sense to use the webhook option, that would have meant building the communication with Signal from scratch.

So AlertScripts it is. In your terminal/SSH session with the Zabbix server open a new file with this command: vi /usr/lib/zabbix/alertscripts/signal.sh and insert the following contents:

#!/bin/bash
signal-cli -u '+19293771253' send -m "$1" "$2"

That’s right, just 2 lines. After saving the file, change the owner and set the permissions:

chown zabbix:zabbix /usr/lib/zabbix/alertscripts/signal.sh
chmod 700 /usr/lib/zabbix/alertscripts/signal.sh

and it’s time to move to our frontend.

Zabbix mediatype configuration

In the frontend, go to Administration -> Mediatypes and create a new mediatype:

Signal Mediatype

Name: Signal
Type: Script
Script name: signal.sh
Script parameters:
    {ALERT.MESSAGE}
    {ALERT.SENDTO}

Don’t forget to configure some Message templates as well (second tab in the Mediatype configuration). You can just use the defaults if you click on ‘add’.

Zabbix media configuration

Next step. Navigate to Administration -> Users (or just open your own user profile) and create a new media:


Type: Signal
Sendto: <your number>
When active / severity as per needs

Important: The USERNAME (your phone number) must include the country calling code, i.e. the number must start with a “+” sign
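
If you want to catch badly formatted numbers before they end up in a media entry, a small check along the following lines may help. It is only a sketch: the function name `is_valid_signal_number` is made up for this example, and the pattern (a “+” followed by 7 to 15 digits, not starting with 0) is an assumption loosely based on the E.164 numbering format rather than anything Signal or Zabbix enforces.

```python
import re

# "+" followed by 7 to 15 digits, the first of which is not 0.
# The length bounds are an assumption loosely based on E.164.
_NUMBER_RE = re.compile(r"\+[1-9]\d{6,14}")


def is_valid_signal_number(number: str) -> bool:
    """Return True if `number` looks like +<country code><subscriber number>."""
    return _NUMBER_RE.fullmatch(number) is not None
```

For example, `is_valid_signal_number("+19293771253")` is accepted, while the same number without the leading “+” or with spaces in it is rejected.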

We’re almost there; all that’s left is some configuration of the actions.

Zabbix action configuration

This step is only needed if you are sending notifications right now via a specific mediatype. If you configured the ‘send only to’ option to ‘- All -‘ there is nothing to change, and it will work straight away!

Otherwise, navigate to Configuration -> Actions and find the action you want to change, and in the Operations, Recovery operations and Update operations change the ‘send only to’ option to ‘Signal’

Save your action and it’s time to test – Generate some problem to confirm the implementation actually works.

Wrap up

That’s it. By now you should have a working implementation where Zabbix is sending notifications to Signal. The setup is straightforward and easy to configure. Nevertheless, if you need help getting this going, we (Opensource ICT Solutions) offer consultancy services as well, and are more than happy to help you out!

 

Elasticsearch – Scalability and Multitenancy [slides]

Post Syndicated from Bozho original https://techblog.bozho.net/elasticsearch-scalability-and-multitenancy-slides/

Last week I gave a talk in a local tech group about my experience with Elasticsearch at LogSentinel, and how we achieve multitenancy and scalability.

Obviously, the topic of scalability is huge and it can’t be fully covered in 45 minutes, but I tried presenting the main aspects from the application perspective (I entirely skipped the Ops perspective, as it was a developer audience). The list of resources at the end of the slides shows some of the sources of my “research” on the topic, which I recommend going through.

Below are the slides (the talk was not in English):

I hope it’s a useful intro to the topic and the main conclusion is – it’s counterintuitive if you are used to relational databases, and some internals (shards, Lucene segments) “leak” through the abstractions to influence the application design (as per the law of leaky abstractions).

The post Elasticsearch – Scalability and Multitenancy [slides] appeared first on Bozho's tech blog.

Understanding memory usage in your Java application with Amazon CodeGuru Profiler

Post Syndicated from Fernando Ciciliati original https://aws.amazon.com/blogs/devops/understanding-memory-usage-in-your-java-application-with-amazon-codeguru-profiler/

“Where has all that free memory gone?” This is the question we ask ourselves every time our application emits that dreaded OutOfMemoryError just before it crashes. Amazon CodeGuru Profiler can help you find the answer.

Thanks to its brand-new memory profiling capabilities, troubleshooting and resolving memory issues in Java applications (or almost anything that runs on the JVM) is much easier. AWS launched the CodeGuru Profiler Heap Summary feature at re:Invent 2020. This is the first step in helping us, developers, understand what our software is doing with all that memory it uses.

The Heap Summary view shows a list of Java classes and data types present in the Java Virtual Machine heap, alongside the amount of memory they’re retaining and the number of instances they represent. The following screenshot shows an example of this view.


Figure: Amazon CodeGuru Profiler Heap Summary feature

Because CodeGuru Profiler is a low-overhead, production profiling service designed to be always on, it can capture and represent how memory utilization varies over time, providing helpful visual hints about the object types and the data types that exhibit a growing trend in memory consumption.

In the preceding screenshot, we can see that several lines on the graph are trending upwards:

  • The red top line, horizontal and flat, shows how much memory has been reserved as heap space in the JVM. In this case, we see a heap size of 512 MB, which can usually be configured in the JVM with command line parameters like -Xmx.
  • The second line from the top, blue, represents the total memory in use in the heap, regardless of object type.
  • The third, fourth, and fifth lines show how much memory space each specific type has been using historically in the heap. We can easily spot that java.util.LinkedHashMap$Entry and java.lang.UUID display growing trends, whereas byte[] has a flat line and seems stable in memory usage.

Types that exhibit a constantly growing trend in memory utilization over time deserve a closer look. Profiler helps you focus your attention on these cases. By combining the information presented by the Profiler with your own knowledge of your application and code base, you can evaluate whether the amount of memory being used for a specific data type can be considered normal, or if it might be a memory leak: the unintentional holding of memory by an application due to a failure to free up unused objects. In our example above, java.util.LinkedHashMap$Entry and java.lang.UUID are good candidates for investigation.

To make this functionality available to customers, CodeGuru Profiler uses the power of Java Flight Recorder (JFR), which is now openly available with Java 8 (since OpenJDK release 262) and above. The Amazon CodeGuru Profiler agent for Java, which already does an awesome job capturing data about CPU utilization, has been extended to periodically collect memory retention metrics from JFR and submit them for processing and visualization via Amazon CodeGuru Profiler. Thanks to its high stability and low overhead, the Profiler agent can be safely deployed to services in production, because it is exactly there, under real workloads, that really interesting memory issues are most likely to show up.

Summary

For more information about CodeGuru Profiler and other AI-powered services in the Amazon CodeGuru family, see Amazon CodeGuru. If you haven’t tried the CodeGuru Profiler yet, start your 90-day free trial right now and understand why continuous profiling is becoming a must-have in every production environment. For Amazon CodeGuru customers who are already enjoying the benefits of always-on profiling, this new feature is available at no extra cost. Just update your Profiler agent to version 1.1.0 or newer, and enable Heap Summary in your agent configuration.

 

Happy profiling!

AWS is the first global cloud service provider to comply with the new K-ISMS-P standard

Post Syndicated from Seulun Sung original https://aws.amazon.com/blogs/security/aws-is-the-first-global-cloud-service-provider-to-comply-with-the-new-k-isms-p-standard/

We’re excited to announce that Amazon Web Services (AWS) has achieved certification under the Korea-Personal Information & Information Security Management System (K-ISMS-P) standard (effective from December 16, 2020 to December 15, 2023). The assessment by the Korea Internet & Security Agency (KISA) covered the operation of infrastructure (including compute, storage, networking, databases, and security) in the AWS Asia Pacific (Seoul) Region. AWS was the first global cloud service provider (CSP) to obtain K-ISMS certification (the previous version of K-ISMS-P) back in 2017. Now AWS is the first global CSP to achieve compliance with the K-ISMS portion of the new K-ISMS-P standard.

Sponsored by KISA and affiliated with the Korean Ministry of Science and ICT (MSIT), K-ISMS-P serves as a standard for evaluating whether enterprises and organizations operate and manage their information security management systems consistently and securely, such that they thoroughly protect their information assets. The new K-ISMS-P standard combined the K-ISMS and K-PIMS (Personal Information Management System) standards with updated control items. Accordingly, the new K-ISMS certification and K-ISMS-P certification (personal information–focused) are introduced under the updated standard.

This year’s audit includes 110 services running in the Asia Pacific (Seoul) Region. The Availability Zone launched in 2020 has also been added to the certification scope.

This certification helps enterprises and organizations across South Korea, regardless of industry, meet KISA compliance requirements more efficiently. Achieving this certification demonstrates the proactive approach AWS has taken to meet compliance set by the South Korean government and to deliver secure AWS services to customers. In addition, we’ve launched Quick Start and Operational Best Practices (conformance pack) pages to provide customers with a compliance framework that they can utilize for their K-ISMS-P compliance needs. Enterprises and organizations can use these toolkits and AWS certification to reduce the effort and cost of getting their own K-ISMS-P certification. You can download the AWS K-ISMS certification under the K-ISMS-P standard from AWS Artifact. To learn more about the AWS K-ISMS certification, see the AWS K-ISMS page. If you have any questions, don’t hesitate to contact your AWS Account Manager.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Seulun Sung

Seulun is a Security Audit Program Manager at AWS, leading security certification programs, with a focus on the K-ISMS-P program in South Korea. She has a decade of experience in deploying global policies and processes to local Regions and helping customers adopt regulations. She is passionate about helping to build customers’ trust and provide them assurance on cloud security.

Serving driver-partners data at scale using mirror cache

Post Syndicated from Grab Tech original https://engineering.grab.com/mirror-cache-blog

Since the early days, driver-partners have been the centerpiece of the wide range of services and features provided by the Grab platform. Over time, many backend microservices were developed to support our driver-partners, covering earnings, ratings, insurance, and so on. All of these microservices require certain information, such as name, phone number, email, and active car types, to tailor the services provided to the driver-partners.

We built the Drivers Data service to provide driver-partners data to other microservices. The service attracts a high QPS and handles 10K requests during peak hours. Over the years, we have tried different strategies to serve driver-partners data in a resilient and cost-effective manner while keeping response times low. In this blog post, we talk about mirror cache, an in-memory local caching solution built to serve driver-partners data efficiently.

What we started with

Figure 1. Drivers Data service architecture

Our Drivers Data service previously used MySQL DB as persistent storage and two caching layers – standalone local cache (RAM of the EC2 instances) as primary cache and Redis as secondary for eventually consistent reads. With this setup, the cache hit ratio was very low.

Figure 2. Request flow chart

We opted for a cache aside strategy. So when a client request comes, the Drivers Data service responds in the following manner:

  • If data is present in the in-memory cache (local cache), then the service directly sends back the response.
  • If data is not present in the in-memory cache and found in Redis, then the service sends back the response and updates the local cache asynchronously with data from Redis.
  • If data is not present either in the in-memory cache or Redis, then the service responds back with the data fetched from the MySQL DB and updates both Redis and local cache asynchronously.
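
The three cases above can be sketched in Python as follows. This is a simplified illustration, not Grab's implementation: the class name `DriversDataReader` is hypothetical, plain dictionaries stand in for the local cache, Redis, and MySQL, and the cache updates happen inline here, whereas the service performs them asynchronously.

```python
class DriversDataReader:
    """Cache-aside lookup: local cache first, then Redis, then the database."""

    def __init__(self, local_cache, redis, database):
        # Dict-like stand-ins for the real stores (an assumption of this sketch).
        self.local_cache = local_cache
        self.redis = redis
        self.database = database

    def get(self, driver_id):
        """Return (value, source), where source names the tier that answered."""
        if driver_id in self.local_cache:
            return self.local_cache[driver_id], "local"
        if driver_id in self.redis:
            value = self.redis[driver_id]
            self.local_cache[driver_id] = value  # asynchronous in the real service
            return value, "redis"
        value = self.database[driver_id]  # raises KeyError for unknown drivers
        self.redis[driver_id] = value  # both updates are asynchronous in the service
        self.local_cache[driver_id] = value
        return value, "mysql"
```

Returning the source alongside the value is purely a convenience of the sketch; it makes it easy to count how many responses come from each tier.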
Figure 3. Percentage of response from different sources

The measurement of the response source revealed that during peak hours ~25% of the requests were being served via standalone local cache, ~20% by MySQL DB, and ~55% via Redis.

The low cache hit rate is caused by the driver-partners data loading patterns: a low request frequency per driver over time, but a high frequency within short periods. When a driver-partner is a candidate for a job or is involved in an ongoing job, different services make multiple requests to the Drivers Data service to fetch that specific driver-partner’s information. The frequency of calls for a specific driver-partner drops if they are not involved in the job allocation process or are not doing any job at the moment.

While low frequency per driver over time hurts the Redis cache hit rate, high frequency in short amounts of time mostly contributes to the in-memory cache hit rate. In our investigations, we found that the local caches of different nodes in the Drivers Data service cluster were making redundant calls to Redis and the DB to fetch data that was already present in another node’s local cache.

By making the cached data available in memory on every instance while it is in active use, we could greatly increase the in-memory cache hit rate, and that’s what we did.

Mirror cache design goals

We set the following design goals:

  • Support a local least recently used (LRU) cache use-case.
  • Support active cache invalidation.
  • Support best effort replication between local cache instances (EC2 instances). If any instance successfully fetches the latest data from the database, then it should try to replicate or mirror this latest data across all the other nodes in the cluster. If replication fails and the item is expired or not found, then the nodes should fetch it from the database.
  • Support async data replication across nodes, ensuring that updates for the same key happen only with more recent data. Any older update is ignored in favor of the data currently in the cache. The ordering of cache updates is not guaranteed due to the async replication.
  • Ability to handle auto-scaling.

The building blocks

Figure 4. Mirror cache

The mirror cache library runs alongside the Drivers Data service inside each of the EC2 instances of the cluster. The two main components are in-memory cache and replicator.

In-memory cache

The in-memory cache is used to store multiple key/value pairs in RAM. There is a TTL associated with each key/value pair. We wanted a cache that provides a high hit ratio, bounded memory use, high throughput, and concurrency. After evaluating several options, we went with Dgraph’s open-source concurrent caching library Ristretto as our in-memory local cache. We were particularly impressed by its use of the TinyLFU admission policy to ensure a high hit ratio.

Replicator

The replicator is responsible for mirroring/replicating each key/value entry among all the live instances of the Drivers Data service. The replicator has three main components: Membership Store, Notifier, and gRPC Server.

Membership Store

The Membership Store registers callbacks with our service discovery service to notify mirror cache in case any nodes are added or removed from the Drivers Data service cluster.

It maintains two maps – nodes in the same AZ (AWS availability zone) as itself (the current node of the Drivers Data service in which mirror cache is running) and the nodes in the other AZs.

Notifier

Each service (Drivers Data) node runs a single instance of mirror cache, so effectively each node has one notifier. The notifier’s job is to:

  • Combine several (key/value) pairs updates to form a batch.
  • Propagate the batch updates among all the nodes in the same AZ as itself.
  • Send the batch updates to exactly one notifier (node) in each of the other AZs, which in turn is responsible for updating all the nodes in its own AZ with the latest batch of data. This communication technique helps to reduce cross-AZ data transfer overheads.
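
The fan-out described in the list above can be sketched in Python as follows. This is an illustration under stated assumptions, not the library’s code: the function name `plan_fanout`, the data shapes, and the random choice of the remote node are all hypothetical. The sketch also assumes that same-AZ peers receive the batch with a "Nothing" replication type (do not forward) and the single node in each remote AZ receives it with "SameRZ" (forward within its own AZ), which matches the replication types described later in this post.

```python
import random


def plan_fanout(self_node, same_az_nodes, other_az_nodes, choose=random.choice):
    """Return (target_node, replication_type) pairs for one batch of updates.

    same_az_nodes: nodes in this node's AZ (may include self_node).
    other_az_nodes: dict mapping an AZ name to the list of nodes in that AZ.
    """
    plan = []
    # Every peer in our own AZ gets the batch directly and must not forward it.
    for node in same_az_nodes:
        if node != self_node:
            plan.append((node, "Nothing"))
    # Exactly one node per remote AZ gets the batch and re-propagates it there.
    for az in sorted(other_az_nodes):
        nodes = other_az_nodes[az]
        if nodes:
            plan.append((choose(nodes), "SameRZ"))
    return plan
```

With three AZs of N nodes each, this plan sends only two cross-AZ messages per batch instead of 2N, which is the saving the last bullet refers to.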

In the case of auto-scaling, there is a warm-up period during which the notifier doesn’t notify the other nodes in the cluster. This is done to minimize duplicate data propagation. The warm-up period is configurable.

gRPC Server

An exclusive gRPC server runs for mirror cache. The different nodes of the Drivers Data service use this server to receive new cache updates from the other nodes in the cluster.

Here’s the structure of each cache update entity:

message Entity {
    string key = 1; // Key for cache entry.
    bytes value = 2; // Value associated with the key.
    Metadata metadata = 3; // Metadata related to the entity.
    replicationType replicate = 4; // Further actions to be undertaken by the mirror cache after updating its own in-memory cache.
    int64 TTL = 5; // TTL associated with the data.
    bool delete = 6; // If delete is set as true, then mirror cache needs to delete the key from its local cache.
}

enum replicationType {
    Nothing = 0; // Stop propagation of the request.
    SameRZ = 1; // Notify the nodes in the same Region and AZ.
}

message Metadata {
    int64 updatedAt = 1; // Same as updatedAt time of DB.
}

The server first checks if the local cache should update this new value or not. It tries to fetch the existing value for the key. If the value is not found, then the new key/value pair is added. If there is an existing value, then it compares the updatedAt time to ensure that stale data is not updated in the cache.
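
A minimal Python sketch of this check is shown below. The function name `apply_update` and the dictionary-based cache are assumptions for illustration, and treating an update with an equal updatedAt as stale is a choice made here, since the post does not say how ties are handled.

```python
def apply_update(cache, key, value, updated_at):
    """Apply an incoming update unless the cache already holds newer data.

    cache maps key -> (value, updated_at). Returns True if the cache changed.
    """
    existing = cache.get(key)
    # Ignore updates that are not strictly newer than what we already have.
    if existing is not None and existing[1] >= updated_at:
        return False
    cache[key] = (value, updated_at)
    return True
```

Because replication is asynchronous and updates may arrive out of order, comparing timestamps on arrival is what keeps a delayed older update from overwriting fresher data.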

If the replicationType is Nothing, then the mirror cache stops further replication. If the replicationType is SameRZ, then the mirror cache tries to propagate this cache update among all the nodes in the same AZ as itself.
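
The update handling described above could look roughly like the following Python sketch. It mirrors the Entity fields shown earlier but is not the actual server code: `apply_update`, the dict-based cache, and the decision to treat an update with an equal updatedAt as stale (and not forward it) are assumptions for illustration. TTL handling is left out.

```python
from dataclasses import dataclass
from enum import Enum


class ReplicationType(Enum):
    NOTHING = 0   # Stop propagation of the request.
    SAME_RZ = 1   # Notify the nodes in the same Region and AZ.


@dataclass
class Entity:
    key: str
    value: bytes
    updated_at: int  # mirrors Metadata.updatedAt
    replicate: ReplicationType = ReplicationType.NOTHING
    delete: bool = False


def apply_update(cache, entity):
    """Apply one incoming update to a local dict cache.

    Returns (accepted, forward): whether the update was accepted, and
    whether it should be propagated to the other nodes in the same AZ.
    """
    existing = cache.get(entity.key)
    # Assumption: an update that is not newer than the cached entry is stale.
    if existing is not None and entity.updated_at <= existing.updated_at:
        return False, False
    if entity.delete:
        cache.pop(entity.key, None)
    else:
        cache[entity.key] = entity
    forward = entity.replicate is ReplicationType.SAME_RZ
    return True, forward
```

Comparing updatedAt before writing is what keeps a late-arriving batch from overwriting a fresher value that a node already holds.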

Run at scale

Figure 5. Drivers Data Service new architecture

The behavior of the service hasn’t changed and the requests are being served in the same manner as before. The only difference here is the replacement of the standalone local cache in each of the nodes with mirror cache. It is the responsibility of mirror cache to replicate any cache updates to the other nodes in the cluster.

After mirror cache was fully rolled out to production, we rechecked our response-source metrics and saw a huge improvement. The graph showed that during peak hours ~75% of responses came from the in-memory local cache, about 15% were served by the MySQL DB, and a further 10% came from Redis.

The local cache hit ratio was 0.75, a jump of 0.5 from before, and there was also a 5% drop in the number of DB calls.

Figure 6. New percentage of response from different sources

Limitations and future improvements

Mirror cache is eventually consistent, so it is not a good choice for systems that need strong consistency.

Mirror cache stores all its data in volatile memory (RAM), which is wiped during deployments, resulting in a temporary load increase on Redis and the DB.

Also, many new driver-partners are added to the Grab system every day, and we might need to increase the cache size to maintain a high hit ratio. To address these issues, we plan to use SSDs in the future to store part of the data and use RAM only for hot data.

Conclusion

Mirror cache really helped us scale the Drivers Data service better and serve driver-partners data to the different microservices at low latencies. It also helped us achieve our original goal of an increase in the local cache hit ratio.

We also extended mirror cache in some other services and found similar promising results.


A huge shout out to Haoqiang Zhang and Roman Atachiants for their inputs into the final design. Special thanks to the Driver Backend team at Grab for their contribution.



Running cost optimized Spark workloads on Kubernetes using EC2 Spot Instances

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/running-cost-optimized-spark-workloads-on-kubernetes-using-ec2-spot-instances/

This post is written by Kinnar Sen, Senior Solutions Architect, EC2 Spot 

Apache Spark is an open-source, distributed processing system used for big data workloads. It provides API operations to perform multiple tasks such as streaming, extract, transform, load (ETL), query, machine learning (ML), and graph processing. Spark supports four different types of cluster managers (Spark standalone, Apache Mesos, Hadoop YARN, and Kubernetes), which are responsible for scheduling and allocating resources in the cluster. Spark has offered native Kubernetes support since 2018 (Spark 2.3). AWS customers that have already chosen Kubernetes as their container orchestration tool can also choose to run Spark applications in Kubernetes, increasing the effectiveness of their operations and compute resources.

In this post, I illustrate the deployment of a scalable, resilient, and cost-optimized Spark application using Kubernetes via Amazon Elastic Kubernetes Service (Amazon EKS) and Amazon EC2 Spot Instances. Learn how to save money on big data workloads by implementing this solution.

Overview

Amazon EC2 Spot Instances

Amazon EC2 Spot Instances let you take advantage of unused EC2 capacity in the AWS Cloud. Spot Instances are available at up to a 90% discount compared to On-Demand Instance prices. A capacity pool is a group of EC2 instances that belong to a particular instance family, size, and Availability Zone (AZ). If EC2 needs capacity back for On-Demand Instance usage, Spot Instances can be interrupted by EC2 with a two-minute notification. There are many graceful ways to handle the interruption to ensure that the application is well architected for resilience and fault tolerance. This can be automated via the application and/or infrastructure deployments. Spot Instances are ideal for stateless, fault-tolerant, loosely coupled, and flexible workloads that can handle interruptions.

Amazon Elastic Kubernetes Service

Amazon EKS is a fully managed Kubernetes service that makes it easy for you to run Kubernetes on AWS without needing to install, operate, and maintain your own Kubernetes control plane. It provides a highly available and scalable managed control plane. It also provides managed worker nodes, which let you create, update, or shut down worker nodes for your cluster with a single command. It is a great choice for deploying flexible and fault-tolerant containerized applications. Amazon EKS supports creating and managing Amazon EC2 Spot Instances using Amazon EKS-managed node groups following Spot best practices. This enables you to take advantage of the steep savings and scale that Spot Instances provide for interruptible workloads running in your Kubernetes cluster. Using EKS-managed node groups with Spot Instances requires less operational effort compared to using self-managed nodes. In addition to launching Spot Instances in managed node groups, it is possible to specify multiple instance types in EKS-managed node groups. You can find more in this blog.

Apache Spark and Kubernetes

When a Spark application is submitted to the Kubernetes cluster, the following happens:

  • A Spark driver is created.
  • The driver and the executors run within pods.
  • The Spark driver then requests executors, which are scheduled to run within pods. The executors are managed by the driver.
  • The application is launched and, once it completes, the executor pods are cleaned up. The driver pod persists the logs and remains in a completed state until the pod is cleared by garbage collection or manually removed. The driver in a completed state does not consume any memory or compute resources.

Spark Deployment on Kubernetes Cluster

When a Spark application runs on clusters managed by Kubernetes, the native Kubernetes scheduler is used. It is possible to schedule the driver/executor pods on a subset of available nodes. The applications can be launched by a vanilla ‘spark-submit’, a workflow orchestrator like Apache Airflow, or the Spark operator. I use vanilla ‘spark-submit’ in this blog. Amazon EMR on EKS is also able to schedule Spark applications on EKS clusters as described in this launch blog, but it is out of scope for this post.

Cost optimization

For any organization running big data workloads there are three key requirements: scalability, performance, and low cost. As the size of data increases, there is demand for more compute capacity and the total cost of ownership increases. It is critical to optimize the cost of big data applications. Big Data frameworks (in this case, Spark) are distributed to manage and process high volumes of data. These frameworks are designed for failure, can run on machines with different configurations, and are inherently resilient and flexible.

When Spark is deployed on Kubernetes, the executor pods can be scheduled on EC2 Spot Instances and the driver pods on On-Demand Instances. This reduces the overall cost of deployment, because Spot Instances can save up to 90% over On-Demand Instance prices. It also enables faster results by scaling out executors running on Spot Instances. Spot Instances, by design, can be interrupted when EC2 needs the capacity back. If a driver pod is running on a Spot Instance that is interrupted, the application fails and must be resubmitted. To avoid this situation, the driver pod can be scheduled on On-Demand Instances only. This adds a layer of resiliency to the Spark application running on Kubernetes. To cost optimize the deployment, all the executor pods are scheduled on Spot Instances, as that’s where the bulk of the compute happens. Spark’s inherent resiliency has the driver launch new executors to replace the ones that fail due to Spot interruptions.

There are a couple of key points to note here.

  • The idea is to start with the minimum number of nodes for both On-Demand and Spot Instances (one each) and then auto-scale using Cluster Autoscaler and EC2 Auto Scaling. Cluster Autoscaler for AWS provides integration with Auto Scaling groups. If there are not sufficient resources, the driver and executor pods go into a pending state. The Cluster Autoscaler detects pods in the pending state and scales worker nodes within the identified Auto Scaling group in the cluster using EC2 Auto Scaling.
  • The scaling for On-Demand and Spot nodes is independent of one another. So, if multiple applications are launched, the driver and executor pods can be scheduled in different node groups independently, per the resource requirements. This helps reduce job failures due to lack of resources for the driver, thus adding to the overall resiliency of the system.
  • Using EKS managed node groups
    • This requires significantly less operational effort compared to using self-managed node groups and enables:
      • Automatic enforcement of Spot best practices like the Capacity Optimized allocation strategy, Capacity Rebalancing, and use of multiple instance types.
      • Proactive replacement of Spot nodes using rebalance notifications.
      • Managed draining of Spot nodes via rebalance recommendations.
    • The nodes are auto-labeled so that the pods can be scheduled with NodeAffinity.
      • eks.amazonaws.com/capacityType: SPOT
      • eks.amazonaws.com/capacityType: ON_DEMAND

Now that you understand the products and best practices used in this tutorial, let’s get started.

Tutorial: running Spark in EKS managed node groups with Spot Instances

In this tutorial, I walk through the steps to launch cost-optimized and resilient Spark jobs inside Kubernetes clusters running on EKS. I launch a word-count application that counts the words in an Amazon Customer Review dataset and writes the output to an Amazon S3 folder. To run the Spark workload on Kubernetes, make sure you have eksctl and kubectl installed on your computer or in an AWS Cloud9 environment. You can run this by using an AWS IAM user or role that has the AdministratorAccess policy attached to it, or check the minimum required permissions for using eksctl. The Spot node groups in the Amazon EKS cluster can be launched in either a managed or a self-managed way; in this post I use the former. The config files for this tutorial can be found here. The job is launched in cluster mode.

Create Amazon S3 Access Policy

First, I must create an Amazon S3 access policy to allow the Spark application to read/write from Amazon S3. Amazon S3 Access is provisioned by attaching the policy by ARN to the node groups. This associates Amazon S3 access to the NodeInstanceRole and, hence, the node groups then have access to Amazon S3. Download the Amazon S3 policy file from here and modify the <<output folder>> to an Amazon S3 bucket you created. Run the following to create the policy. Note the ARN.

aws iam create-policy --policy-name spark-s3-policy --policy-document file://spark-s3.json

Cluster and node groups deployment

Create an EKS cluster using the following command:

eksctl create cluster --name=sparkonk8 --node-private-networking --without-nodegroup --asg-access --region=<<AWS Region>>

The cluster takes approximately 15 minutes to launch.

Create the nodegroup using the nodeGroup config file. Replace the <<Policy ARN>> string using the ARN string from the previous step.

eksctl create nodegroup -f managedNodeGroups.yml

Scheduling driver/executor pods

The driver and executor pods can be assigned to nodes using affinity. PodTemplates can be used to configure details that the Spark launch configuration does not support by default. This feature is available from Spark 3.0.0. Here, requiredDuringScheduling node affinity is used to schedule the driver and executor pods. Sample podTemplates have been uploaded here.
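
To make the affinity rule concrete, here is a small Python sketch that builds a minimal pod template pinned to one capacity type. It is only an illustration and may differ from the sample templates linked in this post: the `pod_template` helper is hypothetical, while the label key and values are the ones EKS managed node groups apply, and `requiredDuringSchedulingIgnoredDuringExecution` is the Kubernetes field name for a hard node affinity rule.

```python
import json

# Label that EKS managed node groups add to every node.
CAPACITY_LABEL = "eks.amazonaws.com/capacityType"


def pod_template(capacity_type):
    """Build a minimal pod template that pins a pod to one capacity type."""
    if capacity_type not in ("SPOT", "ON_DEMAND"):
        raise ValueError(f"unknown capacity type: {capacity_type}")
    match = {"key": CAPACITY_LABEL, "operator": "In", "values": [capacity_type]}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "spec": {
            "affinity": {
                "nodeAffinity": {
                    # Hard requirement: the pod stays pending if no node matches.
                    "requiredDuringSchedulingIgnoredDuringExecution": {
                        "nodeSelectorTerms": [{"matchExpressions": [match]}]
                    }
                }
            }
        },
    }


driver_template = pod_template("ON_DEMAND")   # driver must survive interruptions
executor_template = pod_template("SPOT")      # executors can be replaced

# JSON is valid YAML, so this output can be saved as a pod template file.
print(json.dumps(executor_template, indent=2))
```

Because the rule is a hard requirement, a driver pod waits in the pending state rather than landing on a Spot node, which is what lets the Cluster Autoscaler scale the On-Demand node group for it.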

Launching a Spark application

Create a service account. The Spark driver pod uses the service account to create and watch executor pods using the Kubernetes API server.

kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole='edit'  --serviceaccount=default:spark --namespace=default

Download the Cluster Autoscaler and edit it to add the cluster-name. 

curl -LO https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml

Install the Cluster AutoScaler using the following command:

kubectl apply -f cluster-autoscaler-autodiscover.yaml

Get the details of Kubernetes master to get the head URL.

kubectl cluster-info 

command output

Use the following instructions to build the docker image.

Download the application file (script.py) from here and upload into the Amazon S3 bucket created.

Download the pod template files from here. Submit the application.

bin/spark-submit \
--master k8s://<<MASTER URL>> \
--deploy-mode cluster \
--name 'Job Name' \
--conf spark.eventLog.dir=s3a://<<S3 BUCKET>>/logs \
--conf spark.eventLog.enabled=true \
--conf spark.history.fs.inProgressOptimization.enabled=true \
--conf spark.history.fs.update.interval=5s \
--conf spark.kubernetes.container.image=<<ECR Spark Docker Image>> \
--conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
--conf spark.kubernetes.driver.podTemplateFile='../driver_pod_template.yml' \
--conf spark.kubernetes.executor.podTemplateFile='../executor_pod_template.yml' \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.shuffleTracking.enabled=true \
--conf spark.dynamicAllocation.maxExecutors=100 \
--conf spark.dynamicAllocation.executorAllocationRatio=0.33 \
--conf spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=30 \
--conf spark.dynamicAllocation.executorIdleTimeout=60s \
--conf spark.driver.memory=8g \
--conf spark.kubernetes.driver.request.cores=2 \
--conf spark.kubernetes.driver.limit.cores=4 \
--conf spark.executor.memory=8g \
--conf spark.kubernetes.executor.request.cores=2 \
--conf spark.kubernetes.executor.limit.cores=4 \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.connection.ssl.enabled=false \
--conf spark.hadoop.fs.s3a.fast.upload=true \
s3a://<<S3 BUCKET>>/script.py \
s3a://<<S3 BUCKET>>/output 

A couple of key points to note here

  • podTemplateFile is used here, which enables scheduling of the driver pods to On-Demand Instances and executor pods to Spot Instances.
  • Spark provides a mechanism to allocate resources dynamically based on workloads. As of Spark 3.0.0, dynamicAllocation can be used with the Kubernetes cluster manager. Executors that do not hold active shuffle files can be removed to free up resources. DynamicAllocation works well in tandem with Cluster Autoscaler for resource allocation and optimizes resource usage for jobs. We are using dynamicAllocation here to enable optimized resource sharing.
  • The application file and output are both in Amazon S3.
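
As a rough illustration of how the dynamicAllocation settings in the command interact, the following Python sketch estimates the number of executors requested for a given task backlog. It is a simplified model based on the documented meaning of executorAllocationRatio and maxExecutors, not Spark's scheduler code: the `target_executors` function is hypothetical, and one task slot per executor is assumed because spark.executor.cores is not set in the command.

```python
import math


def target_executors(pending_or_running_tasks, allocation_ratio=0.33,
                     tasks_per_executor=1, min_executors=0, max_executors=100):
    """Simplified estimate of the executors dynamic allocation asks for.

    allocation_ratio and max_executors default to the values used in the
    spark-submit command; tasks_per_executor=1 is an assumption.
    """
    needed = math.ceil(
        pending_or_running_tasks * allocation_ratio / tasks_per_executor
    )
    # The request is always clamped to the configured executor bounds.
    return max(min_executors, min(needed, max_executors))


# A backlog of 10 tasks asks for 4 executors instead of 10 at full parallelism.
print(target_executors(10))
# A very large backlog is capped by maxExecutors.
print(target_executors(1000))
```

A ratio below 1 trades some parallelism for fewer executor pods, which in turn means fewer pending pods for the Cluster Autoscaler to react to.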

Output Files in S3

  • Spark Event logs are redirected to Amazon S3. Spark on Kubernetes creates local temporary files for logs and removes them once the application completes. The logs are redirected to Amazon S3 and Spark History Server can be used to analyze the logs. Note, you can create more instrumentation using tools like Prometheus and Grafana to monitor and manage the cluster.

Spark History Server + Dynamic Allocation

Observations

EC2 Spot Interruptions

The following diagram and log screenshots from the Spark History Server show the behavior of a Spark application in the case of an EC2 Spot interruption.

Four Spark applications were launched in parallel in a cluster, and one of the Spot nodes was interrupted. A couple of executor pods were shut down in three of the four applications, but due to the resilient nature of Spark, new executors were launched and the applications finished at almost the same time.

Spark jobs

The Spark driver identified the executors that were shut down, which had held shuffle files, and relaunched the tasks that were running on those executors.

Dynamic Allocation

Dynamic Allocation works with the caveat that it is an experimental feature.

dynamic allocation

Cost Optimization

Cost optimization is achieved in several ways in this tutorial.

  • Use of 100% Spot Instances for the Spark executors.
  • Use of dynamicAllocation along with Cluster Autoscaler makes optimized use of resources and hence saves cost.
  • Starting with one driver node and one executor node and then scaling up on demand reduces the waste of a continuously running cluster.

Cluster Autoscaling

Cluster Autoscaling is triggered, as designed, when there are pending (Spark executor) pods.

The Cluster Autoscaler logs can be fetched by:

kubectl logs -f deployment/cluster-autoscaler -n kube-system --tail=10

Cluster Autoscaler Logs 

Cleanup

If you are trying out the tutorial, run the following steps to make sure that you don’t encounter unwanted costs.

Delete the EKS cluster and the nodegroups with the following command:

eksctl delete cluster --name sparkonk8

Delete the Amazon S3 Access Policy with the following command:

aws iam delete-policy --policy-arn <<POLICY ARN>>

Delete the Amazon S3 Output Bucket with the following command:

aws s3 rb --force s3://<<S3_BUCKET>>

Conclusion

In this blog, I demonstrated how you can run Spark workloads on a Kubernetes cluster using Spot Instances, achieving scalability, resilience, and cost optimization. To cost optimize your Spark-based big data workloads, consider running Spark applications using Kubernetes and EC2 Spot Instances.

 

 

 
