Data monetization and customer experience optimization using telco data assets: Part 1

Post Syndicated from Vikas Omer original https://aws.amazon.com/blogs/big-data/part-1-data-monetization-and-customer-experience-optimization-using-telco-data-assets/

The landscape of the telecommunications industry is changing rapidly. For telecom service providers (TSPs), revenue from core voice and data services continues to shrink due to regulatory pressure and emerging OTT players that offer an attractive alternative. Despite increasing demand from customers for bandwidth, speed, and efficiency, TSPs are finding that the ROI from implementing new access technologies like 5G is insubstantial.

To overcome the risk of being relegated to a utility or dumb pipe, TSPs today are looking to diversify, adopting alternative business models to generate new revenue streams.

In recent times, adopting customer experience (CX) and data monetization initiatives has been a key theme across all industries. Although many Tier-1 TSPs are leading this transformation by using new technologies to improve CX and profitability, many TSPs have yet to embark on this challenging but rewarding journey.

Building and implementing a CX management and data monetization strategy

Data monetization is often misunderstood as simply selling data for dollars, but what it really means is driving revenue by growing the top line or the bottom line through the use of data assets. The benefits can be tangible or intangible, and the monetization itself can be internal or external.

According to Gartner, most data and analytics leaders are looking to increase investments in business intelligence (BI) and analytics (see the following study results).

The preceding visualization is from “The 2019 CIO Agenda: Securing a New Foundation for Digital Business”, published October 15, 2018.

Although external monetization opportunities are limited due to strict regulations, a plethora of opportunities exist for TSPs to monetize data both internally (regulated, but far less so than external use) and externally via a marketplace (highly regulated). If TSPs can shift their mindset from selling data to using data insights for monetization and improving CX, they can adopt a significant number of use cases to realize an immediate positive impact.

Tapping and utilizing insights around customer behavior acts like a Swiss Army Knife for businesses. You can use these insights to drive CX, hyper-personalization and localization, micro-segmentation, subscriber retention, loyalty and rewards programs, network planning and optimization, internal and external data monetization, and more. The following are some use cases that can be driven using CX and data monetization strategies:

  • Segmentation/micro-segmentation (cross-sell, up-sell, targeted advertising, enhanced market locator); for example:
    • Identify targets for consuming baby products or up-selling a kids-related TV channel
    • Identify women ages 18-35 to target for high-end beauty products or apparel

You can build hundreds of such segments.

  • Personalized loyalty and reward programs (incentivize customers with what they like). For example, movie tickets or discounts for a movie lover, or food coupons and deals for a food lover.
  • CX-driven network optimization (allocate more resources to streaming hotspots with high-value customers).
  • Identifying potential partners for joint promotions. For example, bundling device offers with a music app subscription.
  • Hyper-personalization. For example, personalized recommendations for on-portal apps and websites.
  • Next best action and next best offer. For example, intelligent bundling and packaging of offerings.

Challenges with driving CX and data monetization

In this digital era, TSPs consider data analytics a strategic pillar in their quest to evolve into a true data-driven organization. Although many TSPs are harnessing the power of data to drive and improve CX, technological gaps and challenges remain when it comes to baselining and formulating internal and external data monetization strategies. Some of these challenges include:

  • Non-overlapping technology investments for CX and data monetization due to misaligned business and IT initiatives
  • Huge CAPEX requirements to process massive volumes of data
  • Inability to unearth hidden insights due to siloed data initiatives
  • Inability to marry various datasets together due to missing pieces around data standardization techniques
  • Lack of user-friendly tools and techniques to discover, ingest, process, correlate, analyze, and consume the data
  • Inability to experiment and innovate with agility and low cost

In this two-part series, I demonstrate a working solution with an AWS CloudFormation template for how a TSP can use existing data assets to generate new revenue streams and improve and personalize CX using AWS services. I also include key pieces of information around data standardization, baselining an analytics data model to marry different datasets in the data warehouse, self-service analytics, metadata search, and media dictionary framework.

In this post, you deploy the stack using a CloudFormation template and follow simple steps to transform, enrich, and bring multiple datasets together so that they can be correlated and queried.

In part 2, you learn how advanced business users can query enriched data and derive meaningful insights using Amazon Redshift and Amazon Redshift Spectrum or Amazon Athena, enable self-service analytics for business users, and publish ready-made dashboards via Amazon QuickSight.

Solution overview

The main ingredient of this solution is Packet Switch (PS) probe data embedded with a deep packet inspection (DPI) engine, which can reveal a lot of information about user interests and usage behavior. This data is transformed and enriched with DPI media and device dictionaries, along with other standard telco transformations, to deduce insights and to profile and micro-segment subscribers. Enriched data is made available along with other transformed dimensional attributes (CRM, subscriptions, media, carrier, device, and network configuration management) for rich slicing and dicing.

For example, the following QuickSight visualizations depict a use case to identify music lovers ages 18-55 with Apple devices. You can also generate micro-segments by capturing the top X subscribers by consumption or adding KPIs like recency and frequency.

The following diagram illustrates the workflow of the solution.

For this post, AWS CloudFormation sets up the required folder structure in Amazon Simple Storage Service (Amazon S3) and provides sample data and dictionary files. Most of the data included as part of the CloudFormation template is dummy data covering the following sources:

  • CRM
  • Subscription and subscription mapping
  • Network 3G & 4G configuration management
  • Operator PLMN
  • DPI and device dictionary
  • PS probe data

Descriptions of all the input datasets and attributes are available with AWS Glue Data Catalog tables and as part of Amazon Redshift metadata for all tables in Amazon Redshift.

The workflow for this post includes the following steps:

  1. Catalog all the files in the AWS Glue Data Catalog using the following AWS Glue data crawlers:
    1. DPI data crawler (to crawl incoming PS probe DPI data)
    2. Dimension data crawler (to crawl all dimension data)
  2. Update attribute descriptions in the Data Catalog (this step is optional).
  3. Create Amazon Redshift schema, tables, procedures, and metadata using an AWS Lambda function.
  4. Process each data source file using separate AWS Glue Spark jobs. These jobs enrich, transform, and apply business filtering rules before ingesting data into an Amazon Redshift cluster.
  5. Trigger Amazon Redshift hourly and daily aggregation procedures using Lambda functions to aggregate data from the raw table into hourly and daily tables.

Part 2 includes the following steps:

  1. Catalog the processed raw, aggregate, and dimension data in the Data Catalog using the DPI processed data crawler.
  2. Interactively query data directly from Amazon S3 using Amazon Athena.
  3. Enable self-service analytics using QuickSight to prepare and publish insights based on data residing in the Amazon Redshift cluster.

The workflow can change depending on the complexity of the environment and your use case, but the fundamental idea remains the same. For example, your use case could be processing PS probe DPI data in real time rather than in batch mode, keeping hot data in Amazon Redshift, storing cold and historical data on Amazon S3, or archiving data in Amazon S3 Glacier for regulatory compliance. Amazon S3 offers several storage classes designed for different use cases. You can move the data among these different classes based on Amazon S3 lifecycle properties. For more information, see Amazon S3 Storage Classes.
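
If you want to manage those storage class transitions in code, the following is a minimal boto3 sketch; the bucket name, prefix, and day thresholds are assumptions you would replace with your own retention policy.

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket name, prefix, and day thresholds -- replace with your own retention policy.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-telco-dpi-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-processed-dpi-data",
                "Filter": {"Prefix": "dpi/processed/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},   # warm data
                    {"Days": 90, "StorageClass": "GLACIER"},       # cold/historical data
                ],
                "Expiration": {"Days": 730},                       # drop after two years
            }
        ]
    },
)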

Prerequisites

For this walkthrough, you should have the following prerequisites:

For more information about AWS Regions and where AWS services are available, see Region Table.

Creating your resources with AWS CloudFormation

To get started, create your resources with the following CloudFormation stack.

  1. Click the Launch Stack button below:
  2. Leave the parameters at their default, with the following exceptions:
    1. Enter values for the RedshiftPassword and S3BucketNameParameter parameters, which aren’t populated by default.
    2. An Amazon S3 bucket name is globally unique, so enter a unique bucket name for S3BucketNameParameter.

The following screenshot shows the parameters for our use case.

  1. Choose Next.
  2. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  3. Choose Create stack.

It takes approximately 10 minutes to deploy the stack. For more information about the key resources deployed through the stack, see Data Monetization and Customer Experience (CX) Optimization using telco data assets: AWS CloudFormation stack details. You can view all the resources on the AWS CloudFormation console. For instructions, see Viewing AWS CloudFormation stack data and resources on the AWS Management Console.

The CloudFormation stack we provide in this post serves as a baseline and is not a production-grade solution.

Building a Data Catalog using AWS Glue

You start by discovering sample data stored on Amazon S3 through an AWS Glue crawler. For more information, see Populating the AWS Glue Data Catalog. To catalog data, complete the following steps:

  1. On the AWS Glue console, in the navigation pane, choose Crawlers.
  2. Select DPIRawDataCrawler and choose Run crawler.
  3. Select DimensionDataCrawler and choose Run crawler.
  4. Wait for the crawlers to show the status Stopping.

The tables added against the DimensionDataCrawler and DPIRawDataCrawler crawlers should show 9 and 1, respectively.
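
If you prefer to script this step rather than click through the console, the crawlers can also be started and polled with boto3. This is a minimal sketch and assumes the crawler names are exactly the ones shown on the console:

import time
import boto3

glue = boto3.client("glue")
crawlers = ["DPIRawDataCrawler", "DimensionDataCrawler"]

# Kick off both crawlers created by the CloudFormation stack.
for name in crawlers:
    glue.start_crawler(Name=name)

# Poll until both crawlers have finished and returned to the READY state.
while True:
    states = [glue.get_crawler(Name=name)["Crawler"]["State"] for name in crawlers]
    if all(state == "READY" for state in states):
        break
    time.sleep(30)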

  1. In the navigation pane, choose Tables.
  2. Verify the following 10 tables are created under the cemdm database:
    • d_crm_demographics
    • d_device
    • d_dpi_dictionary
    • d_network_cm_3g
    • d_network_cm_4g
    • d_operator_plmn
    • d_tac
    • d_tariff_plan
    • d_tariff_plan_desc
    • raw_dpi_incoming

Updating attribute descriptions in the Data Catalog

The AWS Glue Data Catalog has a comment field to store the metadata under each table in the AWS Glue database. Anybody who has access to this database can easily understand attributes coming from different data sources through metadata provided in the comment field. The CloudFormation stack includes a CSV file that contains a description of all the attributes from the source files. This file is used to update the comment field for all the Data Catalog tables this stack deployed. This step is not mandatory to proceed with the workflow. However, if you want to update the comment field against each table, complete the following steps:

  1. On the Lambda console, in the navigation pane, choose Functions.
  2. Choose the GlueCatalogUpdate function.
  3. Configure a test event by choosing Configure test events.
  4. For Event name, enter Test.
  5. Choose Create.
  6. Choose Test.

You should see a message that the test succeeded, which implies that the Data Catalog attribute description is complete.

Attributes of the table under the Data Catalog database should now have descriptions in the Comment column. For example, the following screenshot shows the d_operator_plmn table.
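
For reference, the core of such a catalog update can be reproduced with boto3. The sketch below is not the actual GlueCatalogUpdate function; it assumes a hypothetical CSV layout of table_name, column_name, description and simply copies each description into the matching column's Comment field:

import csv
import boto3

glue = boto3.client("glue")
DATABASE = "cemdm"

# Assumed CSV layout: table_name, column_name, description
descriptions = {}
with open("attribute_descriptions.csv") as f:
    for table_name, column_name, text in csv.reader(f):
        descriptions[(table_name, column_name)] = text

# Only these keys from get_table are valid again as TableInput for update_table.
ALLOWED = {"Name", "Description", "Owner", "Retention", "StorageDescriptor",
           "PartitionKeys", "TableType", "Parameters"}

for table in glue.get_tables(DatabaseName=DATABASE)["TableList"]:
    for column in table["StorageDescriptor"]["Columns"]:
        comment = descriptions.get((table["Name"], column["Name"]))
        if comment:
            column["Comment"] = comment
    table_input = {k: v for k, v in table.items() if k in ALLOWED}
    glue.update_table(DatabaseName=DATABASE, TableInput=table_input)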

Creating Amazon Redshift schema, tables, procedures, and metadata

To create schema, tables, procedures, and metadata in Amazon Redshift, complete the following steps:

  1. On the Lambda console, in the navigation pane, choose Functions.
  2. Choose the RedshiftDDLCreation function.
  3. Choose Configure test events.
  4. For Event name, enter Test.
  5. Choose Create.
  6. Choose Test.

You should see a message that the test succeeded, which means that the schema, table, procedures, and metadata generation is complete.

Running AWS Glue ETL jobs

AWS Glue provides the serverless, scalable, and distributed processing capability to transform and enrich your datasets. To run AWS Glue extract, transform, and load (ETL) jobs, complete the following steps:

  1. On the AWS Glue console, in the navigation pane, choose Jobs.
  2. Select the following jobs (one at a time) and choose Run job from the Action menu:
    • d_customer_demographics
    • d_device
    • d_dpi_dictionary
    • d_location
    • d_operator_plmn
    • d_tac
    • d_tariff_plan
    • d_tariff_plan_desc
    • f_dpi_enrichment

You can run all these jobs in parallel.

All dimension data jobs should finish successfully within 3 minutes, and the fact data enrichment job should finish within 5 minutes.

  1. Verify the jobs are complete by selecting each job and checking Run status on the History tab.
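
To give a feel for what these jobs look like, here is a minimal PySpark sketch in the style of an AWS Glue job; it is not the code deployed by the stack. The join key, the Glue connection name, the target table, and the staging location are all assumptions:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.transforms import Join
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the cataloged raw DPI data and the DPI dictionary.
raw_dpi = glue_context.create_dynamic_frame.from_catalog(
    database="cemdm", table_name="raw_dpi_incoming")
dpi_dict = glue_context.create_dynamic_frame.from_catalog(
    database="cemdm", table_name="d_dpi_dictionary")

# Enrich the raw records with media category information from the dictionary.
# "protocol" is an assumed join key, not necessarily the one the real job uses.
enriched = Join.apply(raw_dpi, dpi_dict, "protocol", "protocol")

# Load the enriched data into Amazon Redshift through a Glue connection.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=enriched,
    catalog_connection="redshift-connection",                  # assumed connection name
    connection_options={"dbtable": "dpi.f_dpi_enriched", "database": "cemdm"},
    redshift_tmp_dir="s3://my-telco-dpi-bucket/glue-temp/")    # assumed staging location

job.commit()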

Aggregating hourly and daily DPI data in Amazon Redshift

To aggregate hourly and daily sample data in Amazon Redshift using Lambda functions, complete the following steps:

  1. On the Lambda console, in the navigation pane, choose Functions.
  2. Choose the RedshiftDPIHourlyAgg function.
  3. Choose Configure test events.
  4. For Event name, enter Test.
  5. Choose Create.
  6. Choose Test.

You should see a message that the test succeeded, which means that hourly aggregation is complete.

  1. In the navigation pane, choose Functions.
  2. Choose the RedshiftDPIDailyAgg function.
  3. Choose Configure test events.
  4. For Event name, enter Test.
  5. Choose Create.
  6. Choose Test.

You should see a message that the test succeeded, which means that daily aggregation is complete.

Both the hourly and daily Lambda functions are hardcoded with the date and hour so that they aggregate the provided sample data. To make them generic, uncomment the few marked lines of code and comment out their hardcoded counterparts. Both functions are also equipped with offset parameters to decide how far back in time you want to do the aggregations. However, this isn’t required for this walkthrough.
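
For illustration, a generic version of such an aggregation function might look like the following sketch, which uses the Amazon Redshift Data API to call a stored procedure with an hour offset. The cluster identifier, database, user, and procedure names are hypothetical; the deployed functions use their own names and connection method:

import datetime
import boto3

redshift_data = boto3.client("redshift-data")

def lambda_handler(event, context):
    # How many hours back to aggregate; defaults to the previous hour.
    offset = int(event.get("offset", 1))
    target = datetime.datetime.utcnow() - datetime.timedelta(hours=offset)

    # Cluster, database, user, and procedure names below are hypothetical.
    response = redshift_data.execute_statement(
        ClusterIdentifier="cemdm-redshift-cluster",
        Database="cemdm",
        DbUser="awsuser",
        Sql="CALL dpi.sp_hourly_agg('{}');".format(target.strftime("%Y-%m-%d %H:00:00")),
    )
    return {"statement_id": response["Id"]}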

You can schedule these functions with CloudWatch. However, this is not required for this walkthrough.

So far, we have completed the following:

  1. Deployed the CloudFormation stack.
  2. Cataloged sample raw data by running DimensionDataCrawler and DPIRawDataCrawler AWS Glue crawlers.
  3. Updated attribute descriptions in the AWS Glue Data Catalog by running the GlueCatalogUpdate Lambda function.
  4. Created Amazon Redshift schema, tables, stored procedures, and metadata through the RedshiftDDLCreation Lambda function.
  5. Ran all AWS Glue ETL jobs to transform raw data and load it into their respective Amazon Redshift tables.
  6. Aggregated hourly and daily data from enriched raw data into hourly and daily Amazon Redshift tables by running the RedshiftDPIHourlyAgg and RedshiftDPIDailyAgg Lambda functions.

Cleaning up

If you don’t plan to proceed to the part 2 of this series, and want to avoid incurring future charges, delete the resources you created by deleting the CloudFormation stack.

Conclusion

In this post, I demonstrated how you can easily transform, enrich, and bring multiple telco datasets together in an Amazon Redshift data warehouse cluster. You can correlate these datasets to produce multi-dimensional insights from several angles, like subscriber, network, device, subscription, roaming, and more.

In part 2 of this series, I demonstrate how you can enable data analysts, scientists, and advanced business users to query data from Amazon Redshift or Amazon S3 directly.

As always, AWS welcomes feedback. This is a wide space to explore, so reach out to us if you want a deep dive into building this solution and more on AWS. Please submit comments or questions in the comments section.


About the Author

Vikas Omer is an analytics specialist solutions architect at Amazon Web Services. Vikas has a strong background in analytics, customer experience management (CEM), and data monetization, with over 11 years of experience in the telecommunications industry globally. With six AWS Certifications, including Analytics Specialty, he is a trusted analytics advocate to AWS customers and partners. He loves traveling, meeting customers, and helping them become successful in what they do.

Security updates for Tuesday

Post Syndicated from original https://lwn.net/Articles/844054/rss

Security updates have been issued by CentOS (dnsmasq, net-snmp, and xstream), Debian (mutt), Gentoo (cfitsio, f2fs-tools, freeradius, libvirt, mutt, ncurses, openjpeg, PEAR-Archive_Tar, and qtwebengine), openSUSE (chromium, mutt, stunnel, and virtualbox), Red Hat (cryptsetup, gnome-settings-daemon, and net-snmp), Scientific Linux (xstream), SUSE (postgresql, postgresql12, postgresql13 and rubygem-nokogiri), and Ubuntu (mutt).

Backblaze Hard Drive Stats for 2020

Post Syndicated from original https://www.backblaze.com/blog/backblaze-hard-drive-stats-for-2020/

In 2020, Backblaze added 39,792 hard drives and as of December 31, 2020 we had 165,530 drives under management. Of that number, there were 3,000 boot drives and 162,530 data drives. We will discuss the boot drives later in this report, but first we’ll focus on the hard drive failure rates for the data drive models in operation in our data centers as of the end of December. In addition, we’ll welcome back Western Digital to the farm and get a look at our nascent 16TB and 18TB drives. Along the way, we’ll share observations and insights on the data presented and as always, we look forward to you doing the same in the comments.

2020 Hard Drive Failure Rates

At the end of 2020, Backblaze was monitoring 162,530 hard drives used to store data. For our evaluation, we remove from consideration 231 drives which were used for testing purposes and those drive models for which we did not have at least 60 drives. This leaves us with 162,299 hard drives in 2020, as listed below.

Observations

The 231 drives not included in the list above were either used for testing or did not have at least 60 drives of the same model at any time during the year. The data for all drives, data drives, boot drives, etc., is available for download on the Hard Drive Test Data webpage.

For drives which have less than 250,000 drive days, any conclusions about drive failure rates are not justified. There is not enough data over the year-long period to reach any conclusions. We present the models with less than 250,000 drive days for completeness only.

For drive models with over 250,000 drive days over the course of 2020, the Seagate 6TB drive (model: ST6000DX000) leads the way with a 0.23% annualized failure rate (AFR). This model was also the oldest, in average age, of all the drives listed. The 6TB Seagate model was followed closely by the perennial contenders from HGST: the 4TB drive (model: HMS5C4040ALE640) at 0.27%, the 4TB drive (model: HMS5C4040BLE640), at 0.27%, the 8TB drive (model: HUH728080ALE600) at 0.29%, and the 12TB drive (model: HUH721212ALE600) at 0.31%.

The AFR for 2020 for all drive models was 0.93%, which was less than half the AFR for 2019. We’ll discuss that later in this report.

What’s New for 2020

We had a goal at the beginning of 2020 to diversify the number of drive models we qualified for use in our data centers. To that end, we qualified nine new drive models during the year, as shown below.

Actually, there were two additional hard drive models which were new to our farm in 2020: the 16TB Seagate drive (model: ST16000NM005G) with 26 drives, and the 16TB Toshiba drive (model: MG08ACA16TA) with 40 drives. Each fell below our 60-drive threshold and was not listed.

Drive Diversity

The goal of qualifying additional drive models proved to be prophetic in 2020, as the effects of Covid-19 began to creep into the world economy in March 2020. By that time we were well on our way towards our goal, and while less of a creative solution than drive farming, drive model diversification was one of the tactics we used to manage our supply chain through the manufacturing and shipping delays prevalent in the first several months of the pandemic.

Western Digital Returns

The last time a Western Digital (WDC) drive model was listed in our report was Q2 2019. There are still three 6TB WDC drives in service and 261 WDC boot drives, but neither are listed in our reports, so no WDC drives—until now. In Q4 a total of 6,002 of these 14TB drives (model: WUH721414ALE6L4) were installed and were operational as of December 31st.

These drives obviously share their lineage with the HGST drives, but they report their manufacturer as WDC versus HGST. The model numbers are similar with the first three characters changing from HUH to WUH and the last three characters changing from 604, for example, to 6L4. We don’t know the significance of that change, perhaps it is the factory location, a firmware version, or some other designation. If you know, let everyone know in the comments. As with all of the major drive manufacturers, the model number carries patterned information relating to each drive model and is not randomly generated, so the 6L4 string would appear to mean something useful.

WDC is back with a splash, as the AFR for this drive model is just 0.16%—that’s with 6,002 drives installed, but only for 1.7 months on average. Still, with only one failure during that time, they are off to a great start. We are looking forward to seeing how they perform over the coming months.

New Models From Seagate

There are six Seagate drive models that were new to our farm in 2020. Five of these models are listed in the table above and one model had only 26 drives, so it was not listed. These drives ranged in size from 12TB to 18TB and were used for both migration replacements as well as new storage. As a group, they totaled 13,596 drives and amassed 1,783,166 drive days with just 46 failures for an AFR of 0.94%.

Toshiba Delivers More Zeros

The new Toshiba 14TB drive (model: MG07ACA14TA) and the new Toshiba 16TB drive (model: MG08ACA16TEY) were introduced to our data centers in 2020 and they are putting up zeros, as in zero failures. While each drive model has only been installed for about two months, they are off to a great start.

Comparing Hard Drive Stats for 2018, 2019, and 2020

The chart below compares the AFR for each of the last three years. The data for each year is inclusive of that year only and for the drive models present at the end of each year.

The Annualized Failure Rate for 2020 Is Way Down

The AFR for 2020 dropped below 1% down to 0.93%. In 2019, it stood at 1.89%. That’s over a 50% drop year over year. So why was the 2020 AFR so low? The answer: It was a group effort. To start, the older drives: 4TB, 6TB, 8TB, and 10TB drives as a group were significantly better in 2020, decreasing from a 1.35% AFR in 2019 to a 0.96% AFR in 2020. At the other end of the size spectrum, we added over 30,000 larger drives: 14TB, 16TB, and 18TB, which as a group recorded an AFR of 0.89% for 2020. Finally, the 12TB drives as a group had a 2020 AFR of 0.98%. In other words, whether a drive was old or new, or big or small, they performed well in our environment in 2020.

Lifetime Hard Drive Stats

The chart below shows the lifetime annualized failure rates of all of the drive models in production as of December 31, 2020.

AFR and Confidence Intervals

Confidence intervals give you a sense of the usefulness of the corresponding AFR value. A narrow confidence interval range is better than a wider range, with a very wide range meaning the corresponding AFR value is not statistically useful. For example, the confidence interval for the 18TB Seagate drives (model: ST18000NM000J) ranges from 1.5% to 45.8%. This is very wide and one should conclude that the corresponding 12.54% AFR is not a true measure of the failure rate of this drive model. More data is needed. On the other hand, when we look at the 14TB Toshiba drive (model: MG07ACA14TA), the range is from 0.7% to 1.1% which is fairly narrow, and our confidence in the 0.9% AFR is much more reasonable.
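
As a reminder, AFR is simply failures divided by drive days, annualized. Backblaze doesn't describe the exact interval calculation here, so the sketch below assumes a Poisson failure model, one common way to derive such intervals:

from scipy.stats import chi2

def afr(failures, drive_days):
    """Annualized failure rate, as a percentage."""
    return failures / drive_days * 365 * 100

def afr_confidence_interval(failures, drive_days, confidence=0.95):
    """Exact Poisson confidence interval for the AFR, in percent (assumed model)."""
    alpha = 1 - confidence
    lower = chi2.ppf(alpha / 2, 2 * failures) / 2 if failures > 0 else 0.0
    upper = chi2.ppf(1 - alpha / 2, 2 * (failures + 1)) / 2
    return (lower / drive_days * 365 * 100,
            upper / drive_days * 365 * 100)

# Using the 2020 totals quoted above for the new Seagate models:
# 46 failures over 1,783,166 drive days.
print(afr(46, 1_783_166))                      # ~0.94% AFR
print(afr_confidence_interval(46, 1_783_166))  # roughly (0.69%, 1.26%)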

3,000 Boot Drives

We always exclude boot drives from our reports as their function is very different from a data drive. While it may not seem obvious, having 3,000 boot drives is a bit of a milestone. It means we have 3,000 Backblaze Storage Pods in operation as of December 31st. All of these Storage Pods are organized into Backblaze Vaults of 20 Storage Pods each, for a total of 150 Backblaze Vaults.

Over the last year or so, we moved from using hard drives to SSDs as boot drives. We have a little over 1,200 SSDs acting as boot drives today. We are validating the SMART and failure data we are collecting on these SSD boot drives. We’ll keep you posted if we have anything worth publishing.

Are you interested in learning more about the trends in the 2020 drive stats? Join our upcoming webinar: “Backblaze Hard Drive Report: 2020 Year in Review Q&A” with drive stats author, Andy Klein, on February 3.

The Hard Drive Stats Data

The complete data set used to create the information used in this review is available on our Hard Drive Test Data page. You can download and use this data for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone; it is free.

If you just want the summarized data used to create the tables and charts in this blog post you can download the ZIP file containing the CSV files for each chart.

Good luck and let us know if you find anything interesting.

The post Backblaze Hard Drive Stats for 2020 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

2021-01-26 working environment

Post Syndicated from original https://vasil.ludost.net/blog/?p=3446

I've written before about my working environment, and since at our company I'm the one who onboards the new people for the operations team, I have to explain how one can work more efficiently (we even did an internal training where we showed each other how we work). Since I never wrote my explanations down, I'll write them out once, with as much detail as I can; it might be useful to someone.

In general, things are set up the same way for me whether I'm working from home or at the office, although it's a bit awkward logistically (and I may have to buy one more square monitor at some point).

Hardware

I'll start with the hardware side.

For a chair, I've been using a Markus from IKEA for about ten years, and at my last two companies I convinced people that this is the comfortable chair. It lasts a long time, it's comfortable, and it's not insanely expensive (it's not cheap either, about 300 BGN last time). I use it without armrests (that way it can be pushed further under the desk when I want).

I try to have a wide enough desk, at least a meter and a half. With the two monitors and the two laptops (more on those below) you do need the space, and I also have a habit of piling things on it; there are known cases of me taking over neighboring desks for some part of the equipment I work with. In principle it's good for the desk to have a thick, sturdy top so it doesn't wobble much (which sometimes happens with my typing), but my sturdier desk is currently used by Elena (and I couldn't mount the stands on it anyway), so I've devised some extra bracing for the monitor.

I got the idea of monitor stands from StorPool – I use two Arctic Z1 Basic stands. They let me arrange the monitors exactly how I want, and they free up space on the desk, which can then be taken by other things (e.g., power strips).

A relatively recent photo (with the wrong keyboard)

For many years I had a normal desktop computer, then I moved to a laptop, and at the moment I've settled on the “compromise” of having a laptop that I use in practice as a workstation (I haven't opened it in three months; it just sits there and creaks). Right now it's a T70 motherboard (made by some Chinese enthusiasts at 51nb.com) in a T60p laptop case, which took a day of angle-grinder surgery to fit. I got it for the 4:3 screen and the ability to have a reasonably fast laptop (I've already put 32GB of RAM in it, because 16 wasn't enough) with a tolerable keyboard – after the Tx20 series Lenovo started putting in even nastier keyboards, and staying on a T420 was getting slow and painful (and the laptops were running out).
If I go to buy a new one, I probably won't pay attention to these details and will mostly look for good battery life and a screen that isn't too small (and for it not to break easily – there are kids running around, after all).
Along with that I also switched mostly to wired networking, since there's no point stressing my wifi for something I don't carry around.

Around the pandemic and working from home, I took Elena's old laptop and use it for all video/audio conferences. That way I can keep doing something while seeing who I'm talking to, and I don't load my work laptop with nonsense.

For the calls themselves I have two audio devices – a Jabra Speak for most of the time and a Jabra wireless headset for when I need to be quieter (or to be in the next room while the call is going on 🙂 ). The Jabra Speak in particular is a really great device, with very good sound and microphone quality and almost unbelievable echo cancellation; in the end I've bought at least 5-6 of them and given them to various people who needed one. Highly recommended – it's convenient, it sounds good, and it doesn't torment your ears.

Monitors are a very interesting topic. As I said above, I picked my laptop specifically for the 4:3 screen, because 16:9 and 16:10 don't give me enough vertical space, and eventually a few friends got fed up with me and bought me a square monitor (an Eizo EV2730Q). It took me a few days to get used to the idea, and since then I can't imagine how I lived without it – everything fits, I have room to lay out many windows, and when I decide to write a text like this one, I can open it full screen with an enlarged font and write away. By the looks of it, a monitor bigger than this would already be hard for me to use, but this one I can take in at a glance and comfortably watch 2-3 logs scroll while I work on some system.
For the laptop I use for video conferences I have a 24″ monitor that's nothing special (I got it cheap from ebay); its purpose is just to let me see who I'm talking to.

The keyboard is an almost religious topic. After all sorts of keyboards, a period of using an IBM Model M, and then many years on Lenovo T-series laptop keyboards, when I moved to the square monitor I took out the Model M again. At the office, however, people revolted against me (“we can't hear our own thoughts while you type”), and as a compromise I got a Das Keyboard with brown switches, which isn't bad, but … it's not the same thing. With the pandemic I was able to go back to the Model M (because it can't be heard through the wall waking the kids), and for New Year's I even got one from Unicomp's new production run (pckeyboard.com is their site). Price-wise, with shipping from the USA, it came out to the same money as the Das Keyboard.
As far as feel goes, for me there is no better keyboard, and I even notice that I copy-paste less, because I make fewer mistakes while typing. The keyboard's only problem is the noise, but the new version is a bit more bassy and easier on the people around you (though if we go back to an office someday and I have to share a room with someone, they'll either have to be deaf or I'll have to change keyboards again).

A photo of the old and the new keyboards

As a pointing device I use a wireless Logitech trackball – I don't have to move my hand around the desk much, and at least for me it's comfortable to use with my thumb. I have to clean it from time to time (which doesn't happen much with modern optical mice), but I've been used to that since the old days, and the kids love the big ball…

Software

I've been on Debian for some 22-23 years and haven't found a reason to change it. I've had periods of using packages from Ubuntu, but overall I like them as a distribution – I use it for workstations, servers, and whatever else I need, and I still enjoy it a lot.

On top of that there's a standard X, and on top of it XFCE with Compiz as the window manager. The reason I use Compiz is that it's the fastest window manager I've seen, especially for the things I do (for example, switching workspaces doesn't flicker at all). It also offers a lot of ways to tune it (which I love, because I can fit it to my needs).
The main things I configure are:
– focus follows mouse, i.e., I don't have to click on a window for it to get focus so I can work in it – this speeds things up a lot and makes it possible to type into a window that isn't on top;
– 11 workspaces (reachable with ctrl-alt-1..-). At first there were 10, but they weren't enough (they still aren't, but ctrl-alt-= seems to be taken and is too close to ctrl-alt-backspace :) );
– I use the Desktop Cube plugin for the many workspaces and have set the speeds to maximum, so in practice I switch between two workspaces instantly.

For most of my communication I use Pidgin with a pile of plugins for the various messengers. I have one or two IRCs, two Jabbers, two Slacks, Skype, Telegram, and probably a few more things. Pidgin is one of those pieces of software for which I compile and debug things myself (I use it from a package now; at one time I compiled it myself). Having most of my communication in one place is quite convenient, because I don't have to switch between ten different things.
Since its support for some things isn't great, I also use their web versions – for example the company Slack, because that way the notifications come through properly, and from time to time the Telegram one (for example when someone sends me a contact).

I also use claws-mail as my mail client, because I have a lot of mail and can't stand web interfaces for email. I've used Evolution before and tried Thunderbird a bit, but they're slow and awkward for me, while Claws copes very well with tens or hundreds of thousands of messages, PGP, and so on.

I also use two browsers – one Chrome and one Firefox – Google Docs and similar things work much better in Chrome, while my electronic signature only works in Firefox, so I've split up what I use where. The two browsers are the reason for the 32 GB of RAM; they eat a lot (and often spin up the fans).

To keep track of what I'm doing I use gtimelog – a very convenient application for quickly noting down what you've done. That way I roughly know what I've done during the day, have some idea of when I should get up from the desk, and can say what I've been working on if someone asks (or if I'm wondering myself).

And the application I practically live in – xfce4-terminal 🙂 I've tried various ones, but this one is fast enough and supports the right kind of transparency. I've given it a large scrollback and set the Go Mono font (which has serifs, but for some reason is more comfortable for me than the rest). Something I use a lot and don't recommend to anyone is using the transparency to watch the terminal underneath, mostly to see whether anything is moving (which drives crazy most people who look at my screen).

As a text editor I use vim, to which I've done nothing special; I only have it do a git commit for certain directories every time it writes a file, which is very convenient for note files and the like.

Work process

All of this is tied together in my slightly odd way of working…

I use the workspaces to organize the different tasks:
– The first one is for the messenger, gtimelog, and a vim with notes;
– The second is for mail – claws-mail and some number of open emails that I still need to answer (if I don't keep them open, there's no way I'll remember them);
– the third through the eighth are for terminals. Roughly one workspace per task, sometimes two;
– the ninth is for Chrome, the tenth for Firefox;
– the last one is for things I hardly ever look at, for example the VPN or the music control at home.

The main principle I try to follow is that everything I need for a given task should be one key or key combination away. For example, two terminals on one workspace, switching between them when needed. If I need more, I put them on the neighboring workspace. For more unusual things I start tmux and split it in two (with my monitor, both horizontal and vertical splits work very well).

Glancing at the messenger when something lights up is an easy combination; so is checking my mail or jumping to the more important tasks (which I try to keep on the first few workspaces) – all directly with the left hand. If I have to go to the browsers it's a bit slower, but those usually require a context switch in my head anyway.

Something else that helps me a lot, since I spend a large part of my time writing to people, is switching Cyrillic/Latin with Caps Lock – a single key, in an easy spot. The shift+alt idea has always seemed awful to me, and it very often triggers a context menu somewhere. On top of that, I use kbdd so it remembers which language I was on in which window, so I don't have to switch constantly as I move between communication and the terminals.

Miscellaneous

There are various things I've stopped using because they got on my nerves. Probably the main one is GNOME and most of its components – the terminal is slow, the window manager too, and the developers' main occupation seemed to be removing the settings I used. After it once took me half a day to convince my cursor not to blink (or some similar nonsense), I moved to Xfce.

In the same way I ditched Evolution, because it ate gigantic amounts of memory and kept getting slower and slower. Claws-mail does the same job (and better) with several times fewer resources.

Firefox 85 released

Post Syndicated from original https://lwn.net/Articles/844015/rss

Version 85 of the Firefox browser has been released. The headline change appears to be the isolation of internal caches to defeat the use of “supercookies” to track users; see this blog entry for details. “In fact, there are many different caches trackers can abuse to build supercookies. Firefox 85 partitions all of the following caches by the top-level site being visited: HTTP cache, image cache, favicon cache, HSTS cache, OCSP cache, style sheet cache, font cache, DNS cache, HTTP Authentication cache, Alt-Svc cache, and TLS certificate cache.”

New book: Get Started with MicroPython on Raspberry Pi Pico

Post Syndicated from Phil King original https://www.raspberrypi.org/blog/new-book-get-started-with-micropython-on-raspberry-pi-pico/

So, you’ve got a brand new Raspberry Pi Pico and want to know how to get started with this tiny but powerful microcontroller? We’ve got just the book for you.

Get Started with Raspberry Pi Pico book

Beginner-friendly

In Get Started with MicroPython on Raspberry Pi Pico, you’ll learn how to use the beginner-friendly language MicroPython to write programs and connect hardware to make your Raspberry Pi Pico interact with the world around it. Using these skills, you can create your own electro-mechanical projects, whether for fun or to make your life easier.

Inside the pages of the Raspberry Pi Pico book

After taking you on a guided tour of Pico, the book shows you how to get it up and running with a step-by-step illustrated guide to soldering pin headers to the board and installing the MicroPython firmware via a computer.

Programming basics

Inside the pages of the Raspberry Pi Pico book 02

Next, we take you through the basics of programming in MicroPython, a Python-based programming language developed specifically for microcontrollers such as Pico. From there, we explore the wonderful world of physical computing and connect a variety of electronic components to Pico using a breadboard. Controlling LEDs and reading input from push buttons, you’ll start by creating a pedestrian crossing simulation, before moving on to projects such as a reaction game, burglar alarm, temperature gauge, and data logger.

Inside the pages of the Raspberry Pi Pico book
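
To give a flavour of what those early projects look like in code (this snippet is not from the book), here is a minimal MicroPython sketch that toggles an LED when a push button is pressed; it assumes an external LED on GP15 and a button wired between GP14 and 3V3:

from machine import Pin
import utime

led = Pin(15, Pin.OUT)                    # external LED wired to GP15 via a resistor
button = Pin(14, Pin.IN, Pin.PULL_DOWN)   # push button between GP14 and 3V3

while True:
    if button.value():                    # pressing the button pulls the input high
        led.toggle()
        utime.sleep(0.3)                  # crude debounce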

Raspberry Pi Pico also supports the I2C and SPI protocols for communicating with devices, which we explore by connecting it up to an LCD display. You can even use MicroPython to take advantage of one of Pico’s most powerful features, Programmable I/O (PIO), which we explore by controlling NeoPixel LED strips.
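
As a taste of the I2C side (again, not an excerpt from the book), the following sketch scans the bus for connected devices such as an LCD backpack; it assumes the display is wired to I2C0 on GP8 (SDA) and GP9 (SCL):

from machine import I2C, Pin

# I2C0 on GP8 (SDA) and GP9 (SCL) -- an assumption; adjust to match your wiring.
i2c = I2C(0, sda=Pin(8), scl=Pin(9), freq=400_000)

# Print the addresses of any devices found on the bus, e.g. an LCD backpack.
print([hex(address) for address in i2c.scan()])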

Get your copy today!

You can buy Get Started with MicroPython on Raspberry Pi Pico now from the Raspberry Pi Press online store. If you don’t need the lovely new book, with its new-book smell, in your hands in real life, you can download a PDF version for free (or a small voluntary contribution).

STOP PRESS: we’ve spotted an error in the first print run of the book, affecting the code examples in Chapters 4 to 7. We’re sorry! Fortunately it’s easy for readers to correct in their own code; see here for everything you need to know. We’ve already corrected this in the PDF version.

The post New book: Get Started with MicroPython on Raspberry Pi Pico appeared first on Raspberry Pi.

State-Sponsored Threat Actors Target Security Researchers

Post Syndicated from boB Rudis original https://blog.rapid7.com/2021/01/26/state-sponsored-threat-actors-target-security-researchers/

State-Sponsored Threat Actors Target Security Researchers

This blog was co-authored by Caitlin Condon, VRM Security Research Manager, and Bob Rudis, Senior Director and Chief Security Data Scientist.

On Monday, Jan. 25, 2021, Google’s Threat Analysis Group (TAG) published a blog on a widespread social engineering campaign that targeted security researchers working on vulnerability research and development. The campaign, which Google attributed to North Korean (DPRK) state-sponsored actors, has been active for several months and sought to compromise researchers using several methods.

Rapid7 is aware that many security researchers were targeted in this campaign, and information is still developing. While we currently have no evidence that we were compromised, we are continuing to investigate logs and examine our systems for any of the IOCs listed in Google’s analysis. We will update this post with further information as it becomes available.

Organizations should take note that this was a highly sophisticated attack that was important enough to its orchestrators that they were willing to burn an as-yet unknown exploit path on it. This event is the latest in a chain of attacks—e.g., those targeting SonicWall, VMware, Mimecast, Malwarebytes, Microsoft, Crowdstrike, and SolarWinds—that demonstrates a significant increase in threat activity targeting cybersecurity firms with legitimately sophisticated campaigns. Scenarios like these should become standard components of tabletop exercises and active defense plans.

North Korean-attributed social engineering campaign

Google discovered that the DPRK threat actors had built credibility by establishing a vulnerability research blog and several Twitter profiles to interact with potential targets. They published videos of their alleged exploits, including a YouTube video of a fake proof-of-concept (PoC) exploit for CVE-2021-1647—a high-profile Windows Defender zero-day vulnerability that garnered attention from both security researchers and the media. The DPRK actors also published “guest” research (likely plagiarized from other researchers) on their blog to further build their reputation.

The malicious actors then used two methods to social engineer targets into accepting malware or visiting a malicious website. According to Google:

  • After establishing initial communications, the actors would ask the targeted researcher if they wanted to collaborate on vulnerability research together, and then provide the researcher with a Visual Studio Project. Within the Visual Studio Project would be source code for exploiting the vulnerability, as well as an additional pre-compiled library (DLL) that would be executed through Visual Studio Build Events. The DLL is custom malware that would immediately begin communicating with actor-controlled command and control (C2) domains.

State-Sponsored Threat Actors Target Security Researchers
Visual Studio Build Events command executed when building the provided VS Project files. Image provided by Google.

  • In addition to targeting users via social engineering, Google also observed several cases where researchers have been compromised after visiting the actors’ blog. In each of these cases, the researchers followed a link on Twitter to a write-up hosted on blog[.]br0vvnn[.]io, and shortly thereafter, a malicious service was installed on the researcher’s system and an in-memory backdoor would begin beaconing to an actor-owned command and control server. At the time of these visits, the victim systems were running fully patched and up-to-date Windows 10 and Chrome browser versions. As of Jan. 26, 2021, Google was unable to confirm the mechanism of compromise.

The blog the DPRK threat actors used to execute this zero-day drive-by attack was posted on Reddit as long as three months ago. The actors also used a range of social media and communications platforms to interact with targets—including Telegram, Keybase, Twitter, LinkedIn, and Discord. As of Jan. 26, 2021, many of these profiles have been suspended or deactivated.

Rapid7 customers

Google’s threat intelligence includes information on IOCs, command-and-control domains, actor-controlled social media accounts, and compromised domains used as part of the campaign. Rapid7’s MDR team is deploying IOCs and behavior-based detections. These detections will also be available to InsightIDR customers later today. We will update this blog post with further information as it becomes available.

Defender guidance

TAG noted in their blog post that they have so far only seen actors targeting Windows systems. As of the evening of Jan. 25, 2021, researchers across many companies confirmed on Twitter that they had interacted with the DPRK actors and/or visited the malicious blog. Organizations that believe their researchers or other employees may have been targeted should conduct internal investigations to determine whether indicators of compromise are present on their networks.

At a minimum, responders should:

  • Ensure members of all security teams are aware of this campaign and encourage individuals to report if they believe they were targeted by these actors.
  • Search web traffic, firewall, and DNS logs for evidence of contacts to the domains and URLs provided by Google in their post (a minimal search sketch follows this list).
  • According to Rapid7 Labs’ forward DNS archive, the br0vvnn[.]io apex domain has had two discovered fully qualified domain names (FQDNs)—api[.]br0vvnn[.]io and blog[.]br0vvnn[.]io—over the past four months with IP addresses 192[.]169[.]6[.]31 and 192[.]52[.]167[.]169, respectively. Contacts to those IPs should also be investigated in historical access records.
  • Check for evidence of the provided hashes on all systems, starting with those operated and accessed by members of security teams.
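
As a starting point for the log search mentioned above, here is a minimal sketch that greps plain-text DNS, proxy, or firewall log exports for the published indicators. It is an illustration only, not a substitute for proper forensic tooling, and the log file names are assumptions:

import re
import sys

# Indicators from the write-up above (defanged brackets removed).
DOMAINS = ("br0vvnn.io", "api.br0vvnn.io", "blog.br0vvnn.io")
IPS = ("192.169.6.31", "192.52.167.169")

pattern = re.compile("|".join(re.escape(indicator) for indicator in DOMAINS + IPS))

# Usage: python ioc_search.py dns.log proxy.log firewall.log
for path in sys.argv[1:]:
    with open(path, errors="replace") as log:
        for line_number, line in enumerate(log, 1):
            if pattern.search(line):
                print(f"{path}:{line_number}: {line.rstrip()}")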

Moving forward, organizations and individuals should heed Google’s advice that “if you are concerned that you are being targeted, we recommend that you compartmentalize your research activities using separate physical or virtual machines for general web browsing, interacting with others in the research community, accepting files from third parties and your own security research.”


Getting your notifications via Signal

Post Syndicated from Brian van Baekel original https://blog.zabbix.com/getting-your-notifications-via-signal/13286/

Recently, Whatsapp pushed their new privacy policy, in which they announced they would share more data with Facebook, causing an exodus to other platforms; Signal is one of the more popular destinations, along with Telegram. Both are great alternatives, but I prefer Signal due to the open-source part, end to end encryption, and last but not least: their business model (living on donations instead of selling your data).

Typically, Zabbix is sending notifications to whatever medium you’ve chosen if a problem is detected. We all know the Email messages, the various webhook integrations with Slack/MS Teams/ Jira, etc, perhaps even some text message integrations and such. Now, if we’re migrating to Signal, we suddenly have access to the Signal API and can utilize it to receive Zabbix notifications. Nice!

There is only one drawback. You need a separate phone number to register against Signal. Don’t use your own phone number – unless you want to lose the ability to use Signal on your own phone ;(

There are various ways to get a phone number for this purpose:

  • Use the phone number of your current SMS gateway
  • Use the company phone number (a lot of cloud PBX services provide the option to receive the verification message via email)
  • Purchase a prepaid phone number.
  • Use a service like Twilio

You just need to receive one text message; the rest of the communication will go via the internet.

Time to get rid of Whatsapp and move to Signal! But… How to use Signal to get your notifications?

Signal-cli

Although we could built everything from scratch, talking to the API of Signal, there is a nice implementation available in order to talk to Signal within a few minutes: Signal-cli

Although this GitHub page is very comprehensive when it comes to getting Signal-cli installed, it of course does not cover anything Zabbix-specific.

Configuration tasks

For this guide, we’re using:

  • Centos 8
  • Zabbix 5.2

signal-cli installation

First, lets install the Signal-cli utility, and in order to do so we need to resolve the dependency of Java by installing the openjdk application:

dnf -y install java-11-openjdk-devel.x86_64

After this installation, we should be good to continue with the installation of signal-cli. According to their installation guide, this should be sufficient:

export VERSION="0.7.3"
wget https://github.com/AsamK/signal-cli/releases/download/v"${VERSION}"/signal-cli-"${VERSION}".tar.gz
sudo tar xf signal-cli-"${VERSION}".tar.gz -C /opt
sudo ln -sf /opt/signal-cli-"${VERSION}"/bin/signal-cli /usr/local/bin/

At the time of writing, the most recent version is 0.7.3, and that’s what we’re installing here. If in the future a new version is released, of course you should install that!

If everything went as expected, we should be able to register ourself to Signal.

signal-cli registration

Since we want Zabbix to execute these commands, we must make sure the registration is done as the correct user on the Zabbix server; otherwise you will get the following error message:

Unregistered user error

(ERROR App – User +19293771253 is not registered.)

In order to prevent this error, lets do the authentication against Signal as Zabbix user:

Important: The USERNAME (your phone number) must include the country calling code, i.e. the number must start with a “+” sign and you must replace everything between the  < > in the following examples with your own values

runuser -l zabbix -c 'signal-cli -u <NUMBER> register'

Now, check for the incoming text message on this phone number. Within seconds you should receive a 6-digit code in the following format: xxx-xxx

Once you’ve received the text, it’s time to complete the registration:

runuser -l zabbix -c 'signal-cli -u <NUMBER> verify <CODE>'

Since we’re running these commands as a different user, we won’t see the output of them. Let’s just test!

Sending messages from the command line is straight forward:

runuser -l zabbix -c 'signal-cli -u <NUMBER> send -m <MESSAGE> <RECEIVER NUMBER>'

You will see the message id as output. Simply ignore it, since it’s not relevant at this point.

Within seconds:

It works! Great.

So now we’ve got this part covered, time to get the AlertScript set up, before heading to the frontend.

Zabbix AlertScript setup

Ok, so now we’ve got the registration done, we need to make sure Zabbix can utilise it. In order to do so, we use a very old method; although it would’ve made more sense to use the webhook option, that would have meant building the communication with Signal from scratch.

So AlertScripts it is. In your terminal/SSH session with the Zabbix server open a new file with this command: vi /usr/lib/zabbix/alertscripts/signal.sh and insert the following contents:

#!/bin/bash
signal-cli -u '+19293771253' send -m "$1" $2

That’s right, just 2 lines. After saving the file, change the owner and set the permissions:

chown zabbix:zabbix /usr/lib/zabbix/alertscripts/signal.sh
chmod 700 /usr/lib/zabbix/alertscripts/signal.sh

and it’s time to move to our frontend.

Zabbix mediatype configuration

In the frontend, go to Administration -> Mediatypes and create a new mediatype:

Signal Mediatype

Name: Signal
Type: Script
Script name: signal.sh
Script parameters:
    {ALERT.MESSAGE}
    {ALERT.SENDTO}

Don’t forget to configure some Message templates as well (the second tab in the mediatype configuration). You can just use the defaults by clicking Add.

Zabbix media configuration

Next step. Navigate to Administration -> Users (or just open your own user profile) and create a new media:

new-media

Type: Signal
Sendto: <your number>
When active / severity as per needs

Important: The USERNAME (your phone number) must include the country calling code, i.e. the number must start with a “+” sign

We’re almost there, just some configuration on the actions

Zabbix action configuration

This step is only needed if you are sending notifications right now via a specific mediatype. If you configured the ‘send only to’ option to ‘- All -‘ there is nothing to change, and it will work straight away!

Otherwise, navigate to Configuration -> Actions and find the action you want to change, and in the Operations, Recovery operations and Update operations change the ‘send only to’ option to ‘Signal’

Save your action, and it’s time to test: generate a problem to confirm the implementation actually works.

Wrap up

That’s it. By now you should have a working implementation where Zabbix is sending notifications to Signal. The setup was extremely straight forward and easy to configure. Nevertheless, if you need help getting this going, we (Opensource ICT Solutions) offer consultancy services as well, and are more than happy to help you out!

 

Elasticsearch – Scalability and Multitenancy [slides]

Post Syndicated from Bozho original https://techblog.bozho.net/elasticsearch-scalability-and-multitenancy-slides/

Last week I gave a talk in a local tech group about my experience with Elasticsearch at LogSentinel, and how we achieve multitenancy and scalability.

Obviously, the topic of scalability is huge and it can’t be fully covered in 45 minutes, but I tried presenting the main aspects from the application perspective (I entirely skipped the Ops perspective, as it was a developer audience). The list of resources at the end of the slides show some of the sources of my “research” on the topic, which I recommend going through.

Below are the slides (the talk was not in English):

I hope it’s a useful intro to the topic and the main conclusion is – it’s counterintuitive if you are used to relational databases, and some internals (shards, Lucene segments) “leak” through the abstractions to influence the application design (as per the law of leaky abstractions).

The post Elasticsearch – Scalability and Multitenancy [slides] appeared first on Bozho's tech blog.

Understanding memory usage in your Java application with Amazon CodeGuru Profiler

Post Syndicated from Fernando Ciciliati original https://aws.amazon.com/blogs/devops/understanding-memory-usage-in-your-java-application-with-amazon-codeguru-profiler/

“Where has all that free memory gone?” This is the question we ask ourselves every time our application emits that dreaded OutOfMemoryError just before it crashes. Amazon CodeGuru Profiler can help you find the answer.

Thanks to its brand-new memory profiling capabilities, troubleshooting and resolving memory issues in Java applications (or almost anything that runs on the JVM) is much easier. AWS launched the CodeGuru Profiler Heap Summary feature at re:Invent 2020. This is the first step in helping us, developers, understand what our software is doing with all that memory it uses.

The Heap Summary view shows a list of Java classes and data types present in the Java Virtual Machine heap, alongside the amount of memory they’re retaining and the number of instances they represent. The following screenshot shows an example of this view.

Amazon CodeGuru Profiler heap summary view example

Figure: Amazon CodeGuru Profiler Heap Summary feature

Because CodeGuru Profiler is a low-overhead, production profiling service designed to be always on, it can capture and represent how memory utilization varies over time, providing helpful visual hints about the object types and the data types that exhibit a growing trend in memory consumption.

In the preceding screenshot, we can see several lines on the graph, some of which are trending upwards:

  • The red top line, horizontal and flat, shows how much memory has been reserved as heap space in the JVM. In this case, we see a heap size of 512 MB, which can usually be configured in the JVM with command line parameters like -Xmx.
  • The second line from the top, blue, represents the total memory in use in the heap, independent of object type.
  • The third, fourth, and fifth lines show how much memory space each specific type has been using historically in the heap. We can easily spot that java.util.LinkedHashMap$Entry and java.lang.UUID display growing trends, whereas byte[] has a flat line and seems stable in memory usage.

Types that exhibit a constantly growing trend of memory utilization over time deserve a closer look. Profiler helps you focus your attention on these cases. By associating the information presented by the Profiler with your own knowledge of your application and code base, you can evaluate whether the amount of memory being used for a specific data type can be considered normal, or if it might be a memory leak – the unintentional holding of memory by an application due to a failure to free unused objects. In our example above, java.util.LinkedHashMap$Entry and java.lang.UUID are good candidates for investigation.

To make this functionality available to customers, CodeGuru Profiler uses the power of Java Flight Recorder (JFR), which is now openly available with Java 8 (since OpenJDK release 262) and above. The Amazon CodeGuru Profiler agent for Java, which already does an awesome job capturing data about CPU utilization, has been extended to periodically collect memory retention metrics from JFR and submit them for processing and visualization via Amazon CodeGuru Profiler. Thanks to its high stability and low overhead, the Profiler agent can be safely deployed to services in production, because it is exactly there, under real workloads, that really interesting memory issues are most likely to show up.

Summary

For more information about CodeGuru Profiler and other AI-powered services in the Amazon CodeGuru family, see Amazon CodeGuru. If you haven’t tried the CodeGuru Profiler yet, start your 90-day free trial right now and understand why continuous profiling is becoming a must-have in every production environment. For Amazon CodeGuru customers who are already enjoying the benefits of always-on profiling, this new feature is available at no extra cost. Just update your Profiler agent to version 1.1.0 or newer, and enable Heap Summary in your agent configuration.

 

Happy profiling!

AWS is the first global cloud service provider to comply with the new K-ISMS-P standard

Post Syndicated from Seulun Sung original https://aws.amazon.com/blogs/security/aws-is-the-first-global-cloud-service-provider-to-comply-with-the-new-k-isms-p-standard/

We’re excited to announce that Amazon Web Services (AWS) has achieved certification under the Korea-Personal Information & Information Security Management System (K-ISMS-P) standard (effective from December 16, 2020 to December 15, 2023). The assessment by the Korea Internet & Security Agency (KISA) covered the operation of infrastructure (including compute, storage, networking, databases, and security) in the AWS Asia Pacific (Seoul) Region. AWS was the first global cloud service provider (CSP) to obtain K-ISMS certification (the previous version of K-ISMS-P) back in 2017. Now AWS is the first global CSP to achieve compliance with the K-ISMS portion of the new K-ISMS-P standard.

Sponsored by KISA and affiliated with the Korean Ministry of Science and ICT (MSIT), K-ISMS-P serves as a standard for evaluating whether enterprises and organizations operate and manage their information security management systems consistently and securely, such that they thoroughly protect their information assets. The new K-ISMS-P standard combined the K-ISMS and K-PIMS (Personal Information Management System) standards with updated control items. Accordingly, the new K-ISMS certification and K-ISMS-P certification (personal information–focused) are introduced under the updated standard.

In this year’s audit, 110 services running in the Asia Pacific (Seoul) Region were included. The Availability Zone newly launched in 2020 was also added to the certification scope.

This certification helps enterprises and organizations across South Korea, regardless of industry, meet KISA compliance requirements more efficiently. Achieving this certification demonstrates the proactive approach AWS has taken to meet compliance set by the South Korean government and to deliver secure AWS services to customers. In addition, we’ve launched Quick Start and Operational Best Practices (conformance pack) pages to provide customers with a compliance framework that they can utilize for their K-ISMS-P compliance needs. Enterprises and organizations can use these toolkits and AWS certification to reduce the effort and cost of getting their own K-ISMS-P certification. You can download the AWS K-ISMS certification under the K-ISMS-P standard from AWS Artifact. To learn more about the AWS K-ISMS certification, see the AWS K-ISMS page. If you have any questions, don’t hesitate to contact your AWS Account Manager.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Seulun Sung

Seulun is a Security Audit Program Manager at AWS, leading security certification programs, with a focus on the K-ISMS-P program in South Korea. She has a decade of experience in deploying global policies and processes to local Regions and helping customers adopt regulations. She is passionate about helping to build customers’ trust and provide them assurance on cloud security.

Serving driver-partners data at scale using mirror cache

Post Syndicated from Grab Tech original https://engineering.grab.com/mirror-cache-blog

Since the early days, driver-partners have been the centerpiece of the wide range of services and features provided by the Grab platform. Over time, many backend microservices were developed to support our driver-partners, such as earnings, ratings, insurance, and so on. All of these different microservices require certain information, such as name, phone number, email, active car types, and so on, to curate the services provided to the driver-partners.

We built the Drivers Data service to provide driver-partners data to other microservices. The service attracts a high QPS and handles 10K requests during peak hours. Over the years, we have tried different strategies to serve driver-partners data in a resilient and cost-effective manner, while maintaining a low response time. In this blog post, we talk about mirror cache, an in-memory local caching solution built to serve driver-partners data efficiently.

What we started with

Figure 1. Drivers Data service architecture
Figure 1. Drivers Data service architecture

Our Drivers Data service previously used MySQL DB as persistent storage and two caching layers – standalone local cache (RAM of the EC2 instances) as primary cache and Redis as secondary for eventually consistent reads. With this setup, the cache hit ratio was very low.

Figure 2. Request flow chart
Figure 2. Request flow chart

We opted for a cache-aside strategy, so when a client request comes in, the Drivers Data service responds in the following manner (a short sketch of this read path follows the list):

  • If data is present in the in-memory cache (local cache), then the service directly sends back the response.
  • If data is not present in the in-memory cache and found in Redis, then the service sends back the response and updates the local cache asynchronously with data from Redis.
  • If data is not present either in the in-memory cache or Redis, then the service responds back with the data fetched from the MySQL DB and updates both Redis and local cache asynchronously.
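
The following is a minimal Python sketch of that cache-aside read path, for illustration only. The helper names (local_cache, redis_client, db, schedule_async) are hypothetical stand-ins; the actual service is not implemented this way line for line.

import threading

def schedule_async(fn, *args):
    # Fire-and-forget helper so cache refreshes stay off the request path.
    threading.Thread(target=fn, args=args, daemon=True).start()

def get_driver(driver_id, local_cache, redis_client, db, ttl_seconds=60):
    """Cache-aside read path: local cache -> Redis -> MySQL."""
    # 1. Serve directly from the in-memory (local) cache when possible.
    value = local_cache.get(driver_id)
    if value is not None:
        return value

    # 2. Fall back to Redis and refresh the local cache asynchronously.
    value = redis_client.get(driver_id)
    if value is not None:
        schedule_async(local_cache.set, driver_id, value, ttl_seconds)
        return value

    # 3. Finally, read from the DB and refresh both caches asynchronously.
    value = db.fetch_driver(driver_id)
    schedule_async(redis_client.set, driver_id, value, ttl_seconds)
    schedule_async(local_cache.set, driver_id, value, ttl_seconds)
    return value
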
Figure 3. Percentage of response from different sources
Figure 3. Percentage of response from different sources

The measurement of the response source revealed that during peak hours ~25% of the requests were being served via standalone local cache, ~20% by MySQL DB, and ~55% via Redis.

The low cache hit rate is caused by the driver-partners data loading pattern: low frequency per driver over time, but high frequency within short bursts. When a driver-partner is a candidate for a job or is involved in an ongoing job, different services make multiple requests to the Drivers Data service to fetch that specific driver-partner’s information. The frequency of calls for a specific driver-partner drops when they are not involved in the job allocation process or are not doing any job at the moment.

While the low frequency per driver over time impacts the Redis cache hit rate, the high frequency in short bursts mostly contributes to the in-memory cache hit rate. In our investigations, we found that local caches of different nodes in the Drivers Data service cluster were making redundant calls to Redis and the DB to fetch the same data that was already present in another node’s local cache.

By making in-memory cache entries available on every instance while the data is in active use, we could greatly increase the in-memory cache hit rate, and that’s what we did.

Mirror cache design goals

We set the following design goals:

  • Support a local least recently used (LRU) cache use-case.
  • Support active cache invalidation.
  • Support best effort replication between local cache instances (EC2 instances). If any instance successfully fetches the latest data from the database, then it should try to replicate or mirror this latest data across all the other nodes in the cluster. If replication fails and the item is expired or not found, then the nodes should fetch it from the database.
  • Support async data replication across nodes to ensure that updates for the same key happen only with more recent data. For any older update, the incoming data is ignored and the current data in the cache is kept. The ordering of cache updates is not guaranteed due to the async replication.
  • Ability to handle auto-scaling.

The building blocks

Figure 4. Mirror cache
Figure 4. Mirror cache

The mirror cache library runs alongside the Drivers Data service inside each of the EC2 instances of the cluster. The two main components are in-memory cache and replicator.

In-memory cache

The in-memory cache is used to store multiple key/value pairs in RAM. There is a TTL associated with each key/value pair. We wanted a cache that provides a high hit ratio, bounded memory usage, high throughput, and concurrency. After evaluating several options, we went with dgraph’s open-source concurrent caching library Ristretto as our in-memory local cache. We were particularly impressed by its use of the TinyLFU admission policy to ensure a high hit ratio.
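
As a rough illustration of this kind of TTL-bound, size-bound local cache, here is a small Python sketch using the cachetools library (the actual implementation uses Ristretto in Go; the key and values below are made up):

from cachetools import TTLCache

# A size-bound cache whose entries expire after a TTL. Note that cachetools
# applies a single TTL to the whole cache, whereas mirror cache tracks a TTL
# per key/value pair, and Ristretto adds TinyLFU admission on top.
cache = TTLCache(maxsize=100_000, ttl=60)  # illustrative values

cache["driver:42"] = {"name": "Alice", "active_car_types": ["sedan"]}
print(cache.get("driver:42"))  # cache hit while the entry is fresh
# Once the 60-second TTL elapses, get() returns None and the caller falls
# back to Redis or the DB, exactly as in the request flow described earlier.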

Replicator

The replicator is responsible for mirroring/replicating each key/value entry among all the live instances of the Drivers Data service. The replicator has three main components: Membership Store, Notifier, and gRPC Server.

Membership Store

The Membership Store registers callbacks with our service discovery service to notify mirror cache in case any nodes are added or removed from the Drivers Data service cluster.

It maintains two maps: the nodes in the same AZ (AWS Availability Zone) as itself (the current node of the Drivers Data service in which mirror cache is running), and the nodes in the other AZs.

Notifier

Each service (Drivers Data) node runs a single instance of mirror cache, so effectively each node has one notifier. The notifier is responsible for the following:

  • Combine several (key/value) pairs updates to form a batch.
  • Propagate the batch updates among all the nodes in the same AZ as itself.
  • Send the batch updates to exactly one notifier (node) in each of the other AZs, which, in turn, is responsible for updating all the nodes in its own AZ with the latest batch of data. This communication technique helps reduce cross-AZ data transfer overheads (a sketch of this fan-out follows below).

In the case of auto-scaling, there is a warm-up period during which the notifier doesn’t notify the other nodes in the cluster. This is done to minimize duplicate data propagation. The warm-up period is configurable.
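
A minimal Python sketch of the batching and fan-out described above is shown below. The names (membership, grpc_clients, send_updates) are hypothetical; the real notifier is part of the Go library.

import random

def flush_batch(batch, membership, grpc_clients, warmed_up=True):
    """Illustrative fan-out of a batch of cache updates.

    membership.same_az_nodes / membership.other_az_nodes mirror the two maps
    kept by the Membership Store; grpc_clients maps a node to its gRPC stub.
    """
    if not batch or not warmed_up:
        # During the warm-up period after auto-scaling, the notifier stays quiet.
        return

    # Propagate the batch to every node in the notifier's own AZ.
    for node in membership.same_az_nodes:
        grpc_clients[node].send_updates(batch)

    # Send the batch to exactly one node per other AZ; that node then fans
    # the updates out within its own AZ, keeping cross-AZ traffic low.
    for az_nodes in membership.other_az_nodes.values():
        if az_nodes:
            grpc_clients[random.choice(az_nodes)].send_updates(batch)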

gRPC Server

An exclusive gRPC server runs for mirror cache. The different nodes of the Drivers Data service use this server to receive new cache updates from the other nodes in the cluster.

Here’s the structure of each cache update entity:

message Entity {
    string key = 1; // Key for cache entry.
    bytes value = 2; // Value associated with the key.
    Metadata metadata = 3; // Metadata related to the entity.
    replicationType replicate = 4; // Further actions to be undertaken by the mirror cache after updating its own in-memory cache.
    int64 TTL = 5; // TTL associated with the data.
    bool  delete = 6; // If delete is set to true, then mirror cache needs to delete the key from its local cache.
}

enum replicationType {
    Nothing = 0; // Stop propagation of the request.
    SameRZ = 1; // Notify the nodes in the same Region and AZ.
}

message Metadata {
    int64 updatedAt = 1; // Same as updatedAt time of DB.
}

The server first checks if the local cache should update this new value or not. It tries to fetch the existing value for the key. If the value is not found, then the new key/value pair is added. If there is an existing value, then it compares the updatedAt time to ensure that stale data is not updated in the cache.

If the replicationType is Nothing, then the mirror cache stops further replication. If the replicationType is SameRZ, then the mirror cache tries to propagate this cache update to all the nodes in the same AZ as itself.
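
A simplified Python sketch of that update check is shown below, purely for illustration; the cache interface (get, set, pop) is a hypothetical stand-in for the Go implementation.

def apply_update(local_cache, entity):
    """Last-write-wins check before accepting a replicated cache update."""
    if entity.delete:
        local_cache.pop(entity.key, None)
        return True

    existing = local_cache.get(entity.key)
    # Accept the update only if nothing is cached yet, or if the incoming
    # data is newer than what we hold (updatedAt mirrors the DB timestamp).
    if existing is None or entity.metadata.updatedAt > existing.metadata.updatedAt:
        local_cache.set(entity.key, entity, ttl=entity.TTL)
        return True
    return False  # stale update, ignore it

A True return would then let the caller decide, based on replicationType, whether to propagate the update further.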

Run at scale

Figure 5. Drivers Data Service new architecture
Figure 5. Drivers Data Service new architecture

The behavior of the service hasn’t changed and the requests are being served in the same manner as before. The only difference here is the replacement of the standalone local cache in each of the nodes with mirror cache. It is the responsibility of mirror cache to replicate any cache updates to the other nodes in the cluster.

After mirror cache was fully rolled out to production, we rechecked our metrics related to the response source and saw a huge improvement. The graph showed that during peak hours ~75% of the responses were served from the in-memory local cache, about 15% by MySQL DB, and a further 10% via Redis.

The local cache hit ratio was at 0.75, a jump of 0.5 from before, and there was a 5% drop in the number of DB calls too.

Figure 6. New percentage of response from different sources
Figure 6. New percentage of response from different sources

Limitations and future improvements

Mirror cache is eventually consistent, so it is not a good choice for systems that need strong consistency.

Mirror cache stores all the data in volatile memory (RAM), which is wiped out during deployments, resulting in a temporary load increase on Redis and the DB.

Also, many new driver-partners are added every day to the Grab system, and we might need to increase the cache size to maintain a high hit ratio. To address these issues, we plan to use SSDs in the future to store part of the data, and use RAM only to store hot data.

Conclusion

Mirror cache really helped us scale the Drivers Data service better and serve driver-partners data to the different microservices at low latencies. It also helped us achieve our original goal of an increase in the local cache hit ratio.

We also extended mirror cache to some other services and found similarly promising results.


A huge shout out to Haoqiang Zhang and Roman Atachiants for their inputs into the final design. Special thanks to the Driver Backend team at Grab for their contribution.


Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

Running cost optimized Spark workloads on Kubernetes using EC2 Spot Instances

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/running-cost-optimized-spark-workloads-on-kubernetes-using-ec2-spot-instances/

This post is written by Kinnar Sen, Senior Solutions Architect, EC2 Spot 

Apache Spark is an open-source, distributed processing system used for big data workloads. It provides API operations to perform multiple tasks such as streaming, extract transform load (ETL), query, machine learning (ML), and graph processing. Spark supports four different types of cluster managers (Spark standalone, Apache Mesos, Hadoop YARN, and Kubernetes), which are responsible for scheduling and allocation of resources in the cluster. Spark can run with native Kubernetes support since 2018 (Spark 2.3). AWS customers that have already chosen Kubernetes as their container orchestration tool can also choose to run Spark applications in Kubernetes, increasing the effectiveness of their operations and compute resources.

In this post, I illustrate the deployment of scalable, resilient, and cost optimized Spark application using Kubernetes via Amazon Elastic Kubernetes Service (Amazon EKS) and Amazon EC2 Spot Instances. Learn how to save money on big data workloads by implementing this solution.

Overview

Amazon EC2 Spot Instances

Amazon EC2 Spot Instances let you take advantage of unused EC2 capacity in the AWS Cloud. Spot Instances are available at up to a 90% discount compared to On-Demand Instance prices. A capacity pool is a group of EC2 instances that belong to a particular instance family, size, and Availability Zone (AZ). If EC2 needs the capacity back for On-Demand Instance usage, Spot Instances can be interrupted by EC2 with a two-minute notification. There are many graceful ways to handle the interruption to ensure that the application is well architected for resilience and fault tolerance. This can be automated via the application and/or infrastructure deployments. Spot Instances are ideal for stateless, fault tolerant, loosely coupled, and flexible workloads that can handle interruptions.
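
For example, one common pattern is to poll the EC2 instance metadata service for a Spot interruption notice and start draining work as soon as one appears. The Python sketch below assumes IMDSv1 is reachable (with IMDSv2 you would first fetch a session token), and drain_gracefully is a placeholder for your own handler:

import time
import requests

INSTANCE_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def watch_for_interruption(drain_gracefully, poll_seconds=5):
    """Poll the instance metadata service for a Spot interruption notice."""
    while True:
        resp = requests.get(INSTANCE_ACTION_URL, timeout=2)
        if resp.status_code == 200:
            # A 200 response means an interruption is scheduled; the body
            # contains the action (e.g. terminate) and the time it takes effect.
            drain_gracefully(resp.json())
            return
        time.sleep(poll_seconds)  # 404 means no interruption is scheduled yet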

Amazon Elastic Kubernetes Service

Amazon EKS is a fully managed Kubernetes service that makes it easy for you to run Kubernetes on AWS without needing to install, operate, and maintain your own Kubernetes control plane. It provides a highly available and scalable managed control plane. It also provides managed worker nodes, which let you create, update, or shut down worker nodes for your cluster with a single command. It is a great choice for deploying flexible and fault tolerant containerized applications. Amazon EKS supports creating and managing Amazon EC2 Spot Instances using Amazon EKS managed node groups following Spot best practices. This enables you to take advantage of the steep savings and scale that Spot Instances provide for interruptible workloads running in your Kubernetes cluster. Using EKS managed node groups with Spot Instances requires less operational effort compared to using self-managed nodes. In addition to launching Spot Instances in managed node groups, it is possible to specify multiple instance types in EKS managed node groups. You can find more details in this blog post.

Apache Spark and Kubernetes

When a Spark application is submitted to the Kubernetes cluster, the following happens:

  • A Spark driver is created.
  • The driver and the executors run within pods.
  • The Spark driver then requests for executors, which are scheduled to run within pods. The executors are managed by the driver.
  • The application is launched and, once it completes, the executor pods are cleaned up. The driver pod persists the logs and remains in a completed state until the pod is cleared by garbage collection or manually removed. The driver in a completed state does not consume any memory or compute resources.

Spark Deployment on Kubernetes Cluster

When a Spark application runs on clusters managed by Kubernetes, the native Kubernetes scheduler is used. It is possible to schedule the driver/executor pods on a subset of available nodes. The applications can be launched either by a vanilla ‘spark-submit’, a workflow orchestrator like Apache Airflow, or the Spark operator. I use vanilla ‘spark-submit’ in this blog. Amazon EMR on EKS is also able to schedule Spark applications on EKS clusters, as described in this launch blog, but it is out of scope for this post.

Cost optimization

For any organization running big data workloads there are three key requirements: scalability, performance, and low cost. As the size of data increases, there is demand for more compute capacity and the total cost of ownership increases. It is critical to optimize the cost of big data applications. Big Data frameworks (in this case, Spark) are distributed to manage and process high volumes of data. These frameworks are designed for failure, can run on machines with different configurations, and are inherently resilient and flexible.

If Spark is deployed on Kubernetes, the executor pods can be scheduled on EC2 Spot Instances and the driver pods on On-Demand Instances. This reduces the overall cost of deployment – Spot Instances can save up to 90% over On-Demand Instance prices. This also enables faster results by scaling out executors running on Spot Instances. Spot Instances, by design, can be interrupted when EC2 needs the capacity back. If a driver pod running on a Spot Instance is interrupted, the application fails and must be re-submitted. To avoid this situation, the driver pod can be scheduled on On-Demand Instances only. This adds a layer of resiliency to the Spark application running on Kubernetes. To cost optimize the deployment, all the executor pods are scheduled on Spot Instances, as that’s where the bulk of compute happens. Spark’s inherent resiliency has the driver launch new executors to replace the ones that fail due to Spot interruptions.

There are a couple of key points to note here.

  • The idea is to start with the minimum number of nodes for both On-Demand and Spot Instances (one each) and then auto-scale using Cluster Autoscaler and EC2 Auto Scaling. Cluster Autoscaler for AWS provides integration with Auto Scaling groups. If there are not sufficient resources, the driver and executor pods go into a pending state. The Cluster Autoscaler detects pods in the pending state and scales worker nodes within the identified Auto Scaling group in the cluster using EC2 Auto Scaling.
  • The scaling for On-Demand and Spot nodes is exclusive of one another. So, if multiple applications are launched the driver and executor pods can be scheduled in different node groups independently per the resource requirements. This helps reduce job failures due to lack of resources for the driver, thus adding to the overall resiliency of the system.
  • Using EKS Managed node groups
    • This requires significantly less operational effort compared to using self-managed node groups and enables:
      • Auto enforcement of Spot best practices like the capacity-optimized allocation strategy, Capacity Rebalancing, and the use of multiple instance types.
      • Proactive replacement of Spot nodes using rebalance notifications.
      • Managed draining of Spot nodes via re-balance recommendations.
    • The nodes are auto-labeled so that the pods can be scheduled with NodeAffinity.
      • eks.amazonaws.com/capacityType: SPOT
      • eks.amazonaws.com/capacityType: ON_DEMAND

Now that you understand the products and best practices used in this tutorial, let’s get started.

Tutorial: running Spark in EKS managed node groups with Spot Instances

In this tutorial, I review the steps that help you launch cost optimized and resilient Spark jobs inside Kubernetes clusters running on EKS. I launch a word-count application counting the words from an Amazon Customer Review dataset and write the output to an Amazon S3 folder. To run the Spark workload on Kubernetes, make sure you have eksctl and kubectl installed on your computer or on an AWS Cloud9 environment. You can run this by using an AWS IAM user or role that has the AdministratorAccess policy attached to it, or check the minimum required permissions for using eksctl. The Spot node groups in the Amazon EKS cluster can be launched either in a managed or a self-managed way; in this post, I use the former. The config files for this tutorial can be found here. The job is finally launched in cluster mode.

Create Amazon S3 Access Policy

First, I must create an Amazon S3 access policy to allow the Spark application to read/write from Amazon S3. Amazon S3 Access is provisioned by attaching the policy by ARN to the node groups. This associates Amazon S3 access to the NodeInstanceRole and, hence, the node groups then have access to Amazon S3. Download the Amazon S3 policy file from here and modify the <<output folder>> to an Amazon S3 bucket you created. Run the following to create the policy. Note the ARN.

aws iam create-policy --policy-name spark-s3-policy --policy-document file://spark-s3.json

Cluster and node groups deployment

Create an EKS cluster using the following command:

eksctl create cluster --name=sparkonk8 --node-private-networking --without-nodegroup --asg-access --region=<<AWS Region>>

The cluster takes approximately 15 minutes to launch.

Create the nodegroup using the nodeGroup config file. Replace the <<Policy ARN>> string using the ARN string from the previous step.

eksctl create nodegroup -f managedNodeGroups.yml

Scheduling driver/executor pods

The driver and executor pods can be assigned to nodes using affinity. PodTemplates can be used to configure details that are not supported by the Spark launch configuration by default. This feature is available from Spark 3.0.0. requiredDuringScheduling node affinity is used to schedule the driver and executor pods. Sample podTemplates have been uploaded here.

Launching a Spark application

Create a service account. The Spark driver pod uses the service account to create and watch executor pods using the Kubernetes API server.

kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole='edit'  --serviceaccount=default:spark --namespace=default

Download the Cluster Autoscaler and edit it to add the cluster-name. 

curl -LO https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml

Install the Cluster AutoScaler using the following command:

kubectl apply -f cluster-autoscaler-autodiscover.yaml

Get the details of the Kubernetes control plane to get the master URL.

kubectl cluster-info 

command output

Use the following instructions to build the docker image.

Download the application file (script.py) from here and upload into the Amazon S3 bucket created.

Download the pod template files from here. Submit the application.

bin/spark-submit \
--master k8s://<<MASTER URL>> \
--deploy-mode cluster \
--name 'Job Name' \
--conf spark.eventLog.dir=s3a://<<S3 BUCKET>>/logs \
--conf spark.eventLog.enabled=true \
--conf spark.history.fs.inProgressOptimization.enabled=true \
--conf spark.history.fs.update.interval=5s \
--conf spark.kubernetes.container.image=<<ECR Spark Docker Image>> \
--conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
--conf spark.kubernetes.driver.podTemplateFile='../driver_pod_template.yml' \
--conf spark.kubernetes.executor.podTemplateFile='../executor_pod_template.yml' \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.shuffleTracking.enabled=true \
--conf spark.dynamicAllocation.maxExecutors=100 \
--conf spark.dynamicAllocation.executorAllocationRatio=0.33 \
--conf spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=30 \
--conf spark.dynamicAllocation.executorIdleTimeout=60s \
--conf spark.driver.memory=8g \
--conf spark.kubernetes.driver.request.cores=2 \
--conf spark.kubernetes.driver.limit.cores=4 \
--conf spark.executor.memory=8g \
--conf spark.kubernetes.executor.request.cores=2 \
--conf spark.kubernetes.executor.limit.cores=4 \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.connection.ssl.enabled=false \
--conf spark.hadoop.fs.s3a.fast.upload=true \
s3a://<<S3 BUCKET>>/script.py \
s3a://<<S3 BUCKET>>/output 

A couple of key points to note here:

  • podTemplateFile is used here, which enables scheduling of the driver pods to On-Demand Instances and executor pods to Spot Instances.
  • Spark provides a mechanism to allocate resources dynamically based on workloads. In the latest release of Spark (3.0.0), dynamicAllocation can be used with the Kubernetes cluster manager. Executors that do not store active shuffle files can be removed to free up resources. dynamicAllocation works well in tandem with Cluster Autoscaler for resource allocation and optimizes resources for jobs. We are using dynamicAllocation here to enable optimized resource sharing.
  • The application file and output are both in Amazon S3.

Output Files in S3

  • Spark Event logs are redirected to Amazon S3. Spark on Kubernetes creates local temporary files for logs and removes them once the application completes. The logs are redirected to Amazon S3 and Spark History Server can be used to analyze the logs. Note, you can create more instrumentation using tools like Prometheus and Grafana to monitor and manage the cluster.

Spark History Server + Dynamic Allocation

Observations

EC2 Spot Interruptions

The following diagram and log screenshot details from Spark History server showcases the behavior of a Spark application in case of an EC2 Spot interruption.

Four Spark applications were launched in parallel in a cluster, and one of the Spot nodes was interrupted. A couple of executor pods were shut down in three of the four applications, but due to the resilient nature of Spark, new executors were launched and the applications finished at almost the same time.
Spark jobs

The Spark driver identified the shut-down executors that handled the shuffle files and relaunched the tasks that were running on those executors.

Dynamic Allocation

Dynamic Allocation works with the caveat that it is an experimental feature.

dynamic allocation

Cost Optimization

Cost optimization is achieved in several different ways in this tutorial.

  • Use of 100% Spot Instances for the Spark executors
  • Use of dynamicAllocation along with Cluster Autoscaler makes optimized use of resources and hence saves cost
  • Deploying one driver node and one executor node to begin with, and then scaling up on demand, reduces the waste of a continuously running cluster

Cluster Autoscaling

Cluster Autoscaler is triggered, as designed, when there are pending (Spark executor) pods.

The Cluster Autoscaler logs can be fetched by:

kubectl logs -f deployment/cluster-autoscaler -n kube-system --tail=10

Cluster Autoscaler Logs 

Cleanup

If you are trying out the tutorial, run the following steps to make sure that you don’t encounter unwanted costs.

Delete the EKS cluster and the nodegroups with the following command:

eksctl delete cluster --name sparkonk8

Delete the Amazon S3 Access Policy with the following command:

aws iam delete-policy --policy-arn <<POLICY ARN>>

Delete the Amazon S3 Output Bucket with the following command:

aws s3 rb --force s3://<<S3_BUCKET>>

Conclusion

In this blog, I demonstrated how you can run Spark workloads on a Kubernetes cluster using Spot Instances, achieving scalability, resilience, and cost optimization. To cost optimize your Spark-based big data workloads, consider running Spark applications using Kubernetes and EC2 Spot Instances.

 

 

 

Building a Jenkins Pipeline with AWS SAM

Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/building-a-jenkins-pipeline-with-aws-sam/

This post is courtesy of Tarun Kumar Mall, SDE at AWS.

This post shows how to set up a multi-stage pipeline on a Jenkins host for a serverless application, using the AWS Serverless Application Model (AWS SAM).

Overview

This tutorial uses Jenkins Pipeline plugin. A commit to the main branch of the repository starts and deploys the application, using the AWS SAM CLI. This tutorial deploys a small serverless API application called HelloWorldApi.

The pipeline consists of stages to build and deploy the application. Jenkins first ensures that the build environment is set up and installs any necessary tools. Next, Jenkins prepares the build artifacts. It promotes the artifacts to the next stage, where they are deployed to a beta environment using the AWS SAM CLI. Integration tests are run after deployment. If the tests pass, the application is deployed to the production environment.

CICD workflow diagram

CICD workflow diagram

The following prerequisites are required:

Setting up the backend application and development stack

Using AWS CloudFormation to define the infrastructure, you can create multiple environments or stacks from the same infrastructure definition. A “dev stack” is a copy of production infrastructure deployed to a developer account for testing purposes.

As serverless services use a pay-for-value model, it can be cost effective to use a high-fidelity copy of your production stack. Dev stacks are created by each developer as needed and deleted without having any negative impact on production.

For complex applications, it may not be feasible for every developer to have their own stack. However, for this tutorial, setting up the dev stack first for testing is recommended. Setting up a dev stack takes you through a manual process of how a stack is created. Later, this process is used to automate the setup using Jenkins.

To create a dev stack:

  1. Clone backend application repository https://github.com/aws-samples/aws-sam-jenkins-pipeline-tutorial
    git clone https://github.com/aws-samples/aws-sam-jenkins-pipeline-tutorial.git
  2. Build the application and run the guided deploy command:
    cd aws-sam-jenkins-pipeline-tutorial
    sam build
    sam deploy --guided

    AWS SAM guided deploy output

    AWS SAM guided deploy output

This sets up a development stack and saves the settings in the samconfig.toml file with configuration environment specific to a user. This also triggers a deployment.

  1. After deployment, make a small code change. For example, in the file hello-world/app.js change the message Hello world to Hello world from user <your name>.
  2. Deploy the updated code:
    sam build
    sam deploy --config-env <your_username>

With this command, each developer can create their own configuration environment. They can use this for deploying to their development stack and testing changes before pushing changes to the repository.

Once deployment finishes, the API endpoint is displayed in the console output. You can use this endpoint to make GET requests and test the API manually.

Deployment output

Deployment output
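
For a quick manual check, you can hit that endpoint with any HTTP client. Here is a short Python example; the URL is a placeholder for the endpoint shown in your own deploy output, and the expected message assumes the code change from the earlier step:

import requests

# Replace with the API endpoint printed by sam deploy.
endpoint = "https://<api-id>.execute-api.us-west-2.amazonaws.com/Prod/hello/"

response = requests.get(endpoint)
print(response.status_code)  # expect 200
print(response.json())       # expect a message like "Hello world from user <your name>"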

To update and run the integration test:

  1. Open the hello-world/tests/integ/test-integ-api.js file.
  2. Update the assert statement in line 32 to include <your name> from the previous step:
    it("verifies if response contains my username", async () => {
      assert.include(apiResponse.data.message, "<your name>");
    });
  3. Open package.json and add the integ-test line shown below:
    {
      ...
      "scripts": {
        "test": "mocha tests/unit/",
        "integ-test": "mocha tests/integ/"
      }
      ...
    }
  4. From the terminal, run the following commands:
    cd hello-world
    npm install
    AWS_REGION=us-west-2 STACK_NAME=sam-app-user1-dev-stack npm run integ-test
    If you are using Microsoft Windows, instead run:
    cd hello-world
    npm install
    set AWS_REGION=us-west-2
    set STACK_NAME=sam-app-user1-dev-stack
    npm run integ-test

    Test results

    Test results

You have deployed a fully configured development stack with working integration tests. To push the code to GitHub:

  1. Create a new repository in GitHub.
    1. From the GitHub account homepage, choose New.
    2. Enter a repository name and choose Create Repository.
    3. Copy the repository URL.
  2. From the root directory of the AWS SAM project, run:
    git init
    git commit -am "first commit"
    git remote add origin <your-repository-url>
    git push -u origin main

Creating an IAM user for Jenkins

To create an IAM user for the Jenkins deployment:

  1. Sign in to the AWS Management Console and navigate to IAM.
  2. Select Users from side navigation and choose Add user.
  3. Enter the User name as sam-jenkins-demo-credentials and grant Programmatic access to this user.
  4. On the next page, select Attach existing policies directly and choose Create Policy.
  5. Select the JSON tab and enter the following policy. Replace <YOUR_ACCOUNT_ID> with your AWS account ID:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "CloudFormationTemplate",
                "Effect": "Allow",
                "Action": [
                    "cloudformation:CreateChangeSet"
                ],
                "Resource": [
                    "arn:aws:cloudformation:*:aws:transform/Serverless-2016-10-31"
                ]
            },
            {
                "Sid": "CloudFormationStack",
                "Effect": "Allow",
                "Action": [
                    "cloudformation:CreateChangeSet",
                    "cloudformation:DeleteStack",
                    "cloudformation:DescribeChangeSet",
                    "cloudformation:DescribeStackEvents",
                    "cloudformation:DescribeStacks",
                    "cloudformation:ExecuteChangeSet",
                    "cloudformation:GetTemplateSummary"
                ],
                "Resource": [
                    "arn:aws:cloudformation:*:<YOUR_ACCOUNT_ID>:stack/*"
                ]
            },
            {
                "Sid": "S3",
                "Effect": "Allow",
                "Action": [
                    "s3:CreateBucket",
                    "s3:GetObject",
                    "s3:PutObject"
                ],
                "Resource": [
                    "arn:aws:s3:::*/*"
                ]
            },
            {
                "Sid": "Lambda",
                "Effect": "Allow",
                "Action": [
                    "lambda:AddPermission",
                    "lambda:CreateFunction",
                    "lambda:DeleteFunction",
                    "lambda:GetFunction",
                    "lambda:GetFunctionConfiguration",
                    "lambda:ListTags",
                    "lambda:RemovePermission",
                    "lambda:TagResource",
                    "lambda:UntagResource",
                    "lambda:UpdateFunctionCode",
                    "lambda:UpdateFunctionConfiguration"
                ],
                "Resource": [
                    "arn:aws:lambda:*:<YOUR_ACCOUNT_ID>:function:*"
                ]
            },
            {
                "Sid": "IAM",
                "Effect": "Allow",
                "Action": [
                    "iam:AttachRolePolicy",
                    "iam:CreateRole",
                    "iam:DeleteRole",
                    "iam:DetachRolePolicy",
                    "iam:GetRole",
                    "iam:PassRole",
                    "iam:TagRole"
                ],
                "Resource": [
                    "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/*"
                ]
            },
            {
                "Sid": "APIGateway",
                "Effect": "Allow",
                "Action": [
                    "apigateway:DELETE",
                    "apigateway:GET",
                    "apigateway:PATCH",
                    "apigateway:POST",
                    "apigateway:PUT"
                ],
                "Resource": [
                    "arn:aws:apigateway:*::*"
                ]
            }
        ]
    }
  6. Choose Review Policy and add a policy name on the next page.
  7. Choose Create Policy button.
  8. Return to the previous tab to continue creating the IAM user. Choose Refresh and search for the policy name you created. Select the policy.
  9. Choose Next: Tags and then Next: Review.
  10. Choose Create user and save the Access key ID and Secret access key.

Configuring Jenkins

To configure AWS credentials in Jenkins:

  1. On the Jenkins dashboard, go to Manage Jenkins > Manage Plugins in the Available tab. Search for the Pipeline: AWS Steps plugin and choose Install without restart.
  2. Navigate to Manage Jenkins > Manage Credentials > Jenkins (global) > Global Credentials > Add Credentials.
  3. Select Kind as AWS credentials and use the ID sam-jenkins-demo-credentials.
  4. Enter the access key ID and secret access key and choose OK.

    Jenkins credential configuration

    Jenkins credential configuration

  5. Create Amazon S3 buckets for each Region in the pipeline. S3 bucket names must be unique within a partition:
    aws s3 mb s3://sam-jenkins-demo-us-west-2-<your_name> --region us-west-2
    aws s3 mb s3://sam-jenkins-demo-us-east-1-<your_name> --region us-east-1
  6. Create a file named Jenkinsfile at the root of the project and add:
    pipeline {
      agent any
     
      stages {
        stage('Install sam-cli') {
          steps {
            sh 'python3 -m venv venv && venv/bin/pip install aws-sam-cli'
            stash includes: '**/venv/**/*', name: 'venv'
          }
        }
        stage('Build') {
          steps {
            unstash 'venv'
            sh 'venv/bin/sam build'
            stash includes: '**/.aws-sam/**/*', name: 'aws-sam'
          }
        }
        stage('beta') {
          environment {
            STACK_NAME = 'sam-app-beta-stage'
            S3_BUCKET = 'sam-jenkins-demo-us-west-2-user1'
          }
          steps {
            withAWS(credentials: 'sam-jenkins-demo-credentials', region: 'us-west-2') {
              unstash 'venv'
              unstash 'aws-sam'
              sh 'venv/bin/sam deploy --stack-name $STACK_NAME -t template.yaml --s3-bucket $S3_BUCKET --capabilities CAPABILITY_IAM'
              dir ('hello-world') {
                sh 'npm ci'
                sh 'npm run integ-test'
              }
            }
          }
        }
        stage('prod') {
          environment {
            STACK_NAME = 'sam-app-prod-stage'
            S3_BUCKET = 'sam-jenkins-demo-us-east-1-user1'
          }
          steps {
            withAWS(credentials: 'sam-jenkins-demo-credentials', region: 'us-east-1') {
              unstash 'venv'
              unstash 'aws-sam'
              sh 'venv/bin/sam deploy --stack-name $STACK_NAME -t template.yaml --s3-bucket $S3_BUCKET --capabilities CAPABILITY_IAM'
            }
          }
        }
      }
    }
  7. Commit and push the code to the GitHub repository by running following commands:
    git commit -am "Adding Jenkins pipeline config."
    git push origin -u main

Next, create a Jenkins Pipeline project:

  1. From the Jenkins dashboard, choose New Item, select Pipeline, and enter the project name sam-jenkins-demo-pipeline.

    Jenkins Pipeline creation wizard

    Jenkins Pipeline creation wizard

  2. Under Build Triggers, select Poll SCM and enter * * * * *. This polls the repository for changes every minute.

    Jenkins build triggers configuration

    Jenkins build triggers configuration

  3. Under the Pipeline section, select Definition as Pipeline script from SCM.
    • Select GIT under SCM and enter the repository URL.
    • Set Branches to build to */main.
    • Set the Script Path to Jenkinsfile.

      Jenkins pipeline configuration

      Jenkins pipeline configuration

  4. Save the project.

After the build finishes, you see the pipeline:

Jenkins pipeline stages

Jenkins pipeline stages

Review the logs for the beta stage to check that the integration test is completed successfully.

Jenkins stage logs

Jenkins stage logs

Conclusion

This tutorial uses a Jenkins Pipeline to add an automated CI/CD pipeline to an AWS SAM-generated example application. Jenkins automatically builds, tests, and deploys the changes after each commit to the repository.

Using Jenkins, developers can gain the benefits of continuous integration and continuous deployment of serverless applications to the AWS Cloud with minimal configuration.

For more information, see the Jenkins Pipeline and AWS Serverless Application Model documentation.

We want to hear your feedback! Is this approach useful for your organization? Do you want to see another implementation? Contact us on Twitter @edjgeek or via comments!

Amazon SES celebrates 10 years of email sending and deliverability

Post Syndicated from Simon Poile original https://aws.amazon.com/blogs/messaging-and-targeting/amazon-ses-celebrates-10-years-of-email-sending-and-deliverability/

Amazon Simple Email Service (Amazon SES) turns 10 years old today. Back on January 25th, 2011, Amazon Web Services (AWS) had only 15 services. Today, AWS has grown to over 180 services. Jeff Barr announced Amazon SES as part of his web evangelist blog. Much of what he wrote about then is still true today. Even 10 years later, email is an important channel for customer communications. Developers still want to rely on a trusted global partner to deliver email at scale. However, mailbox providers are even more protective of their end users’ security. They actively work to ensure that any perceived unwanted email doesn’t make it to the inbox.

Inbox providers use several factors to determine the legitimacy of email traffic. Over the last decade, we have worked diligently to measure many of those factors in Amazon SES to help our customers achieve great deliverability. The focus for much of that work has been a combination of investments into reputation, engagement, and trust. I want to outline what we’ve accomplished to improve your email sending over the last 10 years.

Reputation

Reputation is the measurement mailbox providers use to determine how closely you follow their sending standards. Amazon SES measures perceived reputation through metrics such as bounce rate or complaint rate in the reputation dashboard. The reputation dashboard also shares overall Amazon SES account sending status like “Healthy” or “Under Review.” Some Inbox providers, or ISPs, also provide feedback to help us measure the effectiveness of a specific IP or domain in sending trustworthy traffic.

You can influence reputation in Amazon SES through:

  • Setting up dedicated IPs: Set up IPs in Amazon SES for your own specific sending with appropriate warm-up plans. Split IPs out by use case such as separating password resets from marketing messages.
  • Customer owned IPs (New in 2020): You can now transition IPs you’ve invested in through your own data center or with another ESP to Amazon SES without interruption.
  • Following sending volume best practices: Nothing can flag your IP addresses faster than non-predictable sending patterns. We help you manage this through sending quotas.
  • Use our SES email simulator: Test your application sending without messages leaving the sandbox.
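
If you want to keep an eye on the bounce and complaint metrics behind the reputation dashboard programmatically, one simple option is to pull your account’s recent sending statistics with the AWS SDK. A small boto3 sketch follows; the Region and the way the rates are aggregated are illustrative only:

import boto3

ses = boto3.client("ses", region_name="us-east-1")  # Region is illustrative

# GetSendStatistics returns roughly two weeks of sending data points.
points = ses.get_send_statistics()["SendDataPoints"]

attempts = sum(p["DeliveryAttempts"] for p in points)
bounces = sum(p["Bounces"] for p in points)
complaints = sum(p["Complaints"] for p in points)

if attempts:
    print(f"Bounce rate:    {bounces / attempts:.2%}")
    print(f"Complaint rate: {complaints / attempts:.2%}")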

 

Engagement

Engagement is the rate at which customers interact with your content. Amazon SES helps you measure engagement through conversion rates (such as open or click-through) and unsubscribe rates. These are measured in the event publishing click stream. This area is more of an art in our deliverability calculus because success varies by industry and use case.

You can influence engagement in Amazon SES through:

  • Customizing content as much as possible, but follow content best practices to avoid setting off content filters. Mailbox providers often utilize behavioral content filtering using AI to determine if your content is relevant based on engagement behavior.
  • Use consent and list management (New in 2020) with customized topics and opt-out pages. It’s important to offer recipients a way to select what emails they want to receive from you and give them an option to opt-out. This is a great new feature that we’ve added based on customer feedback.
  • Remove recipients that are not engaging from your lists. Some customers use a time limit, for example 60 days, after which unengaged recipients are automatically removed from an active email list.

 

Trust

Earning trust in email sending is done through adherence to proper sending behavior, as measured by both individual ISPs and industry watch groups. Trust is closely related to reputation. We measure trust through messages in the reputation dashboard based on feedback loops, real-time blocklists (RBLs), and spam traps. You can also see the complaint rate associated with your sending in the complaint area of the reputation dashboard. It has statuses like healthy or under review.

You can influence trust in Amazon SES through:

 

Deliverability is a multi-dimensional part of email sending that goes well beyond setting up an SMTP (Simple Mail Transfer Protocol) endpoint, and it comes with constant complexities. But we’re here to help. In addition to these investments in deliverability, we’ve also expanded Amazon SES to 18 regions, including the government cloud. It’s been an exciting time at AWS, and we look forward to supporting all of our customers in the years to come with Amazon SES.

 

 

 

 
