[$] Zig 2024 roadmap

Post Syndicated from daroc original https://lwn.net/Articles/959915/

The Zig language 2024 roadmap was presented in a talk last week on Zig Showtime (a show covering Zig news). Andrew Kelley, the benevolent dictator for life of the Zig project, presented his goals for the language, largely focusing on compiler performance and continuing progress toward stabilization for the language. He discussed details of his plan for incremental compilation, and addressed the sustainability of the project in terms of both code contributions and financial support.

Security updates for Friday

Post Syndicated from corbet original https://lwn.net/Articles/960604/

Security updates have been issued by Debian (chromium, man-db, and openjdk-17), Fedora (chromium, indent, jupyterlab, kernel, and python-notebook), Gentoo (glibc), Oracle (firefox, thunderbird, and tigervnc), Red Hat (rpm), SUSE (cpio, gdb, gstreamer, openconnect, slurm, slurm_18_08, slurm_20_02, slurm_20_11, slurm_22_05, slurm_23_02, squid, webkit2gtk3, and xerces-c), and Ubuntu (imagemagick, xorg-server, and xwayland).

Barcelonagate: Ivayla Bakalova’s testimony about Borislava Yovcheva and Boyko Borisov

Post Syndicated from the Bivol team original https://bivol.bg/barcelonagate-bakalova.html

Friday, February 2, 2024


We are publishing the testimony of Ivayla Bakalova, given in the Barcelonagate case, in which she unequivocally draws a link between Borislava Yovcheva and former Prime Minister Boyko Borisov. After the case was taken away from the supervising…

A Self-Enforcing Protocol to Solve Gerrymandering

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2024/02/a-self-enforcing-protocol-to-solve-gerrymandering.html

In 2009, I wrote:

There are several ways two people can divide a piece of cake in half. One way is to find someone impartial to do it for them. This works, but it requires another person. Another way is for one person to divide the piece, and the other person to complain (to the police, a judge, or his parents) if he doesn’t think it’s fair. This also works, but still requires another person, at least to resolve disputes. A third way is for one person to do the dividing, and for the other person to choose the half he wants.

The point is that unlike protocols that require a neutral third party to complete (arbitrated), or protocols that require that neutral third party to resolve disputes (adjudicated), self-enforcing protocols just work. Cut-and-choose works because neither side can cheat. And while the math can get really complicated, the idea generalizes to multiple people.

Well, someone just solved gerrymandering in this way. Prior solutions required either a bipartisan commission to create fair voting districts (arbitrated) or a judge to approve district boundaries (adjudicated). Their solution is self-enforcing.

And it’s trivial to explain:

  • One party defines a map of equal-population contiguous districts.
  • Then, the second party combines pairs of contiguous districts to create the final map.

It’s not obvious that this solution works. You could imagine that all the districts are defined so that one party has a slight majority. In that case, no combination of pairs will make that map fair. But real-world gerrymandering is never that clean. There’s “cracking,” where a party’s voters are split amongst several districts to dilute its power; and “packing,” where a party’s voters are concentrated in a single district so its influence can be minimized elsewhere. It turns out that this “define-combine procedure” works: the combining party can undo any damage that the defining party does, so the results are fair. The paper has all the details, and they’re fascinating.
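As a rough illustration of the two steps, here is a minimal brute-force sketch in Python. It is not the procedure from the paper: the 2x2 grid of districts, the vote shares, and the function names are all hypothetical, and it simply enumerates every legal pairing of adjacent districts and lets the combining party pick the one that is worst for the defining party.

```python
from typing import Dict, FrozenSet, Iterator, List, Sequence, Set, Tuple

Pairing = List[Tuple[int, int]]


def perfect_matchings(nodes: Sequence[int],
                      adjacent: Set[FrozenSet[int]]) -> Iterator[Pairing]:
    """Yield every way to pair up all districts using only adjacent pairs."""
    if not nodes:
        yield []
        return
    first, rest = nodes[0], list(nodes[1:])
    for partner in rest:
        if frozenset((first, partner)) in adjacent:
            remaining = [n for n in rest if n != partner]
            for tail in perfect_matchings(remaining, adjacent):
                yield [(first, partner)] + tail


def seats_for_a(pairing: Pairing, share_a: Dict[int, int]) -> int:
    """Count merged districts where party A holds a strict majority.

    share_a[d] is A's percentage in district d. Districts have equal
    population, so A wins a merged pair when the two shares sum to > 100.
    """
    return sum(1 for x, y in pairing if share_a[x] + share_a[y] > 100)


def combine(nodes: Sequence[int], adjacent: Set[FrozenSet[int]],
            share_a: Dict[int, int]) -> Pairing:
    """Party B combines: choose the pairing that minimises A's seats."""
    return min(perfect_matchings(nodes, adjacent),
               key=lambda p: seats_for_a(p, share_a))


# Hypothetical map drawn by party A: a 2x2 grid of districts
#   0 1
#   2 3
districts = [0, 1, 2, 3]
adjacent = {frozenset(p) for p in [(0, 1), (2, 3), (0, 2), (1, 3)]}
share_a = {0: 55, 1: 55, 2: 35, 3: 70}  # made-up percentages for party A

final_map = combine(districts, adjacent, share_a)
print(final_map, seats_for_a(final_map, share_a))
```

In this toy map, pairing by rows would give party A both seats, while pairing by columns gives it only one, so the combiner picks the columns. Brute force only works for tiny examples; real maps would need a proper matching algorithm.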

Of course, a theoretical solution is not a political solution. But it’s really neat to have a theoretical solution.

On Second Reading: “The Strange Adventures of a Postman”

Post Syndicated from original https://www.toest.bg/na-vtoro-chetene-strannite-prezhivelitsi-na-edin-poshtalyon/

“The Strange Adventures of a Postman” by Denis Thériault

Translated from the French by Kaloyan Pramatarov, List Publishing, 2019

The book can easily mislead you. Behind the title, which suggested to me a rather quick, light, adventurous, and entertaining read (like the many books about little bookshops and cafés), there is in fact something quite different. It is another kind of novel altogether, or rather

a novella: elegant, poetic, and philosophically deep, with a pinch of mystery and sensual romance.

Written in language charged with an almost old-fashioned aesthetic, it sounds as if it were telling of bygone times rather than of our own day (when the action actually takes place), the markers of which are kept to a minimum. The plot contributes to this as well, built entirely around the exchange of real, paper, handwritten letters, something almost eccentric nowadays, or at the very least a pastime for nostalgics, dreamers, and hopeless romantics.

Such is the protagonist, Bilodo, hidden behind the routine and security of his everyday persona as an ordinary postman in Quebec, diligent and wholly devoted to his profession. The 27-year-old truly seems to belong to another era: he carries out his duties of delivering dull bills and forms with a conscientiousness and zeal that have gone out of use; he follows mapped-out routes and safe habits; he eats at the same place; he has only one friend, a colleague; and modern social life is entirely foreign to him, replaced by hours spent at home with jazz and old computer games. His only transgression, the one breach of all this order, is that before delivering them to his customers, Bilodo reads, copies, and keeps the treasures of the mail: the handwritten letters.

Real letters, written by real people who preferred, over the reptilian coldness of the keyboard and the instantaneity of the internet, the sensual act of writing by hand and the sweet languor of waiting for a reply; people who knew how to write letters, for whom this was a conscious choice, and in some cases one sensed that it was a matter of principle, of taking a stand in favor of a way of life less subject to the race against time and the pursuit of quick results.

Among this sheaf of “one-way” plot threads (since he has only the senders’ texts), which he keeps in folders and whose replies he invents in his head, Bilodo comes upon a correspondence that will obsess him completely: one between a schoolteacher from Guadeloupe and a Canadian professor of literature who have embarked on a so-called renku, or “linked poem”, a tradition that goes back to the poetry contests at the imperial court of medieval Japan.

It is extremely difficult to talk about this novella without immediately running into a plot spoiler. Suffice it to say that the exchange of haiku between the two strangers will take over Bilodo’s mind, and a tragic accident will open up for him an opportunity that leads him to the unfathomable depths and ethereal heights of poetry, that will seize, nurture, and enlighten his words and give him no peace until it has turned them into exquisite offerings on the altar of love. After a series of adventurous moves on his part, all this will turn his life around in a completely unexpected way.

The most valuable thing about “The Strange Adventures of a Postman” is that

the novella pays tribute to the Japanese poetic forms of haiku and tanka, and to the Japanese cultural tradition as a whole.

The text is full of beautiful verses, although they do not strictly observe the 5–7–5 syllable requirement of haiku, nor that of tanka. It is important to note here that this is not an oversight in Kaloyan Pramatarov’s wonderful Bulgarian translation but a decision that is present in the original as well.

Bilodo himself sets off in pursuit of the spirit of haiku bequeathed to us by masters such as Matsuo Bashō, Nagata Kōi, and Kobayashi Issa, namely “the juxtaposition of the immutable and the fleeting”, “the delicate balance between fueki (the immutable, the eternity that exceeds us) and ryūkō (the transient, the ephemeral that pierces us)”.

He wrote, he sought the complicity of words, tried to catch them in flight before they scattered, to trap them like butterflies in the net of the page and pin them to the paper.

The book also discusses a number of other principles, such as wabi (unadorned beauty in harmony with nature) and sabi (simplicity, serenity, solitude). Central to the plot, however, is the Zen symbol ensō, a circle drawn in a single movement of the hand in the complete absence of thought (satori), with the aim of attaining enlightenment. Performing this “spiritual exercise of meditation on nothingness” reveals the artist’s state of mind. Thériault skillfully builds his plot on this idea, on the circle of cyclical time and nature, of the eternal return to the starting point, of the inevitable beginning.

Swirling like water
that has met a rock,
time runs in a circle.

The novella can also be read more broadly as a search for the meaning, dynamics, and philosophy behind the “making” and the spirit of poetry. In his quests and attempts Bilodo passes through a host of questions. Could an ordinary postman become a poet, that is, is poetry for everyone, or, as the Romantics believe, is the poet chosen and inspired by God, set apart from the crowd? Can this art be attained by imitation, by mastering and reproducing its outward forms? Can one learn poetry through a practical approach? What must one do to become and to be a poet? Are immersion in the personality and biography of the great authors, the models, and a close reading of their works the key to the secret of poetry (questions that also open up room for reflection on the kinds of literary criticism and theory)?

And further: is there a chemistry between the right words that can make them transcend themselves? Does the true poet not always prefer the objects of his imagination to those in reality? Must he have superb senses for every detail of the world around us that could “supply” material for poetry:

The caterpillar crawling along the pavement, the lacy arch formed over the street by the branches of the trees, the squirrels tussling between the legs of the garden bench, and the pink panties hung on the clothesline and fluttering in the wind: could poems be made of all this? (…) Must he do something more, make some sacrifice; is there a key he has failed to insert into the intricate lock of the gate that denies him access to poetry?

Bilodo asks himself all these questions on his march toward the heart of the poetic experience, comparable to the climax of orgasm, while fearing that he might damage the very essence of Poetry. All the while he wonders whether to depict the visible and tangible or, pierced by it, “to express grander and more beautiful things, to burst the seams of the garment that constrains him, to reach further, to reveal his heart at last”.

This novella is a lovely little gem: the first book by Denis Thériault published in Bulgarian by List, followed by the novel “The Iguana” (2021, again translated by Kaloyan Pramatarov). In its apparent lightness, where at moments a distant hint of humane mockery and self-ironic exaltation shows through, there takes shape the image of the Eastern poetic tradition and of the Poet, corresponding to his image in Romanticism in the Western tradition. In the end, “The Strange Adventures of a Postman” also leads us along a somewhat forgotten path of love, from the platonic to the sensual, realized entirely through the eroticism and sensory charge of language.


Active donors to Toest receive a standing discount of 20% off the cover price of all titles in the catalogue of List Publishing, as well as of several other Bulgarian publishers, as part of the Toest Readers’ Club partner program. For more information, see toest.bg/club.

None of us reads only the newest books. So why is it only they that get written about? “On Second Reading” is a column in which we open up the lists of books published at least a year ago, read them, and recommend our favorites. The column is part of the Toest Readers’ Club partner program. The choice of titles, however, is solely that of the authors, Stefan Ivanov and Antonia Apostolova, who would recommend these books to you even if they could take a walk with you through the bookshop once every two weeks.

HPE Servers and Kioxia SAS and NVMe SSDs to the ISS in the HPE Spaceborne Computer-2

Post Syndicated from Cliff Robinson original https://www.servethehome.com/hpe-servers-and-kioxia-sas-and-nvme-ssds-to-the-iss-in-the-hpe-spaceborne-computer-2/

HPE servers and Kioxia SAS and NVMe SSDs make their way to the ISS in the HPE Spaceborne Computer-2 after liftoff this week.

The post HPE Servers and Kioxia SAS and NVMe SSDs to the ISS in the HPE Spaceborne Computer-2 appeared first on ServeTheHome.

Thanksgiving 2023 security incident

Post Syndicated from Matthew Prince http://blog.cloudflare.com/author/matthew-prince/ original https://blog.cloudflare.com/thanksgiving-2023-security-incident

On Thanksgiving Day, November 23, 2023, Cloudflare detected a threat actor on our self-hosted Atlassian server. Our security team immediately began an investigation, cut off the threat actor’s access, and on Sunday, November 26, we brought in CrowdStrike’s Forensic team to perform their own independent analysis.

Yesterday, CrowdStrike completed its investigation, and we are publishing this blog post to talk about the details of this security incident.

We want to emphasize to our customers that no Cloudflare customer data or systems were impacted by this event. Because of our access controls, firewall rules, and use of hard security keys enforced using our own Zero Trust tools, the threat actor’s ability to move laterally was limited. No services were implicated, and no changes were made to our global network systems or configuration. This is the promise of a Zero Trust architecture: it’s like bulkheads in a ship where a compromise in one system is limited from compromising the whole organization.

From November 14 to 17, a threat actor did reconnaissance and then accessed our internal wiki (which uses Atlassian Confluence) and our bug database (Atlassian Jira). On November 20 and 21, we saw additional access indicating they may have come back to test access to ensure they had connectivity.

They then returned on November 22 and established persistent access to our Atlassian server using ScriptRunner for Jira, gained access to our source code management system (which uses Atlassian Bitbucket), and tried, unsuccessfully, to access a console server that had access to the data center that Cloudflare had not yet put into production in São Paulo, Brazil.

They did this by using one access token and three service account credentials that had been taken, and that we failed to rotate, after the Okta compromise of October 2023. All threat actor access and connections were terminated on November 24 and CrowdStrike has confirmed that the last evidence of threat activity was on November 24 at 10:44.

(Throughout this blog post all dates and times are UTC.)

Even though we understand the operational impact of the incident to be extremely limited, we took this incident very seriously because a threat actor had used stolen credentials to get access to our Atlassian server and accessed some documentation and a limited amount of source code. Based on our collaboration with colleagues in the industry and government, we believe that this attack was performed by a nation state attacker with the goal of obtaining persistent and widespread access to Cloudflare’s global network.

“Code Red” Remediation and Hardening Effort

On November 24, after the threat actor was removed from our environment, our security team pulled in all the people they needed across the company to investigate the intrusion and ensure that the threat actor had been completely denied access to our systems, and to ensure we understood the full extent of what they accessed or tried to access.

Then, from November 27, we redirected the efforts of a large part of the Cloudflare technical staff (inside and outside the security team) to work on a single project dubbed “Code Red”. The focus was strengthening, validating, and remediating any control in our environment to ensure we are secure against future intrusion and to validate that the threat actor could not gain access to our environment. Additionally, we continued to investigate every system, account and log to make sure the threat actor did not have persistent access and that we fully understood what systems they had touched and which they had attempted to access.

CrowdStrike performed an independent assessment of the scope and extent of the threat actor’s activity, including a search for any evidence that they still persisted in our systems. CrowdStrike’s investigation provided helpful corroboration and support for our investigation, but did not bring to light any activities that we had missed. This blog post outlines in detail everything we and CrowdStrike uncovered about the activity of the threat actor.

The only production system the threat actor could access using the stolen credentials was our Atlassian environment. Analysis of the wiki pages, bug database issues, and source code repositories they accessed suggests they were looking for information about the architecture, security, and management of our global network, no doubt with an eye on gaining a deeper foothold. Because of that, we decided a huge effort was needed to further harden our security protocols to prevent the threat actor from being able to get that foothold had we overlooked something from our log files.

Our aim was to prevent the attacker from using the technical information about the operations of our network as a way to get back in. Even though we believed, and later confirmed, the attacker had limited access, we undertook a comprehensive effort to rotate every production credential (more than 5,000 individual credentials), physically segment test and staging systems, perform forensic triage on 4,893 systems, and reimage and reboot every machine in our global network, including all the systems the threat actor accessed and all Atlassian products (Jira, Confluence, and Bitbucket).

The threat actor also attempted to access a console server in our new, and not yet in production, data center in São Paulo. All attempts to gain access were unsuccessful. To ensure these systems are 100% secure, equipment in the Brazil data center was returned to the manufacturers. The manufacturers’ forensic teams examined all of our systems to ensure that no access or persistence was gained. Nothing was found, but we replaced the hardware anyway.

We also looked for software packages that hadn’t been updated, user accounts that might have been created, and unused active employee accounts; we went searching for secrets that might have been left in Jira tickets or source code; and we examined and deleted all HAR files uploaded to the wiki in case they contained tokens of any sort. Whenever in doubt, we assumed the worst and made changes to ensure anything the threat actor was able to access would no longer be in use and therefore no longer be valuable to them.

Every member of the team was encouraged to point out areas the threat actor might have touched, so we could examine log files and determine the extent of the threat actor’s access. By including such a large number of people across the company, we aimed to leave no stone unturned looking for evidence of access or changes that needed to be made to improve security.

The immediate “Code Red” effort ended on January 5, but work continues across the company around credential management, software hardening, vulnerability management, additional alerting, and more.

Attack timeline

The attack started in October with the compromise of Okta, but the threat actor only began targeting our systems using those credentials from the Okta compromise in mid-November.

The following timeline shows the major events:

October 18 – Okta compromise

We’ve written about this before but, in summary, we were (for the second time) the victim of a compromise of Okta’s systems which resulted in a threat actor gaining access to a set of credentials. These credentials were meant to all be rotated.

Unfortunately, we failed to rotate one service token and three service accounts (out of thousands of credentials) that were leaked during the Okta compromise.

One was a Moveworks service token that granted remote access into our Atlassian system. The second credential was a service account used by the SaaS-based Smartsheet application that had administrative access to our Atlassian Jira instance. The third was a Bitbucket service account used to access our source code management system, and the fourth gave access to an AWS environment that had no access to the global network and no customer or sensitive data.

The one service token and three accounts were not rotated because they were mistakenly believed to be unused. This was incorrect and was how the threat actor first got into our systems and gained persistence in our Atlassian products. Note that this was in no way an error on the part of AWS, Moveworks or Smartsheet. These were merely credentials which we failed to rotate.
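The failure mode described here, skipping credentials because they were believed to be unused, can be guarded against with a check that ignores that belief entirely. The sketch below is only an illustration: the `Credential` record, the credential names, and the rotation dates are hypothetical and do not describe Cloudflare's actual inventory or tooling.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Credential:
    name: str
    last_rotated: date
    believed_unused: bool = False  # recorded, but never used to skip rotation


def needs_rotation(creds: List[Credential], exposed_on: date) -> List[Credential]:
    """Return every credential not rotated on or after the exposure date.

    The believed_unused flag is deliberately ignored: an exposed credential
    stays dangerous until it is rotated or revoked, whatever we assume
    about its usage.
    """
    return [c for c in creds if c.last_rotated < exposed_on]


# Hypothetical inventory; names and dates are made up for illustration.
inventory = [
    Credential("moveworks-service-token", date(2023, 6, 1), believed_unused=True),
    Credential("smartsheet-service-account", date(2023, 3, 15), believed_unused=True),
    Credential("deploy-key", date(2023, 10, 20)),
]

stale = needs_rotation(inventory, exposed_on=date(2023, 10, 18))
print([c.name for c in stale])
```

With this made-up inventory, the two credentials marked as unused are still reported, because neither was rotated after the October 18 exposure date.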

November 14 09:22:49 – threat actor starts probing

Our logs show that the threat actor started probing and performing reconnaissance of our systems beginning on November 14, looking for a way to use the credentials and checking which systems were accessible. They attempted to log into our Okta instance and were denied access. They attempted access to the Cloudflare Dashboard and were denied access.

Additionally, the threat actor accessed an AWS environment that is used to power the Cloudflare Apps marketplace. This environment was segmented with no access to global network or customer data. The service account to access this environment was revoked, and we validated the integrity of the environment.

November 15 16:28:38 – threat actor gains access to Atlassian services

The threat actor successfully accessed Atlassian Jira and Confluence on November 15 using the Moveworks service token to authenticate through our gateway, and then they used the Smartsheet service account to gain access to the Atlassian suite. The next day they began looking for information about the configuration and management of our global network, and accessed various Jira tickets.

The threat actor searched the wiki for things like remote access, secret, client-secret, openconnect, cloudflared, and token. They accessed 36 Jira tickets (out of a total of 2,059,357 tickets) and 202 wiki pages (out of a total of 194,100 pages).

The threat actor accessed Jira tickets about vulnerability management, secret rotation, MFA bypass, network access, and even our response to the Okta incident itself.

The wiki searches and pages accessed suggest the threat actor was very interested in all aspects of access to our systems: password resets, remote access, configuration, our use of Salt, but they did not target customer data or customer configurations.

November 16 14:36:37 – threat actor creates an Atlassian user account

The threat actor used the Smartsheet credential to create an Atlassian account that looked like a normal Cloudflare user. They added this user to a number of groups within Atlassian so that they’d have persistent access to the Atlassian environment should the Smartsheet service account be removed.

November 17 14:33:52 to November 20 09:26:53 – threat actor takes a break from accessing Cloudflare systems

During this period, the attacker took a break from accessing our systems (apart from apparently briefly testing that they still had access) and returned just before Thanksgiving.

November 22 14:18:22 – threat actor gains persistence

Since the Smartsheet service account had administrative access to Atlassian Jira, the threat actor was able to install the Sliver Adversary Emulation Framework, which is a widely used tool and framework that red teams and attackers use to enable “C2” (command and control) connectivity, gaining persistent and stealthy access to a computer on which it is installed. Sliver was installed using the ScriptRunner for Jira plugin.

This allowed them continuous access to the Atlassian server, and they used this to attempt lateral movement. With this access the threat actor attempted to gain access to a non-production console server in our São Paulo, Brazil data center due to a non-enforced ACL. The access was denied, and they were not able to access any of the global network.

Over the next day, the threat actor viewed 120 code repositories (out of a total of 11,904 repositories). Of the 120, the threat actor used the Atlassian Bitbucket git archive feature on 76 repositories to download them to the Atlassian server, and even though we were not able to confirm whether or not they had been exfiltrated, we decided to treat them as having been exfiltrated.
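To put the counts quoted in this timeline in proportion, the following short Python sketch computes each as a share of its total. The numbers are taken from the post; the labels and the `share` helper are just illustrative names.

```python
from typing import Dict, Tuple

# (accessed, total) pairs as quoted in the timeline
ACCESSED: Dict[str, Tuple[int, int]] = {
    "Jira tickets": (36, 2_059_357),
    "wiki pages": (202, 194_100),
    "code repositories viewed": (120, 11_904),
    "viewed repositories archived": (76, 120),
}


def share(count: int, total: int) -> float:
    """Fraction of the total that was accessed."""
    return count / total


for label, (count, total) in ACCESSED.items():
    print(f"{label}: {count:,} of {total:,} ({share(count, total):.4%})")
```

The threat actor viewed about 1% of all repositories and archived roughly 63% of the ones they viewed, while the Jira and wiki figures are far smaller fractions of their totals.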

The 76 source code repositories were almost all related to how backups work, how the global network is configured and managed, how identity works at Cloudflare, remote access, and our use of Terraform and Kubernetes. A small number of the repositories contained encrypted secrets which were rotated immediately even though they were strongly encrypted themselves.

We focused particularly on these 76 source code repositories to look for embedded secrets (secrets stored in the code were rotated), vulnerabilities, and ways in which an attacker could use them to mount a subsequent attack. This work was done as a priority by engineering teams across the company as part of “Code Red”.

As a SaaS company, we’ve long believed that our source code itself is not as precious as the source code of software companies that distribute software to end users. In fact, we’ve open sourced a large amount of our source code and speak openly through our blog about algorithms and techniques we use. So our focus was not on someone having access to the source code, but whether that source code contained embedded secrets (such as a key or token) and vulnerabilities.
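A search for embedded secrets of the kind described here is often done with pattern matching over file contents. The following is a minimal sketch, not Cloudflare's tooling: the three patterns and the sample text are illustrative assumptions, and a real scanner would use a much larger rule set plus entropy checks.

```python
import re
from typing import Dict, List, Pattern, Tuple

# Illustrative patterns only; real secret scanners ship many more rules.
PATTERNS: Dict[str, Pattern[str]] = {
    # AWS access key IDs start with "AKIA" followed by 16 characters.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
    # key = "value" style assignments with a quoted value of 8+ characters.
    "assigned_secret": re.compile(
        r"(?i)\b(?:api[_-]?key|secret|token|password)\b\s*[:=]\s*['\"]([^'\"]{8,})['\"]"),
}


def scan_text(text: str) -> List[Tuple[int, str]]:
    """Return (line number, rule name) for every line matching a pattern."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings


# Made-up file content; none of these values are real credentials.
sample = (
    'region = "us-east-1"\n'
    'api_key = "abcd1234efgh5678"\n'
    "AWS_KEY=AKIAABCDEFGHIJKLMNOP\n"
)
print(scan_text(sample))
```

Anything such a scan reports would then be rotated, which matches the approach in the post of treating every embedded secret as compromised.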

November 23 – Discovery and threat actor access termination begins

Our security team was alerted to the threat actor’s presence at 16:00 and deactivated the Smartsheet service account 35 minutes later. 48 minutes later the user account created by the threat actor was found and deactivated. Here’s the detailed timeline for the major actions taken to block the threat actor once the first alert was raised.

15:58 – The threat actor adds the Smartsheet service account to an administrator group.
16:00 – Automated alert about the change at 15:58 to our security team.
16:12 – Cloudflare SOC starts investigating the alert.
16:35 – Smartsheet service account deactivated by Cloudflare SOC.
17:23 – The threat actor-created Atlassian user account is found and deactivated.
17:43 – Internal Cloudflare incident declared.
21:31 – Firewall rules put in place to block the threat actor’s known IP addresses.
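
The intervals quoted above (35 minutes to deactivate the service account, then 48 minutes to find the attacker-created account) can be checked directly from the listed times. This is a small sketch that assumes all events fall on the same UTC day, as they do in this list; the event labels are shortened paraphrases.

```python
from datetime import datetime

# Times (UTC, November 23) as listed in the timeline above.
EVENTS = [
    ("15:58", "service account added to administrator group"),
    ("16:00", "automated alert"),
    ("16:12", "SOC starts investigating"),
    ("16:35", "Smartsheet service account deactivated"),
    ("17:23", "attacker-created account deactivated"),
    ("17:43", "incident declared"),
    ("21:31", "firewall rules in place"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, both HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


alert_time = "16:00"
for time, label in EVENTS:
    print(f"{time}  {minutes_between(alert_time, time):+4d} min from alert  {label}")
```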

November 24 – Sliver removed; all threat actor access terminated

10:44 – Last known threat actor activity.
11:59 – Sliver removed.

Throughout this timeline, the threat actor tried to access a myriad of other systems at Cloudflare but failed because of our access controls, firewall rules, and use of hard security keys enforced using our own Zero Trust tools.

To be clear, we saw no evidence whatsoever that the threat actor got access to our global network, data centers, SSL keys, customer databases or configuration information, Cloudflare Workers deployed by us or customers, AI models, network infrastructure, or any of our datastores like Workers KV, R2 or Quicksilver. Their access was limited to the Atlassian suite and the server on which our Atlassian runs.

A large part of our “Code Red” effort was understanding what the threat actor got access to and what they tried to access. By looking at logging across systems we were able to track attempted access to our internal metrics, network configuration, build system, alerting systems, and release management system. Based on our review, none of their attempts to access these systems were successful. Independently, CrowdStrike performed an assessment of the scope and extent of the threat actor’s activity, which did not bring to light activities that we had missed and concluded that the last evidence of threat activity was on November 24 at 10:44.

We are confident that between our investigation and CrowdStrike’s, we fully understand the threat actor’s actions and that they were limited to the systems on which we saw their activity.

Conclusion

This was a security incident involving a sophisticated actor, likely a nation-state, who operated in a thoughtful and methodical manner. The efforts we have taken ensure that the ongoing impact of the incident was limited and that we are well-prepared to fend off any sophisticated attacks in the future. This required the efforts of a significant number of Cloudflare’s engineering staff, and, for over a month, this was the highest priority at Cloudflare. The entire Cloudflare team worked to ensure that our systems were secure and that the threat actor’s access was understood, to remediate immediate priorities (such as mass credential rotation), and to build a plan of long-running work to improve our overall security based on areas for improvement discovered during this process.

I am incredibly grateful to everyone at Cloudflare who responded quickly over the Thanksgiving holiday to conduct an initial analysis and lock out the threat actor, and to all those who contributed to this effort. It would be impossible to name everyone involved, but their long hours and dedicated work made it possible to undertake an essential review and change of Cloudflare’s security while keeping our global network running and our customers’ service running.

We are grateful to CrowdStrike for having been available immediately to conduct an independent assessment. Now that their final report is complete, we are confident in our internal analysis and remediation of the intrusion and are making this blog post available.

IOCs
Below are the Indicators of Compromise (IOCs) that we saw from this threat actor. We are publishing them so that other organizations, and especially those that may have been impacted by the Okta breach, can search their logs to confirm the same threat actor did not access their systems.

Indicator | Indicator Type | SHA256 | Description
193.142.58[.]126 | IPv4 | N/A | Primary threat actor infrastructure, owned by M247 Europe SRL (Bucharest, Romania)
198.244.174[.]214 | IPv4 | N/A | Sliver C2 server, owned by OVH SAS (London, England)
idowall[.]com | Domain | N/A | Infrastructure serving Sliver payload
jvm-agent | Filename | bdd1a085d651082ad567b03e5186d1d46d822bb7794157ab8cce95d850a3caaf | Sliver payload
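
For organizations that want to search their own logs for the network indicators above, a minimal Python sketch follows. It "refangs" the defanged indicators (`[.]` back to `.`) and scans lines for them. The log format and sample lines are hypothetical, and the file hash is not covered; checking it would mean hashing candidate files separately.

```python
import re
from typing import Iterable, List, Pattern, Tuple

# Network indicators from the table above, in defanged form.
DEFANGED_IOCS = ["193.142.58[.]126", "198.244.174[.]214", "idowall[.]com"]


def refang(indicator: str) -> str:
    """Turn a defanged indicator such as 1.2.3[.]4 back into 1.2.3.4."""
    return indicator.replace("[.]", ".")


def build_matcher(indicators: Iterable[str]) -> Pattern[str]:
    """Compile one regex matching any indicator as a whole token.

    The lookarounds stop 193.142.58.126 from matching inside
    193.142.58.1267, while still allowing subdomains of a listed domain.
    """
    escaped = "|".join(re.escape(refang(i)) for i in indicators)
    return re.compile(r"(?<![\w-])(?:" + escaped + r")(?![\w-])")


def scan(lines: Iterable[str], matcher: Pattern[str]) -> List[Tuple[int, str]]:
    """Return (line number, matched indicator) for every hit."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for match in matcher.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


# Hypothetical log lines in a made-up format.
log = [
    "2024-01-01T00:00:00Z src=193.142.58.126 GET /login",
    "2024-01-01T00:00:01Z src=10.0.0.5 GET /health",
    "2024-01-01T00:00:02Z dns query cdn.idowall.com",
    "2024-01-01T00:00:03Z src=193.142.58.1267 malformed",
]
print(scan(log, build_matcher(DEFANGED_IOCS)))
```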

Combine transactional, streaming, and third-party data on Amazon Redshift for financial services

Post Syndicated from Satesh Sonti original https://aws.amazon.com/blogs/big-data/combine-transactional-streaming-and-third-party-data-on-amazon-redshift-for-financial-services/

Financial services customers use data from different sources that originate at different frequencies, including real-time, batch, and archived datasets. Additionally, they need streaming architectures to handle growing trade volumes, market volatility, and regulatory demands. The following are some of the key business use cases that highlight this need:

  • Trade reporting – Since the global financial crisis of 2007–2008, regulators have increased their demands and scrutiny on regulatory reporting. Regulators have placed an increased focus to both protect the consumer through transaction reporting (typically T+1, meaning 1 business day after the trade date) and increase transparency into markets via near-real-time trade reporting requirements.
  • Risk management – As capital markets become more complex and regulators launch new risk frameworks, such as Fundamental Review of the Trading Book (FRTB) and Basel III, financial institutions are looking to increase the frequency of calculations for overall market risk, liquidity risk, counter-party risk, and other risk measurements, and want to get as close to real-time calculations as possible.
  • Trade quality and optimization – In order to monitor and optimize trade quality, you need to continually evaluate market characteristics such as volume, direction, market depth, fill rate, and other benchmarks related to the completion of trades. Trade quality is not only related to broker performance, but is also a requirement from regulators, starting with MiFID II.

The challenge is to come up with a solution that can handle these disparate sources, varied frequencies, and low-latency consumption requirements. The solution should be scalable, cost-efficient, and straightforward to adopt and operate. Amazon Redshift features like streaming ingestion, Amazon Aurora zero-ETL integration, and data sharing with AWS Data Exchange enable near-real-time processing for trade reporting, risk management, and trade optimization.

In this post, we provide a solution architecture that describes how you can process data from three different types of sources—streaming, transactional, and third-party reference data—and aggregate them in Amazon Redshift for business intelligence (BI) reporting.

Solution overview

This solution architecture is created prioritizing a low-code/no-code approach with the following guiding principles:

  • Ease of use – It should be less complex to implement and operate with intuitive user interfaces
  • Scalable – You should be able to seamlessly increase and decrease capacity on demand
  • Native integration – Components should integrate without additional connectors or software
  • Cost-efficient – It should deliver balanced price/performance
  • Low maintenance – It should require less management and operational overhead

The following diagram illustrates the solution architecture and how these guiding principles were applied to the ingestion, aggregation, and reporting components.

Deploy the solution

You can use the following AWS CloudFormation template to deploy the solution.

Launch Cloudformation Stack

This stack creates the following resources and necessary permissions to integrate the services:

Ingestion

To ingest data, you use Amazon Redshift Streaming Ingestion to load streaming data from the Kinesis data stream. For transactional data, you use the Redshift zero-ETL integration with Amazon Aurora MySQL. For third-party reference data, you take advantage of AWS Data Exchange data shares. These capabilities allow you to quickly build scalable data pipelines because, as your data grows, you can increase the capacity of Kinesis Data Streams shards, the compute for zero-ETL sources and targets, and the Redshift compute for data shares. Redshift streaming ingestion and zero-ETL integration are low-code/no-code solutions that you can build with simple SQL statements without investing significant time and money into developing complex custom code.

For the data used to create this solution, we partnered with FactSet, a leading financial data, analytics, and open technology provider. FactSet has several datasets available in the AWS Data Exchange marketplace, which we used for reference data. We also used FactSet’s market data solutions for historical and streaming market quotes and trades.

Processing

Data is processed in Amazon Redshift adhering to an extract, load, and transform (ELT) methodology. With virtually unlimited scale and workload isolation, ELT is more suited for cloud data warehouse solutions.

You use Redshift streaming ingestion to ingest streaming quotes (bid/ask) in real time from the Kinesis data stream directly into a streaming materialized view. In the next step, you process the data, using PartiQL to parse the data stream inputs. Note that streaming materialized views differ from regular materialized views in how auto refresh works and in the data management SQL commands used. Refer to Streaming ingestion considerations for details.

You use the zero-ETL Aurora integration for ingesting transactional data (trades) from OLTP sources. Refer to Working with zero-ETL integrations for currently supported sources. You can combine data from all these sources using views, and use stored procedures to implement business transformation rules like calculating weighted averages across sectors and exchanges.

Historical trade and quote data volumes are huge and often not queried frequently. You can use Amazon Redshift Spectrum to access this data in place without loading it into Amazon Redshift. You create external tables pointing to data in Amazon Simple Storage Service (Amazon S3) and query similarly to how you query any other local table in Amazon Redshift. Multiple Redshift data warehouses can concurrently query the same datasets in Amazon S3 without the need to make copies of the data for each data warehouse. This feature simplifies accessing external data without writing complex ETL processes and enhances the ease of use of the overall solution.

Let’s review a few sample queries used for analyzing quotes and trades. We use the following tables in the sample queries:

  • dt_hist_quotes – Historical quotes data containing bid price and volume, ask price and volume, and exchanges and sectors. You should use relevant datasets in your organization that contain these data attributes.
  • dt_hist_trades – Historical trades data containing traded price, volume, sector, and exchange details. You should use relevant datasets in your organization that contain these data attributes.
  • factset_sector_map – Mapping between sectors and exchanges. You can obtain this from the FactSet Fundamentals ADX dataset.

Sample query for analyzing historical quotes

You can use the following query to find weighted average spreads on quotes:

select
    date_dt :: date,
    case
        when exchange_name like 'Cboe%' then 'CBOE'
        when exchange_name like 'NYSE%' then 'NYSE'
        when exchange_name like 'New York Stock Exchange' then 'NYSE'
        when exchange_name like 'Nasdaq%' then 'NASDAQ'
    end as parent_exchange_name,
    sector_name,
    -- parenthesized so the cast applies to the whole ratio, not only to sum(weight)
    (sum(spread * weight) / sum(weight)) :: decimal(30,5) as weighted_average_spread
from
(
    select
        date_dt,
        exchange_name,
        factset_sector_desc as sector_name,
        -- weight each quote by its bid plus ask dollar volume
        ((bid_price * bid_volume) + (ask_price * ask_volume)) as weight,
        -- relative spread, as a fraction of the ask price
        ((ask_price - bid_price) / ask_price) as spread
    from dt_hist_quotes a
    join fds_adx_fundamentals_db.ref_v2.factset_sector_map b
        on (a.sector_code = b.factset_sector_code)
    where ask_price <> 0 and bid_price <> 0
) quotes
group by 1, 2, 3

Sample query for analyzing historical trades

You can use the following query to find the dollar volume of trades by detailed exchange, by sector, and by major exchange (NYSE and Nasdaq):

select
    cast(date_dt as date) as date_dt,
    case
        when exchange_name like 'Cboe%' then 'CBOE'
        when exchange_name like 'NYSE%' then 'NYSE'
        when exchange_name like 'New York Stock Exchange' then 'NYSE'
        when exchange_name like 'Nasdaq%' then 'NASDAQ'
    end as parent_exchange_name,
    factset_sector_desc as sector_name,
    sum((price * volume) :: decimal(30,4)) as total_transaction_amt
from dt_hist_trades a
join fds_adx_fundamentals_db.ref_v2.factset_sector_map b
    on (a.sector_code = b.factset_sector_code)
group by 1, 2, 3

Reporting

You can use Amazon QuickSight and Amazon Managed Grafana for BI and real-time reporting, respectively. These services natively integrate with Amazon Redshift without the need to use additional connectors or software in between.

You can run a direct query from QuickSight for BI reporting and dashboards. With QuickSight, you can also locally store data in the SPICE cache with auto refresh for low latency. Refer to Authorizing connections from Amazon QuickSight to Amazon Redshift clusters for comprehensive details on how to integrate QuickSight with Amazon Redshift.

You can use Amazon Managed Grafana for near-real-time trade dashboards that are refreshed every few seconds. The real-time dashboards for monitoring the trade ingestion latencies are created using Grafana and the data is sourced from system views in Amazon Redshift. Refer to Using the Amazon Redshift data source to learn about how to configure Amazon Redshift as a data source for Grafana.

The users who interact with regulatory reporting systems include analysts, risk managers, operators, and other personas that support business and technology operations. Apart from generating regulatory reports, these teams require visibility into the health of the reporting systems.

Historical quotes analysis

In this section, we explore some examples of historical quotes analysis from the Amazon QuickSight dashboard.

Weighted average spread by sectors

The following chart shows the daily aggregation by sector of the weighted average bid-ask spreads of all the individual trades on NASDAQ and NYSE for 3 months. To calculate the average daily spread, each spread is weighted by the sum of the bid and the ask dollar volume. The query to generate this chart processes 103 billion data points in total, joins each trade with the sector reference table, and runs in less than 10 seconds.
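
To make the weighting concrete, the following is a small Python sketch of the same arithmetic that the sample query for historical quotes performs in SQL. It is only an illustration: the function name and the two sample quotes are made up, and the real calculation runs inside Amazon Redshift. Each quote's relative spread, (ask − bid) / ask, is weighted by its bid dollar volume plus its ask dollar volume.

```python
def weighted_average_spread(quotes):
    """Compute the dollar-volume-weighted average relative spread.

    quotes: iterable of (bid_price, bid_volume, ask_price, ask_volume).
    Returns None when no quote has a usable (non-zero) bid and ask price.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for bid_price, bid_volume, ask_price, ask_volume in quotes:
        if bid_price == 0 or ask_price == 0:
            continue  # same filter as the WHERE clause in the sample query
        spread = (ask_price - bid_price) / ask_price
        weight = bid_price * bid_volume + ask_price * ask_volume
        weighted_sum += spread * weight
        total_weight += weight
    if total_weight == 0:
        return None
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Two hypothetical quotes: a liquid one with a 1% spread and a thin one with a 2% spread.
    sample_quotes = [
        (99.0, 100, 100.0, 100),
        (49.0, 10, 50.0, 10),
    ]
    print(weighted_average_spread(sample_quotes))
```

Because the first quote carries roughly twenty times the dollar volume of the second, the result lands close to 1% rather than at the simple average of 1.5%.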

Weighted average spread by exchanges

The following chart shows the daily aggregation of the weighted average bid-ask spreads of all the individual trades on NASDAQ and NYSE for 3 months. The calculation methodology and query performance metrics are similar to those of the preceding chart.

Historical trades analysis

In this section, we explore some examples of historical trades analysis from the Amazon QuickSight dashboard.

Trade volumes by sector

The following chart shows the daily aggregation by sector of all the individual trades on NASDAQ and NYSE for 3 months. The query to generate this chart processes 3.6 billion trades in total, joins each trade with the sector reference table, and runs in under 5 seconds.

Trade volumes for major exchanges

The following chart shows the daily aggregation by exchange group of all the individual trades for 3 months. The query to generate this chart performs similarly to the query for the preceding chart.

Real-time dashboards

Monitoring and observability are important requirements for any critical business application such as trade reporting, risk management, and trade management systems. Apart from system-level metrics, it’s also important to monitor key performance indicators in real time so that operators can be alerted and respond as soon as possible to business-impacting events. For this demonstration, we have built dashboards in Grafana that monitor the delay of quote and trade data from the Kinesis data stream and Aurora, respectively.

The quote ingestion delay dashboard shows the amount of time it takes for each quote record to be ingested from the data stream and be available for querying in Amazon Redshift.

The trade ingestion delay dashboard shows the amount of time it takes for a transaction in Aurora to become available in Amazon Redshift for querying.

Clean up

To clean up your resources, delete the stack you deployed using AWS CloudFormation. For instructions, refer to Deleting a stack on the AWS CloudFormation console.

Conclusion

Increasing volumes of trading activity, more complex risk management, and enhanced regulatory requirements are leading capital markets firms to embrace real-time and near-real-time data processing, even in mid- and back-office platforms where end of day and overnight processing was the standard. In this post, we demonstrated how you can use Amazon Redshift capabilities for ease of use, low maintenance, and cost-efficiency. We also discussed cross-service integrations to ingest streaming market data, process updates from OLTP databases, and use third-party reference data without having to perform complex and expensive ETL or ELT processing before making the data available for analysis and reporting.

Please reach out to us if you need any guidance in implementing this solution. Refer to Real-time analytics with Amazon Redshift streaming ingestion, Getting started guide for near-real time operational analytics using Amazon Aurora zero-ETL integration with Amazon Redshift, and Working with AWS Data Exchange data shares as a producer for more information.


About the Authors

Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specialized in building enterprise data platforms, data warehousing, and analytics solutions. He has over 18 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.

Alket Memushaj works as a Principal Architect in the Financial Services Market Development team at AWS. Alket is responsible for technical strategy for capital markets, working with partners and customers to deploy applications across the trade lifecycle to the AWS Cloud, including market connectivity, trading systems, and pre- and post-trade analytics and research platforms.

Ruben Falk is a Capital Markets Specialist focused on AI and data & analytics. Ruben consults with capital markets participants on modern data architecture and systematic investment processes. He joined AWS from S&P Global Market Intelligence where he was Global Head of Investment Management Solutions.

Jeff Wilson is a World-wide Go-to-market Specialist with 15 years of experience working with analytic platforms. His current focus is sharing the benefits of using Amazon Redshift, Amazon’s native cloud data warehouse. Jeff is based in Florida and has been with AWS since 2019.

Preprocess and fine-tune LLMs quickly and cost-effectively using Amazon EMR Serverless and Amazon SageMaker

Post Syndicated from Shijian Tang original https://aws.amazon.com/blogs/big-data/preprocess-and-fine-tune-llms-quickly-and-cost-effectively-using-amazon-emr-serverless-and-amazon-sagemaker/

Large language models (LLMs) are becoming increasingly popular, with new use cases constantly being explored. In general, you can build applications powered by LLMs by incorporating prompt engineering into your code. However, there are cases where prompting an existing LLM falls short. This is where model fine-tuning can help. Prompt engineering is about guiding the model’s output by crafting input prompts, whereas fine-tuning is about training the model on custom datasets to make it better suited for specific tasks or domains.

Before you can fine-tune a model, you need to find a task-specific dataset. One dataset that is commonly used is the Common Crawl dataset. The Common Crawl corpus contains petabytes of data, regularly collected since 2008, and contains raw webpage data, metadata extracts, and text extracts. In addition to choosing a dataset, you need to cleanse and process the data to suit the specific needs of the fine-tuning task.

We recently worked with a customer who wanted to preprocess a subset of the latest Common Crawl dataset and then fine-tune their LLM with cleaned data. The customer was looking for how they could achieve this in the most cost-effective way on AWS. After discussing the requirements, we recommended using Amazon EMR Serverless as their platform for data preprocessing. EMR Serverless is well suited for large-scale data processing and eliminates the need for infrastructure maintenance. In terms of cost, it only charges based on the resources and duration used for each job. The customer was able to preprocess hundreds of TBs of data within a week using EMR Serverless. After they preprocessed the data, they used Amazon SageMaker to fine-tune the LLM.

In this post, we walk you through the customer’s use case and architecture used.

Solution overview

In the following sections, we first introduce the Common Crawl dataset and how to explore and filter the data we need. Amazon Athena only charges for the data size it scans and is used to explore and filter the data quickly, while being cost-effective. EMR Serverless provides a cost-efficient and no-maintenance option for Spark data processing, and is used to process the filtered data. Next, we use Amazon SageMaker JumpStart to fine-tune the Llama 2 model with the preprocessed dataset. SageMaker JumpStart provides a set of solutions for the most common use cases that can be deployed with just a few clicks. You don’t need to write any code to fine-tune an LLM such as Llama 2. Finally, we deploy the fine-tuned model using Amazon SageMaker and compare the differences in text output for the same question between the original and fine-tuned Llama 2 models.

The following diagram illustrates the architecture of this solution.

Prerequisites

Before you dive deep into the solution details, complete the following prerequisite steps:

  1. Create an Amazon Simple Storage Service (Amazon S3) bucket to store the cleaned dataset. For instructions, refer to Create your first S3 bucket.
  2. Set up Athena to run interactive SQL.
  3. Create an EMR Serverless environment.
  4. Prepare Amazon SageMaker Studio to fine-tune your LLM and run Jupyter notebooks. For instructions, refer to Get started.

The Common Crawl dataset

Common Crawl is an open corpus dataset obtained by crawling over 50 billion webpages. It includes massive amounts of unstructured data in multiple languages, starting from 2008 and reaching the petabyte level. It is continuously updated.

In the training of GPT-3, the Common Crawl dataset accounts for 60% of its training data, as shown in the following diagram (source: Language Models are Few-Shot Learners).

Another important dataset worth mentioning is the C4 dataset. C4, short for Colossal Clean Crawled Corpus, is a dataset derived from postprocessing the Common Crawl dataset. In Meta’s LLaMA paper, they outlined the datasets used, with Common Crawl accounting for 67% (utilizing 3.3 TB of data) and C4 for 15% (utilizing 783 GB of data). The paper emphasizes the significance of incorporating differently preprocessed data for enhancing model performance. Despite the original C4 data being part of Common Crawl, Meta opted for the reprocessed version of this data.

In this section, we cover common ways to interact with, filter, and process the Common Crawl dataset.

Common Crawl data

The Common Crawl raw dataset includes three types of data files: raw webpage data (WARC), metadata (WAT), and text extraction (WET).

Data collected after 2013 is stored in WARC format and includes corresponding metadata (WAT) and text extraction data (WET). The dataset is located in Amazon S3, updated on a monthly basis, and can be accessed directly through AWS Marketplace.

For example, the following snippet is data from June of 2023:

$ aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2023-23/
                           PRE segments/
2023-06-21 00:34:08       2164 cc-index-table.paths.gz
2023-06-21 00:34:08        637 cc-index.paths.gz
2023-06-21 05:52:05       2724 index.html
2023-06-21 00:34:09     161064 non200responses.paths.gz
2023-06-21 00:34:10     160888 robotstxt.paths.gz
2023-06-21 00:34:10        480 segment.paths.gz
2023-06-21 00:34:11     161082 warc.paths.gz
2023-06-21 00:34:12     160895 wat.paths.gz
2023-06-21 00:34:12     160898 wet.paths.gz
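
Each of the *.paths.gz files in this listing is a gzip-compressed text file with one object key per line, relative to the root of the commoncrawl bucket. The following Python sketch shows one way to expand such a listing into full URIs. The helper names are hypothetical, and the file name in the usage example is a placeholder rather than a real Common Crawl object; the HTTPS prefix is Common Crawl's public download endpoint.

```python
import gzip
import io

S3_PREFIX = "s3://commoncrawl/"
HTTPS_PREFIX = "https://data.commoncrawl.org/"


def read_paths(gz_bytes):
    """Decode the bytes of a *.paths.gz file into a list of relative keys."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def to_uris(paths, prefix=S3_PREFIX):
    """Prefix each relative key so that it becomes a full URI."""
    return [prefix + path.lstrip("/") for path in paths]


if __name__ == "__main__":
    # Placeholder listing; in practice, pass the bytes of a downloaded warc.paths.gz.
    listing = gzip.compress(
        b"crawl-data/CC-MAIN-2023-23/segments/1685224643388.45/warc/example-00000.warc.gz\n"
    )
    for uri in to_uris(read_paths(listing)):
        print(uri)
```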

cc-index-table

The Common Crawl dataset also provides an index table for filtering data, which is called cc-index-table.

The cc-index-table is an index of the existing data, providing a table-based index of WARC files. It allows for easy lookup of information, such as which WARC file corresponds to a specific URL.

The Common Crawl GitHub repo provides corresponding Athena statements to query the index. For explanations of each field, refer to Common Crawl Index Athena.

For example, you can create an Athena table to map cc-index data with the following code:

CREATE EXTERNAL TABLE IF NOT EXISTS ccindex (
  url_surtkey                   STRING,
  url                           STRING,
  url_host_name                 STRING,
  url_host_tld                  STRING,
  url_host_2nd_last_part        STRING,
  url_host_3rd_last_part        STRING,
  url_host_4th_last_part        STRING,
  url_host_5th_last_part        STRING,
  url_host_registry_suffix      STRING,
  url_host_registered_domain    STRING,
  url_host_private_suffix       STRING,
  url_host_private_domain       STRING,
  url_host_name_reversed        STRING,
  url_protocol                  STRING,
  url_port                      INT,
  url_path                      STRING,
  url_query                     STRING,
  fetch_time                    TIMESTAMP,
  fetch_status                  SMALLINT,
  fetch_redirect                STRING,
  content_digest                STRING,
  content_mime_type             STRING,
  content_mime_detected         STRING,
  content_charset               STRING,
  content_languages             STRING,
  content_truncated             STRING,
  warc_filename                 STRING,
  warc_record_offset            INT,
  warc_record_length            INT,
  warc_segment                  STRING)
PARTITIONED BY (
  crawl                         STRING,
  subset                        STRING)
STORED AS parquet
LOCATION 's3://commoncrawl/cc-index/table/cc-main/warc/';

-- add partitions (run as a separate statement)
MSCK REPAIR TABLE ccindex;

-- query
SELECT *
FROM ccindex
WHERE crawl = 'CC-MAIN-2018-05'
  AND subset = 'warc'
  AND url_host_tld = 'no'
LIMIT 10;

The preceding SQL statements demonstrate how to create an Athena table, add partitions, and run a query.
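
The warc_filename, warc_record_offset, and warc_record_length columns are what make the index useful for targeted retrieval: together they identify the byte range of a single record inside a large WARC file. As a sketch of that idea (not code from the Common Crawl repo), the following Python helpers turn one index row into the URL and HTTP Range header needed to fetch just that record. The function names are hypothetical and the row in the usage example is a placeholder.

```python
def record_byte_range(offset, length):
    """Return an HTTP Range header value covering one WARC record.

    HTTP byte ranges are inclusive on both ends, hence the "- 1".
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return f"bytes={offset}-{offset + length - 1}"


def record_request(row, base_url="https://data.commoncrawl.org/"):
    """Build the URL and headers for fetching the record described by an index row.

    row: mapping with warc_filename, warc_record_offset, and warc_record_length,
    as returned by a query against the ccindex table.
    """
    return {
        "url": base_url + row["warc_filename"],
        "headers": {
            "Range": record_byte_range(
                row["warc_record_offset"], row["warc_record_length"]
            )
        },
    }


if __name__ == "__main__":
    # Placeholder row for illustration; real values come from the index query.
    example_row = {
        "warc_filename": "crawl-data/CC-MAIN-2018-05/segments/example/warc/example.warc.gz",
        "warc_record_offset": 1000,
        "warc_record_length": 250,
    }
    print(record_request(example_row))
```

The bytes returned for such a range are expected to be a single gzip-compressed WARC record, which a library such as warcio can then parse.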

Filter data from the Common Crawl dataset

As you can see from the create table SQL statement, there are several fields that can help filter the data. For example, if you want to find Chinese documents from specific crawls, then the SQL statement could be as follows:

SELECT
  url,
  warc_filename,
  content_languages
FROM ccindex
WHERE (crawl = 'CC-MAIN-2023-14'
  OR crawl = 'CC-MAIN-2023-23')
  AND subset = 'warc'
  AND content_languages = 'zho'
LIMIT 10000

If you want to do further processing, you can save the results to another S3 bucket.
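
If you run this kind of filter repeatedly, it can help to generate the statement and submit it from code. The following Python sketch is one possible approach, not part of the original solution: the function names are hypothetical, and the database name and output location you pass in are your own. It builds the same filter for any list of crawls and submits it through the boto3 Athena client, which writes the result files to the S3 location given in ResultConfiguration.

```python
import re


def build_language_query(crawls, language, limit=10000, table="ccindex"):
    """Build the language filter query for one or more crawls."""
    if not crawls:
        raise ValueError("at least one crawl is required")
    # Only allow simple identifiers, because the values are embedded in SQL text.
    for value in list(crawls) + [language]:
        if not re.fullmatch(r"[A-Za-z0-9-]+", value):
            raise ValueError(f"unexpected characters in {value!r}")
    crawl_list = ", ".join(f"'{crawl}'" for crawl in crawls)
    return (
        "SELECT url, warc_filename, content_languages\n"
        f"FROM {table}\n"
        f"WHERE crawl IN ({crawl_list})\n"
        "  AND subset = 'warc'\n"
        f"  AND content_languages = '{language}'\n"
        f"LIMIT {int(limit)}"
    )


def run_in_athena(query, database, output_location):
    """Submit the query to Athena and return the query execution ID."""
    import boto3  # imported lazily so the query builder works without boto3

    client = boto3.client("athena")
    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


if __name__ == "__main__":
    print(build_language_query(["CC-MAIN-2023-14", "CC-MAIN-2023-23"], "zho"))
```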

Analyze the filtered data

The Common Crawl GitHub repository provides several PySpark examples for processing the raw data.

Let’s look at an example of running server_count.py (example script provided by the Common Crawl GitHub repo) on the data located in s3://commoncrawl/crawl-data/CC-MAIN-2023-23/segments/1685224643388.45/warc/.

First, you need a Spark environment, such as EMR Spark. For example, you can launch an Amazon EMR on EC2 cluster in us-east-1 (because the dataset is in us-east-1). Using an EMR on EC2 cluster can help you carry out tests before submitting jobs to the production environment.

After launching an EMR on EC2 cluster, log in to the primary node of the cluster over SSH. Then, package the Python environment and submit the script (refer to the Conda documentation to install Miniconda):

# create conda environment
conda create -y -n example -c dmnapolitano python=3.7 botocore boto3 ujson requests conda-pack warcio

# package the conda env
conda activate example
conda pack -o environment.tar.gz

# get script from common crawl github
git clone https://github.com/commoncrawl/cc-pyspark.git

# copy target file path to local
aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2023-23/warc.paths.gz .
gzip -d warc.paths.gz

# put warc list to hdfs
hdfs dfs -put warc.paths

# submit job
spark-submit --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--conf spark.sql.warehouse.dir=s3://xxxx-common-crawl/output/ \
--master yarn \
--deploy-mode cluster \
--archives environment.tar.gz#environment \
--py-files cc-pyspark/sparkcc.py cc-pyspark/server_count.py --input_base_url s3://commoncrawl/ ./warc.paths count_demo

It can take time to process all references in the warc.paths file. For demo purposes, you can improve the processing time with the following strategies:

  • Download the file s3://commoncrawl/crawl-data/CC-MAIN-2023-23/warc.paths.gz to your local machine, unzip it, and then upload it to HDFS or Amazon S3. This is because a .gz file is not splittable. You need to unzip it to process this file in parallel.
  • Modify the warc.paths file, delete most of its lines, and only keep two lines to make the job run much faster.
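
The two strategies above can be combined in a few lines of Python. This is a minimal sketch with a hypothetical function name: it decompresses a downloaded warc.paths.gz and writes only the first few entries to a plain-text file, which you can then upload to HDFS or Amazon S3 as the job input.

```python
import gzip


def sample_paths(gz_path, out_path, keep=2):
    """Decompress a *.paths.gz listing and write only its first `keep` entries.

    Returns the number of entries written.
    """
    written = 0
    with gzip.open(gz_path, "rt", encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if written >= keep:
                break
            entry = line.strip()
            if entry:  # skip blank lines
                dst.write(entry + "\n")
                written += 1
    return written


if __name__ == "__main__":
    # Assumes warc.paths.gz was already downloaded to the current directory.
    count = sample_paths("warc.paths.gz", "warc.paths", keep=2)
    print(f"kept {count} WARC paths for the demo run")
```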

After the job is complete, you can see the result in s3://xxxx-common-crawl/output/, in Parquet format.

Implement customized processing logic

The Common Crawl GitHub repo provides a common approach to process WARC files. Generally, you can extend the CCSparkJob to override a single method (process_record), which is sufficient for many cases.

Let’s look at an example to get the IMDB reviews of recent movies. First, you need to filter for files from the IMDB site:

SELECT
  url,
  warc_filename,
  url_host_name
FROM ccindex
WHERE (crawl = 'CC-MAIN-2023-06'
  OR crawl = 'CC-MAIN-2023-40')
  AND subset = 'warc'
  AND url LIKE 'https://www.imdb.com/title/%/reviews'
LIMIT 1000

Then you can get WARC file lists that contain IMDB review data, and save the WARC file names as a list in a text file.

Alternatively, you can use EMR Spark to get the WARC file list and store it in Amazon S3. For example:

sql = """SELECT
  warc_filename
FROM ccindex
WHERE (crawl = 'CC-MAIN-2023-06'
  OR crawl = 'CC-MAIN-2023-40')
  AND subset = 'warc'
  AND url LIKE 'https://www.imdb.com/title/%/reviews'
"""

warc_list = spark.sql(sql)

# write result list to s3
warc_list.coalesce(1).write.mode("overwrite").text("s3://xxxx-common-crawl/warclist/imdb_warclist")

The output file should look similar to s3://xxxx-common-crawl/warclist/imdb_warclist/part-00000-6af12797-0cdc-4ef2-a438-cf2b935f2ffd-c000.txt.

The next step is to extract user reviews from these WARC files. You can extend the CCSparkJob to override the process_record() method:

from sparkcc import CCSparkJob
from bs4 import BeautifulSoup
from urllib.parse import urlsplit


class IMDB_Extract_Job(CCSparkJob):
    name = "IMDB_Reviews"

    def process_record(self, record):
        if self.is_response_record(record):
            # WARC response record
            domain = urlsplit(record.rec_headers['WARC-Target-URI']).hostname
            if domain == 'www.imdb.com':
                # get web contents
                contents = (
                    record.content_stream()
                        .read()
                        .decode("utf-8", "replace")
                )

                # parse with beautiful soup
                soup = BeautifulSoup(contents, "html.parser")

                # get reviews
                review_divs = soup.find_all(class_="text show-more__control")
                for div in review_divs:
                    # emit (review text, 1) pairs; the framework aggregates them by key
                    yield div.text, 1


if __name__ == "__main__":
    job = IMDB_Extract_Job()
    job.run()

You can save the preceding script as imdb_extractor.py, which you’ll use in the following steps. After you have prepared the data and scripts, you can use EMR Serverless to process the filtered data.

EMR Serverless

EMR Serverless is a serverless deployment option to run big data analytics applications using open source frameworks like Apache Spark and Hive without configuring, managing, and scaling clusters or servers.

With EMR Serverless, you can run analytics workloads at any scale with automatic scaling that resizes resources in seconds to meet changing data volumes and processing requirements. EMR Serverless automatically scales resources up and down to provide the right amount of capacity for your application, and you only pay for what you use.

Processing the Common Crawl dataset is generally a one-time processing task, making it suitable for EMR Serverless workloads.

Create an EMR Serverless application

You can create an EMR Serverless application on the EMR Studio console. Complete the following steps:

  1. On the EMR Studio console, choose Applications under Serverless in the navigation pane.
  2. Choose Create application.
  3. Provide a name for the application and choose an Amazon EMR version.
  4. If access to VPC resources is required, add a customized network setting.
  5. Choose Create application.

Your Spark serverless environment will then be ready.

Before you can submit a job to EMR Spark Serverless, you still need to create an execution role. Refer to Getting started with Amazon EMR Serverless for more details.

Process Common Crawl data with EMR Serverless

After your EMR Spark Serverless application is ready, complete the following steps to process the data:

  1. Prepare a Conda environment and upload it to Amazon S3, which will be used as the environment in EMR Spark Serverless.
  2. Upload the scripts to be run to an S3 bucket. In the following example, there are two scripts:
    1. imdb_extractor.py – Customized logic to extract contents from the dataset. The contents can be found earlier in this post.
    2. cc-pyspark/sparkcc.py – The example PySpark framework from the Common Crawl GitHub repo, which must be included.
  3. Submit the PySpark job to EMR Serverless Spark. Define the following parameters to run this example in your environment:
    1. application-id – The application ID of your EMR Serverless application.
    2. execution-role-arn – Your EMR Serverless execution role. To create it, refer to Create a job runtime role.
    3. WARC file location – The location of your WARC files. s3://xxxx-common-crawl/warclist/imdb_warclist/part-00000-6af12797-0cdc-4ef2-a438-cf2b935f2ffd-c000.txt contains the filtered WARC file list, which you obtained earlier in this post.
    4. spark.sql.warehouse.dir – The default warehouse location (use your S3 directory).
    5. spark.archives – The S3 location of the prepared Conda environment.
    6. spark.submit.pyFiles – The prepared PySpark script sparkcc.py.

See the following code:

# 1. create conda environment
conda create -y -n imdb -c dmnapolitano python=3.7 botocore boto3 ujson requests conda-pack warcio bs4

# 2. package the conda env, and upload to s3
conda activate imdb
conda pack -o imdbenv.tar.gz
aws s3 cp imdbenv.tar.gz s3://xxxx-common-crawl/env/

# 3. upload scripts to S3
aws s3 cp imdb_extractor.py s3://xxxx-common-crawl/scripts/
aws s3 cp cc-pyspark/sparkcc.py s3://xxxx-common-crawl/scripts/

# 4. submit job to EMR Serverless
#!/bin/bash
aws emr-serverless start-job-run \
    --application-id 00fdsobht2skro2l \
    --execution-role-arn arn:aws:iam::xxxx:role/EMR-Serverless-JobExecutionRole \
    --name imdb-retrive \
    --job-driver '{
        "sparkSubmit": {
          "entryPoint": "s3://xxxx-common-crawl/scripts/imdb_extractor.py",
          "entryPointArguments": ["--input_base_url", "s3://commoncrawl/", "s3://xxxx-common-crawl/warclist/imdb_warclist/part-00000-6af12797-0cdc-4ef2-a438-cf2b935f2ffd-c000.txt", "imdb_reviews", "--num_output_partitions", "1"],
          "sparkSubmitParameters": "--conf spark.sql.warehouse.dir=s3://xxxx-common-crawl/output/ --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=10000000 --conf spark.executor.instances=100 --conf spark.executor.cores=4 --conf spark.executor.memory=16g --conf spark.driver.memory=16g --conf spark.archives=s3://xxxx-common-crawl/env/imdbenv.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.submit.pyFiles=s3://xxxx-common-crawl/scripts/sparkcc.py"
        }
}'
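
Long sparkSubmitParameters strings are easy to mistype when they are edited by hand inside JSON. As an optional alternative, the following Python sketch assembles the same job-driver structure from a dictionary of Spark settings. The helper names are hypothetical, and the bucket and script names simply mirror the placeholders used in this post. The resulting dictionary can be serialized with json.dumps for the --job-driver option, or passed as the jobDriver argument of the boto3 emr-serverless client's start_job_run call.

```python
import json


def build_spark_submit_parameters(conf):
    """Render a dict of Spark settings as a '--conf key=value ...' string."""
    return " ".join(f"--conf {key}={value}" for key, value in conf.items())


def build_job_driver(entry_point, arguments, conf):
    """Assemble the sparkSubmit job-driver structure for EMR Serverless."""
    return {
        "sparkSubmit": {
            "entryPoint": entry_point,
            "entryPointArguments": list(arguments),
            "sparkSubmitParameters": build_spark_submit_parameters(conf),
        }
    }


if __name__ == "__main__":
    # Placeholder bucket and settings, mirroring the CLI example in this post.
    spark_conf = {
        "spark.sql.warehouse.dir": "s3://xxxx-common-crawl/output/",
        "spark.executor.instances": "100",
        "spark.executor.cores": "4",
        "spark.executor.memory": "16g",
        "spark.driver.memory": "16g",
        "spark.archives": "s3://xxxx-common-crawl/env/imdbenv.tar.gz#environment",
        "spark.submit.pyFiles": "s3://xxxx-common-crawl/scripts/sparkcc.py",
    }
    driver = build_job_driver(
        "s3://xxxx-common-crawl/scripts/imdb_extractor.py",
        ["--input_base_url", "s3://commoncrawl/", "imdb_reviews"],
        spark_conf,
    )
    print(json.dumps(driver, indent=2))
```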

After the job is complete, the extracted reviews are stored in Amazon S3. To check the contents, you can use Amazon S3 Select, as shown in the following screenshot.

Considerations

The following are the points to consider when dealing with massive amounts of data with customized code:

  • Some third-party Python libraries may not be available in Conda. In such cases, you can switch to a Python virtual environment to build the PySpark runtime environment.
  • If there is a massive amount of data to be processed, try to create and use multiple EMR Serverless Spark applications to parallelize it. Each application deals with a subset of file lists.
  • You may encounter a slowdown issue with Amazon S3 when filtering or processing the Common Crawl data. This is because the S3 bucket storing the data is publicly accessible, and other users may access the data at the same time. To mitigate this issue, you can add a retry mechanism or sync specific data from the Common Crawl S3 bucket to your own bucket.
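The retry mechanism mentioned in the last point could look something like the following minimal sketch. It assumes the read is wrapped in a zero-argument callable; the helper name `with_retries` and its parameters are hypothetical and not part of the original scripts. Production code would normally catch only throttling-related errors rather than every exception.

```python
import random
import time


def with_retries(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or max_attempts is reached.

    fetch is any zero-argument callable, for example a wrapper around
    an S3 read. All names in this sketch are hypothetical.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:
            # Assumption: every error is treated as retryable here.
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff (base, 2x, 4x, ...) plus random jitter
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
```

The jitter spreads out retries from many executors so that they do not all hit the bucket again at the same moment.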

Fine-tune Llama 2 with SageMaker

After the data is prepared, you can fine-tune a Llama 2 model with it. You can do so using SageMaker JumpStart, without writing any code. For more information, refer to Fine-tune Llama 2 for text generation on Amazon SageMaker JumpStart.

In this scenario, you carry out domain adaptation fine-tuning. For this type of fine-tuning, the input consists of a CSV, JSON, or TXT file. You need to put all review data in a TXT file. To do so, you can submit a straightforward Spark job to EMR Serverless. See the following sample code snippet:

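from pyspark.sql import SparkSession

# Reuse the active session, or create one when run as a standalone job
spark = SparkSession.builder.getOrCreate()

# Disable generating the _SUCCESS file
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

data = spark.read.parquet("s3://xxxx-common-crawl/output/imdb_reviews/")

# Write the selected column to a single text file
data.select('Key').coalesce(1).write.mode("overwrite").text("s3://xxxx-common-crawl/llama2/train/")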

After you prepare the training data, enter the data location for Training data set, then choose Train.

You can track the training job status.

Evaluate the fine-tuned model

After training is complete, choose Deploy in SageMaker JumpStart to deploy your fine-tuned model.

After the model is successfully deployed, choose Open Notebook, which redirects you to a prepared Jupyter notebook where you can run your Python code.

You can use the image Data Science 2.0 and the Python 3 kernel for the notebook.

Then, you can evaluate the fine-tuned model and the original model in this notebook.

import json

import boto3

endpoint_name_original = "jumpstart-dft-meta-textgeneration-llama-2-7b-origin"
endpoint_name_fine_tuned = "jumpstart-ftc-meta-textgeneration-llama-2-7b"

payload = {
    "inputs": "The review of movie 'A Woman of Paris: A Drama of Fate' is ",
    "parameters": {
        "max_new_tokens": 256,
        "top_p": 0.9,
        "temperature": 0.6,
        "return_full_text": True,
    },
}

def query_endpoint(payload, endpoint_name):
    client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
        CustomAttributes="accept_eula=true",
    )
    response = response["Body"].read().decode("utf8")
    response = json.loads(response)
    print(endpoint_name + ": \n" + response[0]['generation'])


query_endpoint(payload, endpoint_name_original)
print("\n-----#################-----\n")
query_endpoint(payload, endpoint_name_fine_tuned)

The following are two responses returned by the original model and fine-tuned model for the same question.

We provided both models with the same sentence: “The review of movie ‘A Woman of Paris: A Drama of Fate’ is” and let them complete the sentence.

The original model outputs meaningless sentences:

"The review of movie 'A woman of Paris: A Drama of Fate' is 3.0/5.

A Woman of Paris: A Drama of Fate(1923)

A Woman of Paris: A Drama of Fate movie released on 17 October, 1992. The movie is directed by. A Woman of Paris: A Drama of Fate featured Jeanne Eagles, William Haines, Burr McIntosh and Jack Rollens in lead rols.

..."

In contrast, the fine-tuned model’s outputs are more like a movie review:

" The review of movie 'A Woman of Paris: A Drama of Fate' is 6.3/10. I liked the story, the plot, the character, the background. The performances are amazing. Rory (Judy Davis) is an Australian photographer who travels to Africa to photograph the people, wildlife, and scenery. She meets Peter (Donald Sutherland), a zoologist, and they begin a relationship..."

The fine-tuned model’s output reads much more like a movie review, so it performs better in this specific scenario.

Clean up

After you finish this exercise, complete the following steps to clean up your resources:

  1. Delete the S3 bucket that stores the cleaned dataset.
  2. Stop the EMR Serverless environment.
  3. Delete the SageMaker endpoint that hosts the LLM.
  4. Delete the SageMaker domain that runs your notebooks.

By default, the EMR Serverless application you created stops automatically after 15 minutes of inactivity.

Generally, you don’t need to clean up the Athena environment because there are no charges when you’re not using it.
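Two of the cleanup steps above can also be scripted. The following is a rough sketch, not the authors' procedure: it assumes you pass in boto3 clients created with `boto3.client("sagemaker")` and `boto3.client("emr-serverless")`, and the function name `clean_up` is hypothetical. It does not cover deleting the S3 bucket or the SageMaker domain.

```python
def clean_up(sagemaker_client, emr_client, endpoint_name, application_id):
    """Delete a SageMaker endpoint and stop an EMR Serverless application.

    Both clients are assumed to be boto3 clients supplied by the caller;
    the endpoint name and application ID are your own values.
    """
    # Remove the endpoint that hosts the deployed model
    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
    # Stop the application; its jobs must be completed or cancelled first
    emr_client.stop_application(applicationId=application_id)


# Example usage (requires AWS credentials, so it is left commented out):
# import boto3
# clean_up(
#     boto3.client("sagemaker"),
#     boto3.client("emr-serverless"),
#     "jumpstart-ftc-meta-textgeneration-llama-2-7b",
#     "00fdsobht2skro2l",
# )
```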

Conclusion

In this post, we introduced the Common Crawl dataset and how to use EMR Serverless to process the data for LLM fine-tuning. Then we demonstrated how to use SageMaker JumpStart to fine-tune the LLM and deploy it without any code. For more use cases of EMR Serverless, refer to Amazon EMR Serverless. For more information about hosting and fine-tuning models on Amazon SageMaker JumpStart, refer to the SageMaker JumpStart documentation.


About the Authors

Shijian Tang is an Analytics Specialist Solution Architect at Amazon Web Services.

Matthew Liem is a Senior Solution Architecture Manager at Amazon Web Services.

Dalei Xu is an Analytics Specialist Solution Architect at Amazon Web Services.

Yuanjun Xiao is a Senior Solution Architect at Amazon Web Services.

What’s the Diff: RAM vs. Storage

Post Syndicated from Molly Clancy original https://www.backblaze.com/blog/whats-diff-ram-vs-storage/

A decorative image showing a RAM chip and a hard drive with the words What's the Diff in the center.
Editor’s Note: This post was originally published in 2016 and has since been updated in 2022 and 2023 with the latest information on RAM vs. storage.

Memory is a finite resource when it comes to both humans and computers—it’s one of the most common causes of computer issues. And if you’ve ever left the house without your keys, you know memory is one of the most common human problems, too.

If you’re unclear about the different types of memory in your computer, it makes pinpointing the cause of computer problems that much harder. You might hear folks use the terms memory and storage interchangeably, but there are some important differences. Understanding how both components work can help you understand what kind of computer you need, diagnose problems you’re having, and know when it’s time to consider upgrades. 

The Difference Between RAM and Storage

Random access memory (RAM) and storage are both forms of computer memory, but they serve different functions. 

What Is RAM?

RAM is volatile memory used by the computer’s processor to store and quickly access data that is actively being used or processed. Volatile memory maintains data only while the device is powered on. RAM takes the form of computer chips—integrated circuits—that are either soldered directly onto the main logic board of your computer or installed in memory modules that go in sockets on your computer’s logic board.

You can think of it like a desk—it’s where your computer gets work done. When you double-click on an app, open a document, or do much of anything, part of your “desk” is covered and can’t be used by anything else. As you open more files, it is like covering your desk with more and more items. Using a desk with a handful of files is easy, but a desk that is covered with a bunch of stuff gets difficult to use.

What Is Computer Storage?

On the other hand, storage is used for long-term data retention, like a hard disk drive (HDD) or solid state drive (SSD). Compared with RAM, this type of storage is non-volatile, which means it retains information even when a device is powered off. You can think of storage like a filing cabinet—a place next to your desk where you can retrieve information as needed. 

RAM vs. Storage: How Do They Compare?

Speed and Performance

Speed is one of the primary differences between RAM and storage. RAM is significantly faster: data in RAM can be written and read in nanoseconds. DDR4 RAM, a widely used type of RAM technology, is capable of a peak transfer rate of 25.6GB/s. RAM also has a very fast path to the computer’s central processing unit (CPU), the brain of the computer that does most of the work.

Storage is slower by comparison. It is responsible for holding the operating system (OS), applications, and user data for the long term, so it should still be fast, but it doesn’t need to be as fast as RAM.
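To see where a figure like 25.6GB/s comes from, here is a small worked calculation. It assumes the number refers to a DDR4-3200 module (3,200 million transfers per second) on a standard 64-bit memory channel; the article does not state the speed grade, so treat that as an assumption, and the function name is only illustrative.

```python
def peak_transfer_rate_gb_s(mega_transfers_per_s, bus_width_bits=64):
    """Peak theoretical bandwidth of one memory channel, in GB/s."""
    bytes_per_transfer = bus_width_bits / 8  # a 64-bit bus moves 8 bytes
    return mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


# Assumed speed grade: DDR4-3200 on a single 64-bit channel
print(peak_transfer_rate_gb_s(3200))  # 25.6
```

This is a theoretical peak for a single channel; real-world throughput is lower.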

That said, computer storage is getting faster thanks to the popularity of SSDs. SSDs are much faster than hard drives because they use integrated circuits instead of spinning platters and moving read/write heads. SSDs store data in NAND flash, a type of non-volatile memory, so the data stays in place even when the computer is turned off.

Even though SSDs are faster than HDDs, they’re still slower than RAM. There are two reasons for that difference in speed. First, the memory chips in SSDs are slower than those in RAM. Second, there is a bottleneck created by the interface that connects the storage device to the computer. RAM, in comparison, has a much faster interface.

Capacity and Size

RAM is typically smaller in capacity than storage. RAM is usually measured in gigabytes (GB) and reaches terabytes (TB) only in high-end systems, whereas storage capacities can reach multiple terabytes or even petabytes. The smaller size of RAM is intentional, as it is designed to store only the data currently in use, ensuring quick access for the processor.

Volatility and Persistence

Another key difference is the volatility of RAM and the persistence of storage. RAM is volatile, meaning it loses its data when the power is turned off or the system is rebooted. RAM is built for quick data access and manipulation, but its volatility makes it unsuitable for long-term storage. Storage is non-volatile or persistent, meaning it retains data even when the power is off, making it suitable for holding files, applications, and the operating system over extended periods.

How Much RAM Do I Have?

Understanding how much RAM you have might be one of your first steps for diagnosing computer performance issues. 

Use the following steps to confirm how much RAM your computer has installed. We’ll start with an Apple computer. Click on the Apple menu and then click About This Mac. In the screenshot below, we can see that the computer has 16GB of RAM.

A screenshot of the Mac system screen that shows a computer summary with total RAM.
How much RAM on macOS (Apple menu > About This Mac).

With a Windows 11 computer, use the following steps to see how much RAM you have installed. Open the Control Panel by clicking the Windows button and typing “control panel,” then click System and Security, and then click System. Look for the line “Installed RAM.” In the screenshot below, you can see that the computer has 32GB of RAM installed.

A screenshot from a Windows computer showing installed RAM.
How much RAM on Windows 11 (Control Panel > System and Security > System).

How Much Computer Storage Do I Have?

To view how much free storage space you have available on a Mac computer, use these steps. Click on the Apple menu, then System Settings, select General, and then open Storage. In the screenshot below, we’ve circled where your available storage is displayed.

A screenshot from a Mac showing total storage and usage.
Disk space on macOS (Apple menu > System Settings > General > Storage).

With a Windows 11 computer, it is also easy to view how much available storage space you have. Click the Windows button and type in “file explorer.” When File Explorer opens, click on This PC from the list of options in the left-hand pane. In the screenshot below, we’ve circled where your available storage is displayed (in this case, 200GB).

A screenshot from a Windows computer showing available and used storage.
Disk space on Windows 11 (File Explorer > This PC).
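If you prefer a script to clicking through menus, the same storage numbers can be read with Python’s standard library. This is a minimal sketch: `shutil.disk_usage` is a real standard-library call, while the helper name `storage_summary` and the choice of decimal gigabytes are just illustrative.

```python
import shutil


def storage_summary(path="/"):
    """Return total, used, and free space in GB for the drive holding path."""
    usage = shutil.disk_usage(path)
    gb = 1000 ** 3  # decimal gigabytes, as drive makers advertise them
    return {
        "total_gb": usage.total / gb,
        "used_gb": usage.used / gb,
        "free_gb": usage.free / gb,
    }


# On Windows, pass a drive such as "C:\\" instead of "/"
summary = storage_summary(".")
print(f"Free: {summary['free_gb']:.1f} GB of {summary['total_gb']:.1f} GB")
```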

How RAM and Storage Affect Your Computer’s Performance

RAM

For most general-purpose uses of computers, such as email, writing documents, surfing the web, or watching Netflix, the RAM that comes with your computer is enough. If you own your computer for a long enough time, you might need to add a bit more to keep up with memory demands from newer apps and OSes. Specifically, more RAM makes it possible for you to use more apps, documents, and larger files at the same time.

People who work with very large files like large databases, videos, and images can benefit significantly from having more RAM. If you regularly use large files, it is worth checking to see if your computer’s RAM is upgradeable.

Adding More RAM to Your Computer

In some situations, adding more RAM is worth the expense. For example, editing videos and high-resolution images takes a lot of memory. In addition, high-end audio recording and editing as well as some scientific work require significant RAM.

However, not all computers allow you to upgrade RAM. For example, Chromebooks typically have a fixed amount of RAM, and you cannot install more. So, when you’re buying a new computer, particularly if you plan on using it for more than five years, make sure you understand 1) how much RAM the computer has, and 2) whether you can upgrade it.

When your computer’s RAM is filled up, your computer has to get creative to keep working. Specifically, your computer starts to temporarily use your hard drive or SSD as “virtual memory.” If you have relatively fast storage like an SSD, virtual memory will be reasonably fast, though still much slower than RAM. On the other hand, using a traditional hard drive will be fairly slow.

Storage

Besides RAM, the most serious bottleneck to improving performance in your computer can be your storage. Even with plenty of RAM installed, computers need to read and write information from the storage system (i.e., the HDD or the SSD).

Hard drives come in different speeds and sizes. For laptops and desktops, the most common spindle speeds are 5,400RPM and 7,200RPM. In some cases, you might even decide to use a 10,000RPM drive. Faster drives cost more, are louder, have greater cooling needs, and use more power, but they may be a good option if you need the extra performance.
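One way to see why spindle speed matters is average rotational latency: on average, the drive waits half a revolution for the right sector to come around under the head. The short calculation below illustrates that one component only. It ignores seek time and transfer time, and the function name is illustrative.

```python
def avg_rotational_latency_ms(rpm):
    """Average wait for a sector to rotate under the head, in milliseconds."""
    seconds_per_revolution = 60 / rpm
    # On average the platter has to turn half a revolution
    return seconds_per_revolution / 2 * 1000


for rpm in (5400, 7200, 10000):
    print(f"{rpm}RPM: {avg_rotational_latency_ms(rpm):.2f} ms")
# 5400RPM: 5.56 ms
# 7200RPM: 4.17 ms
# 10000RPM: 3.00 ms
```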

New disk technologies enable hard drives to be bigger and faster. These technologies include filling the drive with helium instead of air to reduce disk platter friction and using heat or microwaves to improve disk density, such as with heat-assisted magnetic recording (HAMR) drives and microwave-assisted magnetic recording (MAMR) drives.

Today, SSDs are becoming increasingly popular for computer storage. This type of computer storage is popular because it is faster, cooler, and takes up less space than traditional hard drives. They’re also less susceptible to magnetic fields and physical jolts, which makes them great for laptops. 

For more about the difference between HDDs and SSDs, check out our post, “Hard Disk Drive (HDD) vs. Solid-state Drive (SSD): What’s the Diff?”

Adding More Computer Storage

As a user’s disk storage needs increase, typically they will look to larger drives to store more data. The first step might be to replace an existing drive with a larger, faster drive. Or you might decide to install a second drive. One approach is to use different drives for different purposes. For example, use an SSD for the operating system, and then store your business videos on a larger SSD.

If more storage space is needed, you can also use an external drive, most often using USB or Thunderbolt to connect to the computer. This can be a single drive or multiple drives and might use a data storage virtualization technology such as RAID to protect the data.

If you have really large amounts of data, or simply wish to make it easy to share data with others in your location or elsewhere, you might consider network-attached storage (NAS). A NAS device can hold multiple drives, typically uses a data virtualization technology like RAID, and is accessible to anyone on your local network and—if you wish—on the internet, as well. NAS devices can offer a great deal of storage and other services that typically have been offered only by dedicated network servers in the past.

Back Up Early and Often

As a cloud storage company, we’d be remiss not to mention that you should back up your computer. No matter how you configure your computer’s storage, remember that technology can fail (we know a thing or two about that). You always want a backup so you can restore everything easily. The best backup strategy shouldn’t be dependent on any single device, either. Your backup strategy should always include three copies of your data on two different types of media, with one copy off-site.
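The three-copies, two-media, one-off-site rule is easy to express as a quick check. The sketch below is purely illustrative: the `Copy` record and `satisfies_3_2_1` function are hypothetical names, and the media labels are examples.

```python
from dataclasses import dataclass


@dataclass
class Copy:
    medium: str    # for example "internal SSD", "external HDD", "cloud"
    offsite: bool  # True if the copy lives away from your location


def satisfies_3_2_1(copies):
    """Check for 3+ copies, 2+ distinct media, and at least 1 off-site copy."""
    media = {c.medium for c in copies}
    has_offsite = any(c.offsite for c in copies)
    return len(copies) >= 3 and len(media) >= 2 and has_offsite


plan = [
    Copy("internal SSD", offsite=False),
    Copy("external HDD", offsite=False),
    Copy("cloud", offsite=True),
]
print(satisfies_3_2_1(plan))  # True
```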

FAQs About Differences Between RAM and Storage

What is the difference between internal storage and RAM?

Internal storage is a method of data storage that writes data to a disk, holding onto that data until it’s erased. Think of it as your computer’s long-term memory. RAM is fast, temporary memory that holds the data your device’s CPU is actively working on. Think of it as your brain’s short-term memory and ability to multitask. The data in RAM is volatile, so it lasts only until it’s no longer needed or until you turn off the power or reset the computer.

Is it better to have more RAM or more storage?

If you’re looking for better PC performance, you can upgrade either RAM or storage for a boost in performance. More RAM will make it easier for your computer to perform multiple tasks at once, while upgrading your storage will improve battery life, make it faster to open applications and files, and give you more space for photos and applications. This is especially true if you’re switching your storage from a hard disk drive (HDD) to a solid state drive (SSD).

Does RAM give you more storage?

More RAM does not provide you with more free space. If your computer is giving you notifications that you’re getting close to running out of storage or you’ve already started having to delete files to make room for new ones, you should upgrade the internal storage, not the RAM.

Are memory and storage the same?

Memory and storage are not the same thing, even though the words are often used interchangeably. Memory is another term for RAM.

The post What’s the Diff: RAM vs. Storage appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

[$] The hard life of a virtual-filesystem developer

Post Syndicated from corbet original https://lwn.net/Articles/960088/

Filesystem development is not an easy task; the performance demands are
typically high, and the consequences for mistakes usually involve lost data
and irate users. The implementation of a virtual (or “pseudo”) filesystem
— a filesystem implemented within the kernel and lacking a normal backing
store — can also be challenging, but for different reasons. A series of
conversations around the eventfs virtual filesystem has turned a spotlight
on the difficulty of creating a virtual filesystem for Linux.

Damn Small Linux 2024 released

Post Syndicated from corbet original https://lwn.net/Articles/960446/

A new version of the Damn Small
Linux
distribution has come out with an updated definition of “damn
small”:

The new goal of DSL is to pack as much usable desktop distribution
into an image small enough to fit on a single CD, or a hard limit
of 700MB. This project is meant to service older computers and have
them continue to be useful far into the future. Such a notion sits
well with my values. I think of this project as my way of keeping
otherwise usable hardware out of landfills.