All posts by Matthew Garrett

Not here

2026-01-06 Matthew Garrett

Post Syndicated from Matthew Garrett original https://mjg59.dreamwidth.org/74084.html

Hello! I am not posting here any more. You can find me here instead. Most Planets should be updated already (I’ve an MR open for Planet Gnome), but if you’re subscribed to my feed directly please update it.

How did IRC ping timeouts end up in a lawsuit?

2025-12-17 Matthew Garrett

Post Syndicated from Matthew Garrett original https://mjg59.dreamwidth.org/73777.html

I recently won a lawsuit against Roy and Rianne Schestowitz, the authors and publishers of the Techrights and Tuxmachines websites. The short version of events is that they were subject to an online harassment campaign, which they incorrectly blamed me for. They responded with a large number of defamatory online posts about me, which the judge described as unsubstantiated character assassination and consequently awarded me significant damages. That’s not what this post is about, as such. It’s about the sole meaningful claim made that tied me to the abuse.

In the defendants’ defence and counterclaim[1], 15.27 asserts in part: “The facts linking the Claimant to the sock puppet accounts include, on the IRC network: simultaneous dropped connections to the mjg59_ and elusive_woman accounts. This is so unlikely to be coincidental that the natural inference is that the same person posted under both names.” “elusive_woman” here is an account linked to the harassment, and “mjg59_” is me. This is actually a surprisingly interesting claim to make, and it’s worth going into in some more detail.

The event in question occurred on the 28th of April, 2023. You can see a line reading *elusive_woman has quit (Ping timeout: 2m30s), followed by one reading *mjg59_ has quit (Ping timeout: 2m30s). The timestamp listed for the first is 09:52, and for the second 09:53. Is that actually simultaneous? We can actually gain some more information – if you hover over the timestamp links on the right hand side you can see that the link is actually accurate to the second even if that’s not displayed. The first event took place at 09:52:52, and the second at 09:53:03. That’s 11 seconds apart, which is clearly not simultaneous, but maybe it’s close enough. Figuring out more requires knowing what a “ping timeout” actually means here.

The IRC server in question is running Ergo (link to source code), and the relevant function is handleIdleTimeout(). The logic here is fairly simple – track the time since activity was last seen from the client. If that time is longer than DefaultIdleTimeout (which defaults to 90 seconds) and a ping hasn’t been sent yet, send a ping to the client. If a ping has been sent and the timeout is greater than DefaultTotalTimeout (which defaults to 150 seconds), disconnect the client with a “Ping timeout” message. There’s no special logic for handling the ping reply – a pong simply counts as any other client activity and resets the “last activity” value and timeout.

What does this mean? Well, for a start, two clients running on the same system will only have simultaneous ping timeouts if their last activity was simultaneous. Let’s imagine a machine with two clients, A and B. A sends a message at 02:22:59. B sends a message 2 seconds later, at 02:23:01. The idle timeout for A will fire at 02:24:29, and for B at 02:24:31. A ping is sent for A at 02:24:29 and is responded to immediately – the idle timeout for A is now reset to 02:25:59, 90 seconds later. The machine hosting A and B has its network cable pulled out at 02:24:30. The ping to B is sent at 02:24:31, but receives no reply. A minute later, at 02:25:31, B quits with a “Ping timeout” message. A ping is sent to A at 02:25:59, but receives no reply. A minute later, at 02:26:59, A quits with a “Ping timeout” message. Despite both clients having their network interrupted simultaneously, the ping timeouts occur 88 seconds apart.

So, two clients disconnecting with ping timeouts 11 seconds apart is not incompatible with the network connection being interrupted simultaneously – depending on activity, simultaneous network interruption may result in disconnections up to 90 seconds apart. But another way of looking at this is that network interruptions may occur up to 90 seconds apart and generate simultaneous disconnections[2]. Without additional information it’s impossible to determine which is the case.
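Both directions can be checked with a small model of the timeout logic described above. This is a sketch of the behaviour as the post describes it, not Ergo's actual code (which is Go): the function and constant names are made up, and it assumes a client sends nothing but pongs after its last message and that a pong arrives instantly while the link is up.

```python
from datetime import datetime, timedelta

IDLE_TIMEOUT = timedelta(seconds=90)    # silence before the server sends a ping
TOTAL_TIMEOUT = timedelta(seconds=150)  # silence before "Ping timeout"


def t(clock: str) -> datetime:
    """Parse HH:MM:SS onto an arbitrary date so times can be subtracted."""
    return datetime.strptime(clock, "%H:%M:%S")


def ping_timeout_at(last_activity: datetime, network_cut: datetime) -> datetime:
    """Return when the server drops a client whose link dies at network_cut.

    Assumes the client sends nothing except pongs after last_activity,
    and that network_cut is later than last_activity.
    """
    while True:
        ping_sent = last_activity + IDLE_TIMEOUT
        if ping_sent >= network_cut:
            # The ping goes unanswered, so the last real activity stands.
            return last_activity + TOTAL_TIMEOUT
        # The pong counts as activity and restarts the clock.
        last_activity = ping_sent


# Same network cut, different quit times (clients A and B from the text).
cut = t("02:24:30")
a_quit = ping_timeout_at(t("02:22:59"), cut)
b_quit = ping_timeout_at(t("02:23:01"), cut)
print((a_quit - b_quit).total_seconds())  # 88.0

# Different network cuts, same quit time (the scenario in footnote 2).
print(ping_timeout_at(t("02:22:59"), t("02:23:00"))
      == ping_timeout_at(t("02:22:59"), t("02:24:28")))  # True
```

The only inputs that matter are the last activity time and whether the link was up when the ping went out, which is why the quit times alone say so little about when the network actually dropped.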

This already casts doubt over the assertion that the disconnection was simultaneous, but if this is unusual enough it’s still potentially significant. Unfortunately for the Schestowitzes, even looking just at the elusive_woman account, there were several cases where elusive_woman and another user had a ping timeout within 90 seconds of each other – including one case where elusive_woman and schestowitz[TR] disconnect 40 seconds apart. By the Schestowitzes’ argument, it’s also a natural inference that elusive_woman and schestowitz[TR] (one of Roy Schestowitz’s accounts) are the same person.

We didn’t actually need to make this argument, though. In England it’s necessary to file a witness statement describing the evidence that you’re going to present in advance of the actual court hearing. Despite being warned of the consequences on multiple occasions the Schestowitzes never provided any witness statements, and as a result weren’t allowed to provide any evidence in court, which made for a fairly foregone conclusion.

[1] As well as defending themselves against my claim, the Schestowitzes made a counterclaim on the basis that I had engaged in a campaign of harassment against them. This counterclaim failed.

[2] Client A and client B both send messages at 02:22:59. A falls off the network at 02:23:00, has a ping sent at 02:24:29, and has a ping timeout at 02:25:29. B falls off the network at 02:24:28, has a ping sent at 02:24:29, and has a ping timeout at 02:25:29. Simultaneous disconnects despite over a minute of difference in the network interruption.

Where are we on XChat security?

2025-10-21 Matthew Garrett

Post Syndicated from Matthew Garrett original https://mjg59.dreamwidth.org/73625.html

AWS had an outage today and Signal was unavailable for some users for a while. This has confused some people, including Elon Musk, who are concerned that having a dependency on AWS means that Signal could somehow be compromised by anyone with sufficient influence over AWS (it can’t). Which means we’re back to the richest man in the world recommending his own “X Chat”, saying: “The messages are fully encrypted with no advertising hooks or strange ‘AWS dependencies’ such that I can’t read your messages even if someone put a gun to my head.”

Elon is either uninformed about his own product, lying, or both.

As I wrote back in June, X Chat is genuinely end-to-end encrypted, but ownership of the keys is complicated. The encryption key is stored using the Juicebox protocol, sharded between multiple backends. Two of these are asserted to be HSM backed – a discussion of the commissioning ceremony was recently posted here. I have not watched the almost 7 hours of video to verify that this was performed correctly, and I also haven’t been able to verify that the public keys included in the post were the keys generated during the ceremony, although that may be down to me just not finding the appropriate point in the video (sorry, Twitter’s video hosting doesn’t appear to have any skip feature and would frequently just sit spinning if I tried to seek too far and I should probably just download them and figure it out but I’m not doing that now). With enough effort it would probably also have been possible to fake the entire thing – I have no reason to believe that this has happened, but it’s not externally verifiable.

But let’s assume these published public keys are legitimately the ones used in the HSM Juicebox realms[1] and that everything was done correctly. Does that prevent Elon from obtaining your key and decrypting your messages? No.

On startup, the X Chat client makes an API call called GetPublicKeysResult, and the public keys of the realms are returned. Right now when I make that call I get the public keys listed above, so there’s at least some indication that I’m going to be communicating with actual HSMs. But what if that API call returned different keys? Could Elon stick a proxy in front of the HSMs and grab a cleartext portion of the key shards? Yes, he absolutely could, and then he’d be able to decrypt your messages.
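The by-hand comparison described above (do the keys the API hands back match the published ones?) looks roughly like this. Everything in it is hypothetical: the real response format isn't described here, so the sketch just assumes each realm key can be reduced to a hex string.

```python
def unexpected_realm_keys(returned: list[str], published: set[str]) -> list[str]:
    """List keys from the API response that aren't in the published set.

    Both arguments are assumed to be hex-encoded public keys; the actual
    encoding used by the service may well differ.
    """
    def normalise(key: str) -> str:
        return key.strip().lower()

    known = {normalise(key) for key in published}
    return [key for key in returned if normalise(key) not in known]


# Made-up values, purely to show the shape of the check.
published = {"aa11", "bb22", "cc33"}
print(unexpected_realm_keys(["AA11", "bb22", "dd44"], published))  # ['dd44']
```

The catch is that this only helps someone who runs the comparison themselves against keys obtained out of band. A client that simply uses whatever the call returns gets no protection from it.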

(I will accept that there is a plausible argument that Elon is telling the truth in that even if you held a gun to his head he’s not smart enough to be able to do this himself, but that’d be true even if there were no security whatsoever, so it still says nothing about the security of his product)

The solution to this is remote attestation – a process where the device you’re speaking to proves its identity to you. In theory the endpoint could attest that it’s an HSM running this specific code, and we could look at the Juicebox repo and verify that it’s that code and hasn’t been tampered with, and then we’d know that our communication channel was secure. Elon hasn’t done that, despite it being table stakes for this sort of thing (Signal uses remote attestation to verify the enclave code used for private contact discovery, for instance, which ensures that the client will refuse to hand over any data until it’s verified the identity and state of the enclave). There’s no excuse whatsoever to build a new end-to-end encrypted messenger which relies on a network service for security without providing a trustworthy mechanism to verify you’re speaking to the real service.

We know how to do this properly. We have done for years. Launching without it is unforgivable.

[1] There are three Juicebox realms overall, one of which doesn’t appear to use HSMs, but you need at least two in order to obtain the key so at least part of the key will always be held in HSMs

Investigating a forged PDF

2025-09-25 Matthew Garrett

Post Syndicated from Matthew Garrett original https://mjg59.dreamwidth.org/73317.html

I had to rent a house for a couple of months recently, which is long enough in California that it pushes you into proper tenant protection law. As landlords tend to do, they failed to return my security deposit within the 21 days required by law, having already failed to provide the required notification that I was entitled to an inspection before moving out. Cue some tedious argumentation with the letting agency, and eventually me threatening to take them to small claims court.

This post is not about that.

Now, under Californian law, the onus is on the landlord to hold and return the security deposit – the agency has no role in this. The only reason I was talking to them is that my lease didn’t mention the name or address of the landlord (another legal violation, but the outcome is just that you get to serve the landlord via the agency). So it was a bit surprising when I received an email from the owner of the agency informing me that they did not hold the deposit and so were not liable – I already knew this.

The odd bit about this, though, is that they sent me another copy of the contract, asserting that it made it clear that the landlord held the deposit. I read it, and instead found a clause reading “SECURITY: The security deposit will secure the performance of Tenant’s obligations. IER may, but will not be obligated to, apply all portions of said deposit on account of Tenant’s obligations. Any balance remaining upon termination will be returned to Tenant. Tenant will not have the right to apply the security deposit in payment of the last month’s rent. Security deposit held at IER Trust Account.”, where IER is International Executive Rentals, the agency in question. Why send me a contract that says you hold the money while you’re telling me you don’t? And then I read further down and found this:

Ok, fair enough, there’s an addendum that says the landlord has it (I’ve removed the landlord’s name, it’s present in the original).

Except. I had no recollection of that addendum. I went back to the copy of the contract I had and discovered:

[Image: the same text as the previous picture, but addendum 1 is empty]

Huh! But obviously I could just have edited that to remove it (there’s no obvious reason for me to, but whatever), and then it’d be my word against theirs. However, I’d been sent the document via RightSignature, an online document signing platform, and they’d added a certification page that looked like this:

[Image: a Signature Certificate, containing a bunch of data about the document including a checksum of the original]

Interestingly, the certificate page was identical in both documents, including the checksums, despite the content being different. So, how do I show which one is legitimate? You’d think given this certificate page this would be trivial, but RightSignature provides no documented mechanism whatsoever for anyone to verify any of the fields in the certificate, which is annoying but let’s see what we can do anyway.

First up, let’s look at the PDF metadata. pdftk has a dump_data command that dumps the metadata in the document, including the creation date and the modification date. My file had both set to identical timestamps in June, both listed in UTC, corresponding to the time I’d signed the document. The file containing the addendum? The same creation time, but a modification time of this Monday, shortly before it was sent to me. This time, the modification timestamp was in Pacific Daylight Time, the timezone currently observed in California. In addition, the data included two ID fields, ID0 and ID1. In my document both were identical, in the one with the addendum ID0 matched mine but ID1 was different.
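The timezone detail is visible because PDF date strings carry their own UTC offset, in the form D:YYYYMMDDHHmmSS followed by Z or something like -07'00'. A minimal parser, assuming the full-length form of the string (shorter forms are legal but not handled here), with made-up timestamps in the usage lines:

```python
import re
from datetime import datetime, timedelta, timezone

_PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})'?)?"
)


def parse_pdf_date(value: str) -> datetime:
    """Turn a full-length PDF date string into a timezone-aware datetime."""
    match = _PDF_DATE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unsupported PDF date: {value!r}")
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    _zulu, sign, off_hours, off_minutes = match.groups()[6:]
    if sign:
        offset = timedelta(hours=int(off_hours), minutes=int(off_minutes))
        tz = timezone(-offset if sign == "-" else offset)
    else:
        # "Z", or no offset at all; treating the latter as UTC is an assumption.
        tz = timezone.utc
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


# Illustrative values only, not the timestamps from the real documents.
print(parse_pdf_date("D:20250610180000Z").utcoffset())        # 0:00:00
print(parse_pdf_date("D:20250922103000-07'00'").utcoffset())  # -1 day, 17:00:00
```

An offset of -07:00 is what you'd expect from software running with a Pacific Daylight Time clock, whereas a server-side signing platform stamping everything in UTC would produce Z or +00'00'.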

These ID tags are intended to be some form of representation (such as a hash) of the document. ID0 is set when the document is created and should not be modified afterwards – ID1 is initially identical to ID0, but changes when the document is modified. This is intended to allow tooling to identify whether two documents are modified versions of the same document. The identical ID0 indicated that the document with the addendum was originally identical to mine, and the different ID1 that it had been modified.
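The same comparison can be made without pdftk, since the two identifiers live in an /ID array in the PDF trailer. This is a rough sketch that only handles the common hex-string encoding and takes the last /ID it finds (incremental updates append new trailers); a real PDF library would be more robust. The helper names and the sample bytes are invented.

```python
import re

_ID_ARRAY = re.compile(
    rb"/ID\s*\[\s*<([0-9A-Fa-f\s]*)>\s*<([0-9A-Fa-f\s]*)>\s*\]"
)


def pdf_ids(data: bytes) -> tuple[str, str]:
    """Return (ID0, ID1) as lowercase hex from raw PDF bytes."""
    matches = _ID_ARRAY.findall(data)
    if not matches:
        raise ValueError("no hex-encoded /ID array found")
    first, second = matches[-1]  # the most recently written trailer wins

    def clean(raw: bytes) -> str:
        return b"".join(raw.split()).decode("ascii").lower()

    return clean(first), clean(second)


def relationship(a: tuple[str, str], b: tuple[str, str]) -> str:
    """Describe how two documents relate, going by their ID pairs."""
    if a[0] != b[0]:
        return "different documents"
    if a[1] == b[1]:
        return "same document, same revision"
    return "same document, modified since"


# Synthetic trailers standing in for the two contracts.
mine = b"trailer << /Size 10 /ID [<AABB01> <AABB01>] >>"
theirs = b"trailer << /Size 10 /ID [<AABB01> <CCDD02>] >>"
print(relationship(pdf_ids(mine), pdf_ids(theirs)))  # same document, modified since
```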

Well, ok, that seems like a pretty strong demonstration. I had the “I have a very particular set of skills” conversation with the agency and pointed these facts out, that they were an extremely strong indication that my copy was authentic and their one wasn’t, and they responded that the document was “re-sealed” every time it was downloaded from RightSignature and that would explain the modifications. This doesn’t seem plausible, but it’s an argument. Let’s go further.

My next move was pdfalyzer, which allows you to pull a PDF apart into its component pieces. This revealed that the documents were identical, other than page 3, the one with the addendum. This page included tags entitled “touchUp_TextEdit”, evidence that the page had been modified using Acrobat. But in itself, that doesn’t prove anything – obviously it had been edited at some point to insert the landlord’s name, it doesn’t prove whether it happened before or after the signing.

But in the process of editing, Acrobat appeared to have renamed all the font references on that page into a different format. Every other page had a consistent naming scheme for the fonts, and they matched the scheme in the page 3 I had. Again, that doesn’t tell us whether the renaming happened before or after the signing. Or does it?

You see, when I completed my signing, RightSignature inserted my name into the document, and did so using a font that wasn’t otherwise present in the document (Courier, in this case). That font was named identically throughout the document, except on page 3, where it was named in the same manner as every other font that Acrobat had renamed. Given the font wasn’t present in the document until after I’d signed it, this is proof that the page was edited after signing.

But eh this is all very convoluted. Surely there’s an easier way? Thankfully yes, although I hate it. RightSignature had sent me a link to view my signed copy of the document. When I went there it presented it to me as the original PDF with my signature overlaid on top. Hitting F12 gave me the network tab, and I could see a reference to a base.pdf. Downloading that gave me the original PDF, pre-signature. Running sha256sum on it gave me an identical hash to the “Original checksum” field. Needless to say, it did not contain the addendum.
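For anyone without sha256sum to hand, the same check is a few lines of Python. The file name in the usage comment is whatever you saved the download as, and the function names are just for illustration.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large PDFs don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_certificate(path: str, original_checksum: str) -> bool:
    """Compare a file's hash against the value printed on the certificate page."""
    return sha256_of(path) == original_checksum.strip().lower()


# Usage, with the checksum copied from the certificate page:
#   matches_certificate("base.pdf", "<Original checksum value>")
```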

Why do this? The only explanation I can come up with (and I am obviously guessing here, I may be incorrect!) is that International Executive Rentals realised that they’d sent me a contract which could mean that they were liable for the return of my deposit, even though they’d already given it to my landlord, and after realising this added the addendum, sent it to me, and assumed that I just wouldn’t notice (or that, if I did, I wouldn’t be able to prove anything). In the process they went from an extremely unlikely possibility of having civil liability for a few thousand dollars (even if they were holding the deposit it’s still the landlord’s legal duty to return it, as far as I can tell) to doing something that looks extremely like forgery.

There’s a hilarious followup. After this happened, the agency offered to do a screenshare with me showing them logging into RightSignature and showing the signed file with the addendum, and then proceeded to do so. One minor problem – the “Send for signature” button was still there, just below a field saying “Uploaded: 09/22/25”. I asked them to search for my name, and it popped up two hits – one marked draft, one marked completed. The one marked completed? Didn’t contain the addendum.

Cordoomceps – replacing an Amiga’s brain with Doom

2025-08-05 Matthew Garrett

Post Syndicated from Matthew Garrett original https://mjg59.dreamwidth.org/73001.html

There’s a lovely device called a pistorm, an adapter board that glues a Raspberry Pi GPIO bus to a Motorola 68000 bus. The intended use case is that you plug it into a 68000 device and then run an emulator that reads instructions from hardware (ROM or RAM) and emulates them. You’re still limited by the ~7MHz bus that the hardware is running at, but you can run the instructions as fast as you want.

These days you’re supposed to run a custom built OS on the Pi that just does 68000 emulation, but initially it ran Linux on the Pi and a userland 68000 emulator process. And, well, that got me thinking. The emulator takes 68000 instructions, emulates them, and then talks to the hardware to implement the effects of those instructions. What if we, well, just don’t? What if we just run all of our code in Linux on an ARM core and then talk to the Amiga hardware?

We’re going to ignore x86 here, because it’s weird – but most hardware that wants software to be able to communicate with it maps itself into the same address space that RAM is in. You can write to a byte of RAM, or you can write to a piece of hardware that’s effectively pretending to be RAM[1]. The Amiga wasn’t unusual in this respect in the 80s, and to talk to the graphics hardware you speak to a special address range that gets sent to that hardware instead of to RAM. The CPU knows nothing about this. It just indicates it wants to write to an address, and then sends the data.

So, if we are the CPU, we can just indicate that we want to write to an address, and provide the data. And those addresses can correspond to the hardware. So, we can write to the RAM that belongs to the Amiga, and we can write to the hardware that isn’t RAM but pretends to be. And that means we can run whatever we want on the Pi and then access Amiga hardware.

And, obviously, the thing we want to run is Doom, because that’s what everyone runs in fucked up hardware situations.

Doom was Amiga kryptonite. Its entire graphical model was based on memory directly representing the contents of your display, and being able to modify that by just moving pixels around. This worked because at the time VGA displays supported having a memory layout where each pixel on your screen was represented by a byte in memory containing an 8 bit value that corresponded to a lookup table containing the RGB value for that pixel.

The Amiga was, well, not good at this. Back in the 80s, when the Amiga hardware was developed, memory was expensive. Dedicating that much RAM to the video hardware was unthinkable – the Amiga 1000 initially shipped with only 256K of RAM, and you could fill all of that with a sufficiently colourful picture. So instead of having the idea of each pixel being associated with a specific area of memory, the Amiga used bitmaps. A bitmap is an area of memory that represents the screen, but only represents one bit of the colour depth. If you have a black and white display, you only need one bitmap. If you want to display four colours, you need two. More colours, more bitmaps. And each bitmap is stored in an independent area of RAM. You never use more memory than you need to display the number of colours you want to.
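The memory arithmetic is easy to sketch. The 320x200 resolution below is an assumption picked for illustration (it's Doom's native resolution), and the function names are made up.

```python
def planes_needed(colours: int) -> int:
    """Number of one-bit-per-pixel planes needed for this many colours."""
    return max(1, (colours - 1).bit_length())


def planar_bytes(width: int, height: int, colours: int) -> int:
    """Display memory for a planar screen: one bit per pixel per plane."""
    return (width // 8) * height * planes_needed(colours)


def chunky_bytes(width: int, height: int) -> int:
    """Display memory for a VGA-style one-byte-per-pixel screen."""
    return width * height


print(planes_needed(2), planes_needed(4), planes_needed(64))  # 1 2 6
print(planar_bytes(320, 200, 2))    # 8000
print(planar_bytes(320, 200, 64))   # 48000
print(chunky_bytes(320, 200))       # 64000
```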

But that means that each bitplane contains packed information – every byte of data in a bitplane contains the bit value for 8 different pixels, because each bitplane contains one bit of information per pixel. To update one pixel on screen, you need to read from every bitmap, update one bit, and write it back, and that’s a lot of additional memory accesses. Doom, but on the Amiga, was slow not just because the CPU was slow, but because there was a lot of manipulation of data to turn it into the format the Amiga wanted and then push that over a fairly slow memory bus to have it displayed.

The CDTV was an aesthetically pleasing piece of hardware that absolutely sucked. It was an Amiga 500 in a hi-fi box with a caddy-loading CD drive, and it ran software that was just awful. There’s no path to remediation here. No compelling apps were ever released. It’s a terrible device. I love it. I bought one in 1996 because a local computer store had one and I pointed out that the company selling it had gone bankrupt some years earlier and literally nobody in my farming town was ever going to have any interest in buying a CD player that made a whirring noise when you turned it on because it had a fan and eventually they just sold it to me for not much money, and ever since then I wanted to have a CD player that ran Linux and well spoiler 30 years later I’m nearly there. That CDTV is going to be our test subject. We’re going to try to get Doom running on it without executing any 68000 instructions.

We’re facing two main problems here. The first is that all Amigas have a firmware ROM called Kickstart that runs at powerup. No matter how little you care about using any OS functionality, you can’t start running your code until Kickstart has run. This means even documentation describing bare metal Amiga programming assumes that the hardware is already in the state that Kickstart left it in. This will become important later. The second is that we’re going to need to actually write the code to use the Amiga hardware.

First, let’s talk about Amiga graphics. We’ve already covered bitmaps, but for anyone used to modern hardware that’s not the weirdest thing about what we’re dealing with here. The CDTV’s chipset supports a maximum of 64 colours in a mode called “Extra Half-Brite”, or EHB, where you have 32 colours arbitrarily chosen from a palette and then 32 more colours that are identical but with half the intensity. For 64 colours we need 6 bitplanes, each of which can be located arbitrarily in the region of RAM accessible to the chipset (“chip RAM”, distinguished from “fast RAM” that’s only accessible to the CPU). We tell the chipset where our bitplanes are and it displays them. Or, well, it does for a frame – after that the registers that pointed at our bitplanes no longer do, because when the hardware was DMAing through the bitplanes to display them it was incrementing those registers to point at the next address to DMA from. Which means that every frame we need to set those registers back.
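To make the EHB layout concrete, here's a sketch that expands 32 base colours into the 64 the display can show. It assumes the 12-bit 0xRGB colour format (4 bits per channel) used by this generation of the chipset, with each half-brite entry being the base colour with every channel shifted right by one. The function name is made up.

```python
def ehb_palette(base: list[int]) -> list[int]:
    """Expand 32 12-bit 0xRGB colours into the 64-entry EHB palette.

    Entries 32-63 are the half-intensity versions of entries 0-31.
    """
    if len(base) != 32:
        raise ValueError("EHB needs exactly 32 base colours")
    # Shifting the whole word right by one halves each 4-bit channel;
    # the mask drops the bit that leaked in from the neighbouring channel.
    half = [(colour >> 1) & 0x777 for colour in base]
    return list(base) + half


palette = ehb_palette([0xFFF, 0x846] + [0x000] * 30)
print(hex(palette[32]), hex(palette[33]))  # 0x777 0x423
```

Only the first 32 entries are freely chosen, which is why mapping Doom's 256 colour palette onto this is an approximation problem and not a simple truncation.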

Making sure you have code that’s called every frame just to make your graphics work sounds intensely irritating, so Commodore gave us a way to avoid doing that. The chipset includes a coprocessor called “copper”. Copper doesn’t have a large set of features – in fact, it only has three. The first is that it can program chipset registers. The second is that it can wait for a specific point in screen scanout. The third (which we don’t care about here) is that it can optionally skip an instruction if a certain point in screen scanout has already been reached. We can write a program (a “copper list”) for the copper that tells it to program the chipset registers with the locations of our bitplanes and then wait until the end of the frame, at which point it will repeat the process. Now our bitplane pointers are always valid at the start of a frame.
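A copper list is just a sequence of 16-bit word pairs, so building one is straightforward. The sketch below assumes the standard encoding from the Amiga hardware documentation: a MOVE is the register offset followed by the value, the bitplane pointer registers start at offset 0x0E0 (high word, then low word, four bytes per plane), and the list ends with the conventional wait for a position that never arrives (0xFFFF, 0xFFFE). The function names and the addresses in the usage lines are made up.

```python
BPL1PTH = 0x0E0  # offset of the first bitplane pointer (high word)


def copper_list(plane_addrs: list[int]) -> list[int]:
    """Build copper instructions that reload every bitplane pointer."""
    words: list[int] = []
    for index, addr in enumerate(plane_addrs):
        reg = BPL1PTH + 4 * index
        words += [reg, (addr >> 16) & 0xFFFF]  # MOVE high word of address
        words += [reg + 2, addr & 0xFFFF]      # MOVE low word of address
    words += [0xFFFF, 0xFFFE]                  # WAIT for the end of the list
    return words


def as_bytes(words: list[int]) -> bytes:
    """Serialise as big-endian 16-bit words, ready to copy into chip RAM."""
    return b"".join(word.to_bytes(2, "big") for word in words)


# Six planes for EHB, 8000 bytes each, at an arbitrary example address.
planes = [0x10000 + 8000 * i for i in range(6)]
print([hex(w) for w in copper_list(planes)[:4]])  # ['0xe0', '0x1', '0xe2', '0x0']
print(len(as_bytes(copper_list(planes))))         # 52
```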

Ok! We know how to display stuff. Now we just need to deal with not having 256 colours, and the whole “Doom expects pixels” thing. For the first of these, I stole code from ADoom, the only Amiga doom port I could easily find source for. This looks at the 256 colour palette loaded by Doom and calculates the closest approximation it can within the constraints of EHB. ADoom also includes a bunch of CPU-specific assembly optimisation for converting the “chunky” Doom graphic buffer into the “planar” Amiga bitplanes, none of which I used because (a) it’s all for 68000 series CPUs and we’re running on ARM, and (b) I have a quad core CPU running at 1.4GHz and I’m going to be pushing all the graphics over a 7.14MHz bus, the graphics mode conversion is not going to be the bottleneck here. Instead I just wrote a series of nested for loops that iterate through each pixel and update each bitplane and called it a day. The set of bitplanes I’m operating on here is allocated on the Linux side so I can read and write to them without being restricted by the speed of the Amiga bus (remember, each byte in each bitplane is going to be updated 8 times per frame, because it holds bits associated with 8 pixels), and then copied over to the Amiga’s RAM once the frame is complete.
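The nested-loop conversion is about as simple as it sounds. This is a Python sketch of the idea, not the code actually running on the Pi, and it assumes the usual Amiga conventions: plane 0 holds the least significant bit of each pixel's colour index, and the leftmost pixel of each group of 8 is the most significant bit of its byte.

```python
def chunky_to_planar(pixels: bytes, width: int, height: int,
                     depth: int) -> list[bytearray]:
    """Convert one-byte-per-pixel data into `depth` separate bitplanes."""
    row_bytes = width // 8  # width is assumed to be a multiple of 8
    planes = [bytearray(row_bytes * height) for _ in range(depth)]
    for y in range(height):
        for x in range(width):
            value = pixels[y * width + x]
            byte_index = y * row_bytes + x // 8
            mask = 0x80 >> (x % 8)  # leftmost pixel lives in the top bit
            for plane in range(depth):
                if value & (1 << plane):
                    planes[plane][byte_index] |= mask
    return planes


# An 8x1 image: colour 1 on the far left, colour 2 on the far right.
result = chunky_to_planar(bytes([1, 0, 0, 0, 0, 0, 0, 2]), 8, 1, 2)
print(result[0].hex(), result[1].hex())  # 80 01
```

Every byte of every plane gets touched once per pixel it covers, which is exactly why doing this in local memory and copying the finished planes across the slow bus afterwards makes sense.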

And, kind of astonishingly, this works! Once I’d figured out where I was going wrong with RGB ordering and which order the bitplanes go in, I had a recognisable copy of Doom running. Unfortunately there were weird graphical glitches – sometimes blocks would be entirely the wrong colour. It took me a while to figure out what was going on and then I felt stupid. Recording the screen and watching in slow motion revealed that the glitches often showed parts of two frames displaying at once. The Amiga hardware is taking responsibility for scanning out the frames, and the code on the Linux side isn’t synchronised with it at all. That means I could update the bitplanes while the Amiga was scanning them out, resulting in a mashup of planes from two different Doom frames being used as one Amiga frame. One approach to avoid this would be to tie the Doom event loop to the Amiga, blocking my writes until the end of scanout. The other is to use double-buffering – have two sets of bitplanes, one being displayed and the other being written to. This consumes more RAM but since I’m not using the Amiga RAM for anything else that’s not a problem. With this approach I have two copper lists, one for each set of bitplanes, and switch between them on each frame. This improved things a lot but not entirely, and there’s still glitches when the palette is being updated (because there’s only one set of colour registers), something Doom does rather a lot, so I’m going to need to implement proper synchronisation.

Except. This was only working if I ran a 68K emulator first in order to run Kickstart. If I tried accessing the hardware without doing that, things were in a weird state. I could update the colour registers, but accessing RAM didn’t work – I could read stuff out, but anything I wrote vanished. Some more digging cleared that up. When you turn on a CPU it needs to start executing code from somewhere. On modern x86 systems it starts from a hardcoded address of 0xFFFFFFF0, which was traditionally a long way from any RAM. The 68000 family instead reads its start address from address 0x00000004, which overlaps with where the Amiga chip RAM is. We can’t write anything to RAM until we’re executing code, and we can’t execute code until we tell the CPU where the code is, which seems like a problem. This is solved on the Amiga by powering up in a state where the Kickstart ROM is “overlayed” onto address 0. The CPU reads the start address from the ROM, which causes it to jump into the ROM and start executing code there. Early on, the code tells the hardware to stop overlaying the ROM onto the low addresses, and now the RAM is available. This is poorly documented because it’s not something you need to care about if you execute Kickstart, which every actual Amiga does, and I’m only in this position because I’ve made poor life choices, but ok that explained things. To turn off the overlay you write to a register in one of the Complex Interface Adaptor (CIA) chips, and things start working like you’d expect.
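A toy model makes the symptom easier to picture. Nothing here talks to real hardware: the class, the sizes and the ROM contents are all invented, and the overlay is reduced to a single flag. The one hardware fact it leans on is that a 68000 fetches its initial stack pointer from address 0 and its initial program counter from address 4, both as big-endian 32-bit values.

```python
class ToyBus:
    """Low memory with a ROM overlay that can be switched off."""

    def __init__(self, rom: bytes, ram_size: int = 1024):
        self.rom = rom
        self.ram = bytearray(ram_size)
        self.overlay = True  # the power-on state

    def _hits_rom(self, addr: int) -> bool:
        return self.overlay and addr < len(self.rom)

    def read8(self, addr: int) -> int:
        return self.rom[addr] if self._hits_rom(addr) else self.ram[addr]

    def write8(self, addr: int, value: int) -> None:
        if self._hits_rom(addr):
            return  # the write lands on ROM and silently vanishes
        self.ram[addr] = value

    def read32(self, addr: int) -> int:
        return int.from_bytes(bytes(self.read8(addr + i) for i in range(4)), "big")


def reset_pc(bus: ToyBus) -> int:
    """The address a 68000 would start executing from after reset."""
    return bus.read32(4)


# Made-up ROM header: initial stack pointer, then initial program counter.
rom = (0x00001000).to_bytes(4, "big") + (0x00F800D2).to_bytes(4, "big")
bus = ToyBus(rom)
print(hex(reset_pc(bus)))        # 0xf800d2
bus.write8(0, 0xAA)
print(bus.read8(0) == rom[0])    # True: the write vanished
bus.overlay = False
bus.write8(0, 0xAA)
print(hex(bus.read8(0)))         # 0xaa
```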

Except, they don’t. Writing to that register did nothing for me. I assumed that there was some other register I needed to write to first, and went to the extent of tracing every register access that occurred when running the emulator and replaying those in my code. Nope, still broken. What I finally discovered is that you need to pulse the reset line on the board before some of the hardware starts working – powering it up doesn’t put you in a well defined state, but resetting it does.

So, I now have a slightly graphically glitchy copy of Doom running without any sound, displaying on an Amiga whose brain has been replaced with a parasitic Linux. Further updates will likely make things even worse. Code is, of course, available.

[1] This is why we had trouble with late era 32 bit systems and 4GB of RAM – a bunch of your hardware wanted to be in the same address space and so you couldn’t put RAM there so you ended up with less than 4GB of RAM

Secure boot certificate rollover is real but probably won’t hurt you

2025-07-31 Matthew Garrett

Post Syndicated from Matthew Garrett original https://mjg59.dreamwidth.org/72892.html

LWN wrote an article which opens with the assertion “Linux users who have Secure Boot enabled on their systems knowingly or unknowingly rely on a key from Microsoft that is set to expire in September”. This is, depending on interpretation, either misleading or just plain wrong, but also there’s not a good source of truth here, so.

First, how does secure boot signing work? Every system that supports UEFI secure boot ships with a set of trusted certificates in a database called “db”. Any binary signed with a chain of certificates that chains to a root in db is trusted, unless either the binary (via hash) or an intermediate certificate is added to “dbx”, a separate database of things whose trust has been revoked[1]. But, in general, the firmware doesn’t care about the intermediate or the number of intermediates or whatever – as long as there’s a valid chain back to a certificate that’s in db, it’s going to be happy.
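As a sketch, the decision looks something like this. It's a deliberately simplified model of the rules in the paragraph above, not firmware code: certificates are represented by name only, actual signature verification is assumed to have already happened, and the function and parameter names are made up.

```python
def is_trusted(binary_hash: str, chain: list[str], db: set[str],
               dbx_hashes: set[str], dbx_certs: set[str]) -> bool:
    """Decide whether firmware would run a binary.

    `chain` lists certificate names from the signing cert up to the root.
    """
    if binary_hash in dbx_hashes:
        return False  # this specific binary has been revoked
    if any(cert in dbx_certs for cert in chain):
        return False  # a certificate in the chain has been revoked
    # Intermediates don't matter, as long as something in the chain is in db.
    return any(cert in db for cert in chain)


db = {"Microsoft Windows Production PCA 2011",
      "Microsoft Corporation UEFI CA 2011"}
chain = ["Some intermediate", "Microsoft Corporation UEFI CA 2011"]

print(is_trusted("hash-of-shim", chain, db, set(), set()))                  # True
print(is_trusted("hash-of-shim", chain, db, {"hash-of-shim"}, set()))       # False
print(is_trusted("hash-of-shim", chain, db, set(), {"Some intermediate"}))  # False
print(is_trusted("hash-of-shim", ["Unknown root"], db, set(), set()))       # False
```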

That’s the conceptual version. What about the real world one? Most x86 systems that implement UEFI secure boot have at least two root certificates in db – one called “Microsoft Windows Production PCA 2011”, and one called “Microsoft Corporation UEFI CA 2011”. The former is the root of a chain used to sign the Windows bootloader, and the latter is the root used to sign, well, everything else.

What is “everything else”? For people in the Linux ecosystem, the most obvious thing is the Shim bootloader that’s used to bridge between the Microsoft root of trust and a given Linux distribution’s root of trust[2]. But that’s not the only third party code executed in the UEFI environment. Graphics cards, network cards, RAID and iSCSI cards and so on all tend to have their own unique initialisation process, and need board-specific drivers. Even if you added support for everything on the market to your system firmware, a system built last year wouldn’t know how to drive a graphics card released this year. Cards need to provide their own drivers, and these drivers are stored in flash on the card so they can be updated. But since UEFI doesn’t have any sandboxing environment, those drivers could do pretty much anything they wanted to. Someone could compromise the UEFI secure boot chain by just plugging in a card with a malicious driver on it, and have that hotpatch the bootloader and introduce a backdoor into your kernel.

This is avoided by enforcing secure boot for these drivers as well. Every plug-in card that carries its own driver has it signed by Microsoft, and up until now that’s been a certificate chain going back to the same “Microsoft Corporation UEFI CA 2011” certificate used in signing Shim. This is important for reasons we’ll get to.

The “Microsoft Windows Production PCA 2011” certificate expires in October 2026, and the “Microsoft Corporation UEFI CA 2011” one in June 2026. These dates are not that far in the future! Most of you have probably at some point tried to visit a website and got an error message telling you that the site’s certificate had expired and that it’s no longer trusted, and so it’s natural to assume that the outcome of time’s arrow marching past those expiry dates would be that systems will stop booting. Thankfully, that’s not what’s going to happen.

First up: if you grab a copy of the Shim currently shipped in Fedora and extract the certificates from it, you’ll learn it’s not directly signed with the “Microsoft Corporation UEFI CA 2011” certificate. Instead, it’s signed with a “Microsoft Windows UEFI Driver Publisher” certificate that chains to the “Microsoft Corporation UEFI CA 2011” certificate. That’s not unusual, intermediates are commonly used and rotated. But if we look more closely at that certificate, we learn that it was issued in 2023 and expired in 2024. Older versions of Shim were signed with older intermediates. A very large number of Linux systems are already booting binaries signed with certificates that have expired, and yet things keep working. Why?

Let’s talk about time. In the ways we care about in this discussion, time is a social construct rather than a meaningful reality. There’s no way for a computer to observe the state of the universe and know what time it is – it needs to be told. It has no idea whether that time is accurate or an elaborate fiction, and so it can’t with any degree of certainty declare that a certificate is valid from an external frame of reference. The failure modes of getting this wrong are also extremely bad! If a system has a GPU that relies on an option ROM, and if you stop trusting the option ROM because either its certificate has genuinely expired or because your clock is wrong, you can’t display any graphical output[3] and the user can’t fix the clock and, well, crap.

The upshot is that nobody actually enforces these expiry dates – here’s the reference code that disables it. In a year’s time we’ll have gone past the expiration date for “Microsoft Windows UEFI Driver Publisher” and everything will still be working, and a few months later “Microsoft Windows Production PCA 2011” will also expire and systems will keep booting Windows despite being signed with a now-expired certificate. This isn’t a Y2K scenario where everything keeps working because people have done a huge amount of work – it’s a situation where everything keeps working even if nobody does any work.
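To make the difference from the web case concrete, here is a small Python sketch contrasting a verifier that enforces expiry with one that ignores it. This is a toy, not the reference firmware code. The function name is invented for the example, and the two dates are placeholders: the text above only gives "2024" and "June 2026", not exact days.

```python
from datetime import datetime, timezone

# Placeholder dates: only the year (2024) and month (June 2026) come from the text.
INTERMEDIATE_NOT_AFTER = datetime(2024, 12, 31, tzinfo=timezone.utc)
ROOT_NOT_AFTER = datetime(2026, 6, 30, tzinfo=timezone.utc)


def chain_acceptable(not_after_dates, now, enforce_expiry):
    """Signature checks are assumed to have passed already; this only
    models what happens to the expiry dates."""
    if not enforce_expiry:
        # Firmware-style: the clock is untrusted, so expiry is ignored.
        return True
    # Browser-style: every certificate in the chain must still be in date.
    return all(now <= not_after for not_after in not_after_dates)


chain = [INTERMEDIATE_NOT_AFTER, ROOT_NOT_AFTER]
now = datetime(2025, 7, 1, tzinfo=timezone.utc)

print("browser-style:", chain_acceptable(chain, now, enforce_expiry=True))
print("firmware-style:", chain_acceptable(chain, now, enforce_expiry=False))
```

The browser-style check also changes its answer whenever the clock is wrong, which is exactly the failure mode that makes it a bad idea for something that has to run before the user can fix the clock.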

So, uh, what’s the story here? Why is there any engineering effort going on at all? What’s all this talk of new certificates? Why are there sensationalist pieces about how Linux is going to stop working on old computers or new computers or maybe all computers?

Microsoft will shortly start signing things with a new certificate that chains to a new root, and most systems don’t trust that new root. System vendors are supplying updates[4] to their systems to add the new root to the set of trusted keys, and Microsoft has supplied a fallback that can be applied to all systems even without vendor support[5]. If something is signed purely with the new certificate then it won’t boot on something that only trusts the old certificate (which shouldn’t be a realistic scenario due to the above), but if something is signed purely with the old certificate then it won’t boot on something that only trusts the new certificate.

How meaningful a risk is this? We don’t have an explicit statement from Microsoft as yet as to what’s going to happen here, but we expect that there’ll be at least a period of time where Microsoft signs binaries with both the old and the new certificate, and in that case those objects should work just fine on both old and new computers. The problem arises if Microsoft stops signing things with the old certificate, at which point new releases will stop booting on systems that don’t trust the new key (which, again, shouldn’t happen). But even if that does turn out to be a problem, nothing is going to force Linux distributions to stop using existing Shims signed with the old certificate, and having a Shim signed with an old certificate does nothing to stop distributions signing new versions of grub and kernels. In an ideal world we have no reason to ever update Shim[6] and so we just keep on shipping one signed with two certs.

If there’s a point in the future where Microsoft only signs with the new key, and if we were to somehow end up in a world where systems only trust the old key and not the new key[7], then those systems wouldn’t boot with new graphics cards, wouldn’t be able to run new versions of Windows, wouldn’t be able to run any Linux distros that ship with a Shim signed only with the new certificate. That would be bad, but we have a mechanism to avoid it. On the other hand, systems that only trust the new certificate and not the old one would refuse to boot older Linux, wouldn’t support old graphics cards, and also wouldn’t boot old versions of Windows. Nobody wants that, and for the foreseeable future we’re going to see new systems continue trusting the old certificate and old systems have updates that add the new certificate, and everything will just continue working exactly as it does now.
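The combinations in the last few paragraphs boil down to a set intersection, which a few lines of Python can enumerate. The labels `OLD` and `NEW` and the `boots` function are just names for this sketch, and it assumes a binary boots whenever at least one of its signatures chains to a root the system trusts.

```python
OLD = "old 2011 root"
NEW = "new root"


def boots(signed_with: frozenset, trusted: frozenset) -> bool:
    """A binary boots if any of its signatures chains to a trusted root."""
    return bool(signed_with & trusted)


cases = [frozenset({OLD}), frozenset({OLD, NEW}), frozenset({NEW})]

for trusted in cases:
    for signed_with in cases:
        verdict = "boots" if boots(signed_with, trusted) else "refuses"
        print(f"system trusts {sorted(trusted)}, binary signed with "
              f"{sorted(signed_with)}: {verdict}")
```

Only two of the nine combinations fail: new-only binaries on old-only systems, and old-only binaries on new-only systems. Anything dual-signed, or any system trusting both roots, keeps working.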

Conclusion: Outside some corner cases, the worst case is you might need to boot an old Linux to update your trusted keys to be able to install a new Linux, and no computer currently running Linux will break in any way whatsoever.

[1] (there’s also a separate revocation mechanism called SBAT which I wrote about here, but it’s not relevant in this scenario)

[2] Microsoft won’t sign GPLed code for reasons I think are unreasonable, so having them sign grub was a non-starter, but also the point of Shim was to allow distributions to have something that doesn’t change often and be able to sign their own bootloaders and kernels and so on without having to have Microsoft involved, which means grub and the kernel can be updated without having to ask Microsoft to sign anything and updates can be pushed without any additional delays

[3] It’s been a long time since graphics cards booted directly into a state that provided any well-defined programming interface. Even back in the 90s, cards didn’t present VGA-compatible registers until card-specific code had been executed (hence DEC Alphas having an x86 emulator in their firmware to run the driver on the card). No driver? No video output.

[4] There’s a UEFI-defined mechanism for updating the keys that doesn’t require a full firmware update, and it’ll work on all devices that use the same keys rather than being per-device

[5] Using the generic update without a vendor-specific update means it wouldn’t be possible to issue further updates for the next key rollover, or any additional revocation updates, but I’m hoping to be retired by then and I hope all these computers will also be retired by then

[6] I said this in 2012 and it turned out to be wrong then so it’s probably wrong now sorry, but at least SBAT means we can revoke vulnerable grubs without having to revoke Shim

[7] Which shouldn’t happen! There’s an update to add the new key that should work on all PCs, but there’s always the chance of firmware bugs

Why is there no consistent single signon API flow?

2025-06-24 Matthew Garrett

Post Syndicated from Matthew Garrett original https://mjg59.dreamwidth.org/72688.html

Single signon is a pretty vital part of modern enterprise security. You have users who need access to a bewildering array of services, and you want to be able to avoid the fallout of one of those services being compromised and your users having to change their passwords everywhere (because they’re clearly going to be using the same password everywhere), or you want to be able to enforce some reasonable MFA policy without needing to configure it in 300 different places, or you want to be able to disable all user access in one place when someone leaves the company, or, well, all of the above. There’s any number of providers for this, ranging from it being integrated with a more general app service platform (eg, Microsoft or Google) to a third party vendor (Okta, Ping, any number of bizarre companies). And, in general, they’ll offer a straightforward mechanism to either issue OIDC tokens or manage SAML login flows, requiring users to present whatever set of authentication mechanisms you’ve configured.

This is largely optimised for web authentication, which doesn’t seem like a huge deal – if I’m logging into Workday then being bounced to another site for auth seems entirely reasonable. The problem is when you’re trying to gate access to a non-web app, at which point consistency in login flow is usually achieved by spawning a browser and somehow managing submitting the result back to the remote server. And this makes some degree of sense – browsers are where webauthn token support tends to live, and it also ensures the user always has the same experience.

But it works poorly for CLI-based setups. There’s basically two options – you can use the device code authorisation flow, where you perform authentication on what is nominally a separate machine to the one requesting it (but in this case is actually the same) and as a result end up with a straightforward mechanism to have your users socially engineered into giving Johnny Badman a valid auth token despite webauthn nominally being unphishable (as described years ago), or you reduce that risk somewhat by spawning a local server and POSTing the token back to it – which works locally but doesn’t work well if you’re dealing with trying to auth on a remote device. The user experience for both scenarios sucks, and it reduces a bunch of the worthwhile security properties that modern MFA supposedly gives us.

There’s a third approach, which is in some ways the obviously good approach and in other ways is obviously a screaming nightmare. All the browser is doing is sending a bunch of requests to a remote service and handling the response locally. Why don’t we just do the same? Okta, for instance, has an API for auth. We just need to submit the username and password to that and see what answer comes back. This is great until you enable any kind of MFA, at which point the additional authz step is something that’s only supported via the browser. And basically everyone else is the same.

Of course, when we say “That’s only supported via the browser”, the browser is still just running some code of some form and we can figure out what it’s doing and do the same. Which is how you end up scraping constants out of Javascript embedded in the API response in order to submit that data back in the appropriate way. This is all possible but it’s incredibly annoying and fragile – the contract with the identity provider is that a browser is pointed at a URL, not that any of the internal implementation remains consistent.

I’ve done this. I’ve implemented code to scrape an identity provider’s auth responses to extract the webauthn challenges and feed those to a local security token without using a browser. I’ve also written support for forwarding those challenges over the SSH agent protocol to make this work with remote systems that aren’t running a GUI. This week I’m working on doing the same again, because every identity provider does all of this differently.

There’s no fundamental reason all of this needs to be custom. It could be a straightforward “POST username and password, receive list of UUIDs describing MFA mechanisms, define how those MFA mechanisms work”. That even gives space for custom auth factors (I’m looking at you, Okta Fastpass). But instead I’m left scraping JSON blobs out of Javascript and hoping nobody renames a field, even though I only care about extremely standard MFA mechanisms that shouldn’t differ across different identity providers.
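To show how little would be needed, here is a purely hypothetical Python sketch of that flow. None of this corresponds to any real identity provider's API: the UUIDs, the `FakeIdentityProvider` class, the `LoginResponse` shape and `pick_mechanism` are all invented for illustration, and the "server" is an in-memory stand-in.

```python
import hmac
import secrets
import uuid
from dataclasses import dataclass

# Hypothetical well-known identifiers; no such registry exists today.
WEBAUTHN = uuid.uuid5(uuid.NAMESPACE_DNS, "webauthn.mfa.example")
TOTP = uuid.uuid5(uuid.NAMESPACE_DNS, "totp.mfa.example")


@dataclass
class LoginResponse:
    session: str
    mfa_mechanisms: list  # UUIDs the client may satisfy next


class FakeIdentityProvider:
    """In-memory stand-in for an identity provider, for illustration only."""

    def __init__(self, users):
        self._users = users  # username -> (password, [mechanism UUIDs])

    def login(self, username: str, password: str) -> LoginResponse:
        stored, mechanisms = self._users.get(username, (None, []))
        if stored is None or not hmac.compare_digest(stored.encode(), password.encode()):
            raise PermissionError("bad username or password")
        return LoginResponse(session=secrets.token_hex(16),
                             mfa_mechanisms=list(mechanisms))


def pick_mechanism(offered, supported):
    """Return the first offered mechanism this client knows how to drive."""
    for mechanism in offered:
        if mechanism in supported:
            return mechanism
    return None


idp = FakeIdentityProvider({"alice": ("hunter2", [TOTP, WEBAUTHN])})
response = idp.login("alice", "hunter2")
print(pick_mechanism(response.mfa_mechanisms, supported={WEBAUTHN}))
```

The point of the UUID list is that a CLI client could recognise the standard mechanisms it supports and ignore vendor-specific ones, with no Javascript scraping involved.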

Someone, please, write a spec for this. Please don’t make it be me.

My a11y journey

2025-06-20 Matthew Garrett

Post Syndicated from Matthew Garrett original https://mjg59.dreamwidth.org/72379.html

23 years ago I was in a bad place. I’d quit my first attempt at a PhD for various reasons that were, with hindsight, bad, and I was suddenly entirely aimless. I lucked into picking up a sysadmin role back at TCM where I’d spent a summer a year before, but that’s not really what I wanted in my life. And then Hanna mentioned that her PhD supervisor was looking for someone familiar with Linux to work on making Dasher, one of the group’s research projects, more usable on Linux. I jumped.

The timing was fortuitous. Sun were pumping money and developer effort into accessibility support, and the Inference Group had just received a grant from the Gatsby Foundation that involved working with the ACE Centre to provide additional accessibility support. And I was suddenly hacking on code that was largely ignored by most developers, supporting use cases that were irrelevant to most developers. Being in a relatively green field space sounds refreshing, until you realise that you’re catering to actual humans who are potentially going to rely on your software to be able to communicate. That’s somewhat focusing.

This was, uh, something of an on the job learning experience. I had to catch up with a lot of new technologies very quickly, but that wasn’t the hard bit – what was difficult was realising I had to cater to people who were dealing with use cases that I had no experience of whatsoever. Dasher was extended to allow text entry into applications without needing to cut and paste. We added support for introspection of the current application’s UI so menus could be exposed via the Dasher interface, allowing people to fly through menu hierarchies and pop open file dialogs. Text-to-speech was incorporated so people could rapidly enter sentences and have them spoken out loud.

But what sticks with me isn’t the tech, or even the opportunities it gave me to meet other people working on the Linux desktop and forge friendships that still exist. It was the cases where I had the opportunity to work with people who could use Dasher as a tool to increase their ability to communicate with the outside world, whose lives were transformed for the better because of what we’d produced. Watching someone use your code and realising that you could write a three line patch that had a significant impact on the speed they could talk to other people is an incomparable experience. It’s been decades and in many ways that was the most impact I’ve ever had as a developer.

I left after a year to work on fruitflies and get my PhD, and my career since then hasn’t involved a lot of accessibility work. But it’s stuck with me – every improvement in that space is something that has a direct impact on the quality of life of more people than you expect, but is also something that goes almost unrecognised. The people working on accessibility are heroes. They’re making all the technology everyone else produces available to people who would otherwise be blocked from it. They deserve recognition, and they deserve a lot more support than they have.

But when we deal with technology, we deal with transitions. A lot of the Linux accessibility support depended on X11 behaviour that is now widely regarded as a set of misfeatures. It’s not actually good to be able to inject arbitrary input into an arbitrary window, and it’s not good to be able to arbitrarily scrape out its contents. X11 never had a model to permit this for accessibility tooling while blocking it for other code. Wayland does, but suffers from the surrounding infrastructure not being well developed yet. We’re seeing that happen now, though – Gnome has been performing a great deal of work in this respect, and KDE is picking that up as well. There isn’t a full correspondence between X11-based Linux accessibility support and Wayland, but for many users the Wayland accessibility infrastructure is already better than with X11.

That’s going to continue improving, and it’ll improve faster with broader support. We’ve somehow ended up with the bizarre politicisation of Wayland as being some sort of woke thing while X11 represents the Roman Empire or some such bullshit, but the reality is that there is no story for improving accessibility support under X11 and sticking to X11 is going to end up reducing the accessibility of a platform.

When you read anything about Linux accessibility, ask yourself whether you’re reading something written by either a user of the accessibility features, or a developer of them. If they’re neither, ask yourself why they actually care and what they’re doing to make the future better.

Locally hosting an internet-connected server

2025-06-17 Matthew Garrett

Post Syndicated from Matthew Garrett original https://mjg59.dreamwidth.org/72095.html

I’m lucky enough to have a weird niche ISP available to me, so I’m paying $35 a month for around 600MBit symmetric data. Unfortunately they don’t offer static IP addresses to residential customers, and nor do they allow multiple IP addresses per connection, and I’m the sort of person who’d like to run a bunch of stuff myself, so I’ve been looking for ways to manage this.

What I’ve ended up doing is renting a cheap VPS from a vendor that lets me add multiple IP addresses for minimal extra cost. The precise nature of the VPS isn’t relevant – you just want a machine (it doesn’t need much CPU, RAM, or storage) that has multiple world routeable IPv4 addresses associated with it and has no port blocks on incoming traffic. Ideally it’s geographically local and peers with your ISP in order to reduce additional latency, but that’s a nice to have rather than a requirement.

By setting that up you now have multiple real-world IP addresses that people can get to. How do we get them to the machine in your house you want to be accessible? First we need a connection between that machine and your VPS, and the easiest approach here is Wireguard. We only need a point-to-point link, nothing routable, and none of the IP addresses involved need to have anything to do with any of the rest of your network. So, on your local machine you want something like:

[Interface]
PrivateKey = privkeyhere
ListenPort = 51820
Address = localaddr/32

[Peer]
Endpoint = VPS:51820
PublicKey = pubkeyhere
AllowedIPs = VPS/0

And on your VPS, something like:

[Interface]
Address = vpswgaddr/32
SaveConfig = true
ListenPort = 51820
PrivateKey = privkeyhere

[Peer]
PublicKey = pubkeyhere
AllowedIPs = localaddr/32

The addresses here are (other than the VPS address) arbitrary – but they do need to be consistent, otherwise Wireguard is going to be unhappy and your packets will not have a fun time. Bring that interface up with wg-quick and make sure the devices can ping each other. Hurrah! That’s the easy bit.

Now you want packets from the outside world to get to your internal machine. Let’s say the external IP address you’re going to use for that machine is 321.985.520.309 and the wireguard address of your local system is 867.420.696.005. On the VPS, you’re going to want to do:

iptables -t nat -A PREROUTING -p tcp -d 321.985.520.309 -j DNAT --to-destination 867.420.696.005

Now, all incoming packets for 321.985.520.309 will be rewritten to head towards 867.420.696.005 instead (make sure you’ve set net.ipv4.ip_forward to 1 via sysctl!). Victory! Or is it? Well, no.

What we’re doing here is rewriting the destination address of the packets so instead of heading to an address associated with the VPS, they’re now going to head to your internal system over the Wireguard link. Which is then going to ignore them, because the AllowedIPs statement in the config only allows packets coming from your VPS, and these packets still have their original source IP. We could rewrite the source IP to match the VPS IP, but then you’d have no idea where any of these packets were coming from, and that sucks. Let’s do something better. On the local machine, in the peer, let’s update AllowedIPs to 0.0.0.0/0 to permit packets from any source to appear over our Wireguard link. But if we bring the interface up now, it’ll try to route all traffic over the Wireguard link, which isn’t what we want. So we’ll add table = off to the interface stanza of the config to disable that, and now we can bring the interface up without breaking everything but still allowing packets to reach us. However, we do still need to tell the kernel how to reach the remote VPN endpoint, which we can do with ip route add vpswgaddr dev wg0. Add this to the interface stanza as:

PostUp = ip route add vpswgaddr dev wg0
PreDown = ip route del vpswgaddr dev wg0
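If the interaction between DNAT and AllowedIPs isn't obvious, this Python sketch models just the two address checks involved. It is a toy, not how the kernel or Wireguard are implemented. The example addresses in the text are deliberately not real IPs, so this uses documentation ranges instead as an assumption: 198.51.100.7 is some client on the internet, 203.0.113.10 is the public address on the VPS, 10.0.0.1 is the VPS end of the tunnel and 10.0.0.2 is the local machine's end.

```python
import ipaddress
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


def dnat(packet: Packet, match_dst: str, new_dst: str) -> Packet:
    """Rewrite only the destination, like the PREROUTING DNAT rule."""
    if packet.dst == match_dst:
        return replace(packet, dst=new_dst)
    return packet


def allowed(packet: Packet, allowed_ips: list) -> bool:
    """Inbound tunnel packets are accepted only if their source address
    falls inside the peer's AllowedIPs."""
    src = ipaddress.ip_address(packet.src)
    return any(src in ipaddress.ip_network(net) for net in allowed_ips)


incoming = Packet(src="198.51.100.7", dst="203.0.113.10")
forwarded = dnat(incoming, match_dst="203.0.113.10", new_dst="10.0.0.2")

print(forwarded)  # source untouched, destination rewritten
print(allowed(forwarded, ["10.0.0.1/32"]))  # only the VPS tunnel address allowed
print(allowed(forwarded, ["0.0.0.0/0"]))    # any source allowed
```

The forwarded packet keeps the client's source address, so a peer entry that only allows the VPS's own address drops it, while 0.0.0.0/0 lets it through.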

That’s half the battle. The problem is that they’re going to show up there with the source address still set to the original source IP, and your internal system is (because Linux) going to notice it has the ability to just send replies to the outside world via your ISP rather than via Wireguard and nothing is going to work. Thanks, Linux. Thinux.

But there’s a way to solve this – policy routing. Linux allows you to have multiple separate routing tables, and define policy that controls which routing table will be used for a given packet. First, let’s define a new table reference. On the local machine, edit /etc/iproute2/rt_tables and add a new entry that’s something like:

1 wireguard

where “1” is just a standin for a number not otherwise used there. Now edit your wireguard config and replace table=off with table=wireguard – Wireguard will now update the wireguard routing table rather than the global one. Now all we need to do is to tell the kernel to push packets into the appropriate routing table – we can do that with ip rule add from localaddr lookup wireguard, which tells the kernel to take any packet coming from our Wireguard address and push it via the Wireguard routing table. Add that to your Wireguard interface config as:

PostUp = ip rule add from localaddr lookup wireguard
PreDown = ip rule del from localaddr lookup wireguard

and now your local system is effectively on the internet.
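Since the local config has been built up in pieces, here is a small Python helper that renders the whole thing in one go. It is only a sketch of how the fragments above fit together: the function name and its parameters are made up, the values passed in are placeholders, and it assumes the interface is called wg0 and that the wireguard table already exists in /etc/iproute2/rt_tables.

```python
def render_local_config(private_key, local_addr, vps_endpoint,
                        vps_public_key, vps_wg_addr,
                        table="wireguard", interface="wg0"):
    """Return the text of a wg-quick config for the machine at home."""
    lines = [
        "[Interface]",
        f"PrivateKey = {private_key}",
        "ListenPort = 51820",
        f"Address = {local_addr}/32",
        # Routes go into the dedicated table, not the main one
        f"Table = {table}",
        # Tell the kernel how to reach the VPS end of the tunnel
        f"PostUp = ip route add {vps_wg_addr} dev {interface}",
        f"PreDown = ip route del {vps_wg_addr} dev {interface}",
        # Replies from our tunnel address use the dedicated table
        f"PostUp = ip rule add from {local_addr} lookup {table}",
        f"PreDown = ip rule del from {local_addr} lookup {table}",
        "",
        "[Peer]",
        f"Endpoint = {vps_endpoint}:51820",
        f"PublicKey = {vps_public_key}",
        # Accept forwarded packets from any original source address
        "AllowedIPs = 0.0.0.0/0",
    ]
    return "\n".join(lines) + "\n"


print(render_local_config("privkeyhere", "10.0.0.2", "vps.example.net",
                          "pubkeyhere", "10.0.0.1"))
```

Write the output to something like /etc/wireguard/wg0.conf and bring it up with wg-quick as before.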

You can do this for multiple systems – just configure additional Wireguard interfaces on the VPS and make sure they’re all listening on different ports. If your local IP changes then your local machines will end up reconnecting to the VPS, but to the outside world their accessible IP address will remain the same. It’s like having a real IP without the pain of convincing your ISP to give it to you.

How Twitter could (somewhat) fix their encrypted DMs

2025-06-05 Matthew Garrett

Post Syndicated from Matthew Garrett original https://mjg59.dreamwidth.org/71933.html

As I wrote in my last post, Twitter’s new encrypted DM infrastructure is pretty awful. But the amount of work required to make it somewhat better isn’t large.

When Juicebox is used with HSMs, it supports encrypting the communication between the client and the backend. This is handled by generating a unique keypair for each HSM. The public key is provided to the client, while the private key remains within the HSM. Even if you can see the traffic sent to the HSM, it’s encrypted using the Noise protocol and so the user’s encrypted secret data can’t be retrieved.

But this is only useful if you know that the public key corresponds to a private key in the HSM! Right now there’s no way to know this, but there’s worse – the client doesn’t have the public key built into it, it’s supplied as a response to an API request made to Twitter’s servers. Even if the current keys are associated with the HSMs, Twitter could swap them out with ones that aren’t, terminate the encrypted connection at their endpoint, and then fake your query to the HSM and get the encrypted data that way. Worse, this could be done for specific targeted users, without any indication to the user that this has happened, making it almost impossible to detect in general.

This is at least partially fixable. Twitter could prove to a third party that their Juicebox keys were generated in an HSM, and the key material could be moved into clients. This makes attacking individual users more difficult (the backdoor code would need to be shipped in the public client), but can’t easily help with the website version[1] even if a framework exists to analyse the clients and verify that the correct public keys are in use.
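The "move the key material into clients" part is ordinary key pinning. A minimal Python sketch of the client-side check might look like the following. This is not Twitter's or Juicebox's code: the pinned values, the byte strings standing in for public keys, and both function names are placeholders invented for this example.

```python
import hashlib
import hmac

# Hypothetical fingerprints of HSM public keys, baked in at build time.
PINNED_FINGERPRINTS = frozenset({
    hashlib.sha256(b"example-hsm-public-key-1").hexdigest(),
    hashlib.sha256(b"example-hsm-public-key-2").hexdigest(),
})


def fingerprint(public_key: bytes) -> str:
    """SHA-256 fingerprint of the raw public key bytes."""
    return hashlib.sha256(public_key).hexdigest()


def key_is_pinned(server_supplied_key: bytes,
                  pinned=PINNED_FINGERPRINTS) -> bool:
    """Refuse any key the server hands us unless it matches a pinned one."""
    candidate = fingerprint(server_supplied_key)
    return any(hmac.compare_digest(candidate, known) for known in pinned)


print(key_is_pinned(b"example-hsm-public-key-1"))
print(key_is_pinned(b"key-swapped-in-by-the-server"))
```

With this in place, swapping keys for one targeted user means shipping a different client build, which is a lot more visible than changing an API response.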

It’s still worse than Signal. Use Signal.

[1] Since they could still just serve backdoored Javascript to specific users. This is, unfortunately, kind of an inherent problem when it comes to web-based clients – we don’t have good frameworks to detect whether the site itself is malicious.

Twitter’s new encrypted DMs aren’t better than the old ones

2025-06-05 Matthew Garrett

Post Syndicated from Matthew Garrett original https://mjg59.dreamwidth.org/71646.html

When Twitter[1] launched encrypted DMs a couple of years ago, it was the worst kind of end-to-end encrypted – technically e2ee, but in a way that made it relatively easy for Twitter to inject new encryption keys and get everyone’s messages anyway. It was also lacking a whole bunch of features such as “sending pictures”, so the entire thing was largely a waste of time. But a couple of days ago, Elon announced the arrival of “XChat”, a new encrypted message platform built on Rust with (Bitcoin style) encryption, whole new architecture. Maybe this time they’ve got it right?

tl;dr – no. Use Signal. Twitter can probably obtain your private keys, and admits that they can MITM you and have full access to your metadata.

The new approach is pretty similar to the old one in that it’s based on pretty straightforward and well tested cryptographic primitives, but merely using good cryptography doesn’t mean you end up with a good solution. This time they’ve pivoted away from using the underlying cryptographic primitives directly and into higher level abstractions, which is probably a good thing. They’re using Libsodium’s boxes for message encryption, which is, well, fine? It doesn’t offer forward secrecy (if someone’s private key is leaked then all existing messages can be decrypted) so it’s a long way from the state of the art for a messaging client (Signal’s had forward secrecy for over a decade!), but it’s not inherently broken or anything. It is, however, written in C, not Rust[2].

That’s about the extent of the good news. Twitter’s old implementation involved clients generating keypairs and pushing the public key to Twitter. Each client (a physical device or a browser instance) had its own private key, and messages were simply encrypted to every public key associated with an account. This meant that new devices couldn’t decrypt old messages, and also meant there was a maximum number of supported devices and terrible scaling issues and it was pretty bad. The new approach generates a keypair and then stores the private key using the Juicebox protocol. Other devices can then retrieve the private key.

Doesn’t this mean Twitter has the private key? Well, no. There’s a PIN involved, and the PIN is used to generate an encryption key. The stored copy of the private key is encrypted with that key, so if you don’t know the PIN you can’t decrypt the key. So we brute force the PIN, right? Juicebox actually protects against that – before the backend will hand over the encrypted key, you have to prove knowledge of the PIN to it (this is done in a clever way that doesn’t directly reveal the PIN to the backend). If you ask for the key too many times while providing the wrong PIN, access is locked down.
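The attempt limiting idea can be shown with a toy Python model. To be very clear, this is not the Juicebox protocol: the real thing avoids revealing the PIN to the backend in a cleverer way, whereas this sketch just compares a derived value. The class, the `proof_for` helper and the limit of 3 attempts are all assumptions made for the example.

```python
import hashlib
import hmac


def proof_for(pin: str, salt: bytes) -> bytes:
    """Stand-in for 'proving knowledge of the PIN'; a real design would
    not simply hand a PIN-derived value to the backend."""
    return hashlib.pbkdf2_hmac("sha256", pin.encode(), salt, 10_000)


class GuessLimitedStore:
    """Toy backend: hands over the encrypted key only for a correct proof,
    and locks permanently after too many wrong guesses."""

    def __init__(self, pin_proof: bytes, encrypted_key: bytes, max_attempts: int = 3):
        self._pin_proof = pin_proof
        self._encrypted_key = encrypted_key
        self._max_attempts = max_attempts
        self._remaining = max_attempts

    def recover(self, proof: bytes) -> bytes:
        if self._remaining <= 0:
            raise PermissionError("locked after too many wrong guesses")
        if hmac.compare_digest(proof, self._pin_proof):
            self._remaining = self._max_attempts  # reset on success
            return self._encrypted_key
        self._remaining -= 1
        raise ValueError("wrong PIN")


salt = b"per-user-salt"
store = GuessLimitedStore(proof_for("1234", salt), b"encrypted-private-key")
print(store.recover(proof_for("1234", salt)))
```

Everything here depends on the counter living somewhere the operator can't tamper with or bypass.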

But this is true only if the Juicebox backend is trustworthy. If the backend is controlled by someone untrustworthy[3] then they’re going to be able to obtain the encrypted key material (even if it’s in an HSM, they can simply watch what comes out of the HSM when the user authenticates). And now all they need is the PIN. Turning the PIN into an encryption key is done using the Argon2id key derivation function, using 32 iterations and a memory cost of 16MB (the Juicebox white paper says 16KB, but (a) that’s laughably small and (b) the code says 16 * 1024 in an argument that takes kilobytes), which makes it computationally and moderately memory expensive to generate the encryption key used to decrypt the private key. How expensive? Well, on my (not very fast) laptop, that takes less than 0.2 seconds. How many attempts do I need to crack the PIN? Twitter’s chosen to fix that to 4 digits, so a maximum of 10,000. You aren’t going to need many machines running in parallel to bring this down to a very small amount of time, at which point private keys can, to a first approximation, be extracted at will.
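The arithmetic is worth spelling out. This Python snippet uses the numbers above (4 digit PIN, 0.2 seconds per guess as an upper bound) and assumes, as a simplification, that the work splits evenly across machines that are each about as slow as that laptop.

```python
PIN_DIGITS = 4
MS_PER_GUESS = 200  # upper bound: "less than 0.2 seconds" per derivation


def worst_case_seconds(machines: int) -> float:
    """Time to try every PIN when the guesses are split evenly."""
    candidates = 10 ** PIN_DIGITS
    guesses_per_machine = -(-candidates // machines)  # ceiling division
    return guesses_per_machine * MS_PER_GUESS / 1000


for machines in (1, 10, 100):
    print(f"{machines:>3} machine(s): {worst_case_seconds(machines):.0f} seconds worst case")
```

That's about 33 minutes on a single slow laptop and 20 seconds on a hundred of them, and on average an attacker only needs to search half the space.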

Juicebox attempts to defend against this by supporting sharding your key over multiple backends, and only requiring a subset of those to recover the original. Twitter does seem to be making use of this, but all the backends used are under x.com so are presumably under Twitter’s direct control. Trusting the keystore without needing to trust whoever’s hosting it requires a trustworthy communications mechanism between the client and the keystore. If the device you’re talking to can prove that it’s an HSM that implements the attempt limiting protocol and has no other mechanism to export the data, this can be made to work. Signal makes use of something along these lines using Intel SGX for contact list and settings storage and recovery, and Google and Apple also have documentation about how they handle this in ways that make it difficult for them to obtain backed up key material. Twitter has no documentation of this, and as far as I can tell does nothing to prove that the backend is in any way trustworthy.

On the plus side, Juicebox is written in Rust, so Elon’s not 100% wrong. Just mostly wrong.

But ok, at least you’ve got viable end-to-end encryption even if someone can put in some (not all that much, really) effort to obtain your private key and render it all pointless? Actually no, since you’re still relying on the Twitter server to give you the public key of the other party and there’s no out of band mechanism to do that or verify the authenticity of that public key at present. Twitter can simply give you a public key where they control the private key, decrypt the message, and then reencrypt it with the intended recipient’s key and pass it on. The support page makes it clear that this is a known shortcoming and that it’ll be fixed at some point, but they said that about the original encrypted DM support and it never was, so that’s probably dependent on whether Elon gets distracted by something else again. And the server knows who and when you’re messaging even if they haven’t bothered to break your private key, so there’s a lot of metadata leakage.

Signal doesn’t have these shortcomings. Use Signal.

[1] I’ll respect their name change once Elon respects his daughter

[2] There are implementations written in Rust, but Twitter’s using the C one with these JNI bindings

[3] Or someone nominally trustworthy but who’s been compelled to act against your interests – even if Elon were absolutely committed to protecting all his users, his overarching goals for Twitter require him to have legal presence in multiple jurisdictions that are not necessarily above placing employees in physical danger if there’s a perception that they could obtain someone’s encryption keys

Failing upwards: the Twitter encrypted DM failure

2025-03-19 Matthew Garrett

Post Syndicated from Matthew Garrett original https://mjg59.dreamwidth.org/71188.html

Almost two years ago, Twitter launched encrypted direct messages. I wrote about their technical implementation at the time, and to the best of my knowledge nothing has changed. The short story is that the actual encryption primitives used are entirely normal and fine – messages are encrypted using AES, and the AES keys are exchanged via NIST P-256 elliptic curve asymmetric keys. The asymmetric keys are each associated with a specific device or browser owned by a user, so when you send a message to someone you encrypt the AES key with all of their asymmetric keys and then each device or browser can decrypt the message again. As long as the keys are managed appropriately, this is infeasible to break.

But how do you know what a user’s keys are? I also wrote about this last year – key distribution is a hard problem. In the Twitter DM case, you ask Twitter’s server, and if Twitter wants to intercept your messages they replace your key. The documentation for the feature basically admits this – if people with guns showed up there, they could very much compromise the protection in such a way that all future messages you sent were readable. It’s also impossible to prove that they’re not already doing this without every user verifying that the public keys Twitter hands out to other users correspond to the private keys they hold, something that Twitter provides no mechanism to do.

This isn’t the only weakness in the implementation. Twitter may not be able to read the messages, but every encrypted DM is sent through exactly the same infrastructure as the unencrypted ones, so Twitter can see the time a message was sent, who it was sent to, and roughly how big it was. And because pictures and other attachments in Twitter DMs aren’t sent in-line but are instead replaced with links, the implementation would encrypt the links but not the attachments – this is “solved” by simply blocking attachments in encrypted DMs. There’s no forward secrecy – if a key is compromised it allows access to not only all new messages created with that key, but also all previous messages. If you log out of Twitter the keys are still stored by the browser, so they can potentially be extracted and used to decrypt your communications. And there’s no group chat support at all, which is more a functional restriction than a conceptual one.

To be fair, these are hard problems to solve! Signal solves all of them, but Signal is the product of a large number of highly skilled experts in cryptography, and even so it’s taken years to achieve all of this. When Elon announced the launch of encrypted DMs he indicated that new features would be developed quickly – he’s since publicly mentioned the feature a grand total of once, in which he mentioned further feature development that just didn’t happen. None of the limitations mentioned in the documentation have been addressed in the 22 months since the feature was launched.

Why? Well, it turns out that the feature was developed by a total of two engineers, neither of whom is still employed at Twitter. The tech lead for the feature was Christopher Stanley, who was actually a SpaceX employee at the time. Since then he’s ended up at DOGE, where he apparently set off alarms when attempting to install Starlink, and today he is apparently being appointed to the board of Fannie Mae, a government-backed mortgage company.

Anyway. Use Signal.

The GPU, not the TPM, is the root of hardware DRM

2025-01-02 Matthew Garrett

Post Syndicated from Matthew Garrett original https://mjg59.dreamwidth.org/70954.html

As part of their “Defective by Design” anti-DRM campaign, the FSF recently made the following claim:
Today, most of the major streaming media platforms utilize the TPM to decrypt media streams, forcefully placing the decryption out of the user’s control (from here).
This is part of an overall argument that Microsoft’s insistence that only hardware with a TPM can run Windows 11 is with the goal of aiding streaming companies in their attempt to ensure media can only be played in tightly constrained environments.

I’m going to be honest here and say that I don’t know what Microsoft’s actual motivation for requiring a TPM in Windows 11 is. I’ve been talking about TPM stuff for a long time. My job involves writing a lot of TPM code. I think having a TPM enables a number of worthwhile security features. Given the choice, I’d certainly pick a computer with a TPM. But in terms of whether it’s of sufficient value to lock out Windows 11 on hardware with no TPM that would otherwise be able to run it? I’m not sure that’s a worthwhile tradeoff.

What I can say is that the FSF’s claim is just 100% wrong, and since this seems to be the sole basis of their overall claim about Microsoft’s strategy here, the argument is pretty significantly undermined. I’m not aware of any streaming media platforms making use of TPMs in any way whatsoever. There is hardware DRM that the media companies use to restrict users, but it’s not in the TPM – it’s in the GPU.

Let’s back up for a moment. There’s multiple different DRM implementations, but the big three are Widevine (owned by Google, used on Android, Chromebooks, and some other embedded devices), Fairplay (Apple implementation, used for Mac and iOS), and Playready (Microsoft’s implementation, used in Windows and some other hardware streaming devices and TVs). These generally implement several levels of functionality, depending on the capabilities of the device they’re running on – this will range from all the DRM functionality being implemented in software up to the hardware path that will be discussed shortly. Streaming providers can choose what level of functionality and quality to provide based on the level implemented on the client device, and it’s common for 4K and HDR content to be tied to hardware DRM. In any scenario, they stream encrypted content to the client and the DRM stack decrypts it before the compressed data can be decoded and played.

The “problem” with software DRM implementations is that the decrypted material is going to exist somewhere the OS can get at it at some point, making it possible for users to simply grab the decrypted stream, somewhat defeating the entire point. Vendors try to make this difficult by obfuscating their code as much as possible (and in some cases putting some of it in-kernel), but pretty much all software DRM is at least somewhat broken and copies of any new streaming media end up being available via Bittorrent pretty quickly after release. This is why higher quality media tends to be restricted to clients that implement hardware-based DRM.

The implementation of hardware-based DRM varies. On devices in the ARM world this is usually handled by performing the cryptography in a Trusted Execution Environment, or TEE. A TEE is an area where code can be executed without the OS having any insight into it at all, with ARM’s TrustZone being an example of this. By putting the DRM code in TrustZone, the cryptography can be performed in RAM that the OS has no access to, making the scraping described earlier impossible. x86 has no well-specified TEE (Intel’s SGX is an example, but is no longer implemented in consumer parts), so instead this tends to be handed off to the GPU. The exact details of this implementation are somewhat opaque – of the previously mentioned DRM implementations, only Playready does hardware DRM on x86, and I haven’t found any public documentation of what drivers need to expose for this to work.

In any case, as part of the DRM handshake between the client and the streaming platform, encryption keys are negotiated with the key material being stored in the GPU or the TEE, inaccessible from the OS. Once decrypted, the material is decoded (again either on the GPU or in the TEE – even in implementations that use the TEE for the cryptography, the actual media decoding may happen on the GPU) and displayed. One key point is that the decoded video material is still stored in RAM that the OS has no access to, and the GPU composites it onto the outbound video stream (which is why if you take a screenshot of a browser playing a stream using hardware-based DRM you’ll just see a black window – as far as the OS can see, there is only a black window there).

Now, TPMs are sometimes referred to as a TEE, and in a way they are. However, they’re fixed function – you can’t run arbitrary code on the TPM, you only have whatever functionality it provides. But TPMs do have the ability to decrypt data using keys that are tied to the TPM, so isn’t this sufficient? Well, no. First, the TPM can’t communicate with the GPU. The OS could push encrypted material to it, and it would get plaintext material back. But the entire point of this exercise was to prevent the decrypted version of the stream from ever being visible to the OS, so this would be pointless. And rather more fundamentally, TPMs are slow. I don’t think there’s a TPM on the market that could decrypt a 1080p stream in realtime, let alone a 4K one.

The FSF’s focus on TPMs here is not only technically wrong, it’s indicative of a failure to understand what’s actually happening in the industry. While the FSF has been focusing on TPMs, GPU vendors have quietly deployed all of this technology without the FSF complaining at all. Microsoft has enthusiastically participated in making hardware DRM on Windows possible, and user freedoms have suffered as a result, but Playready hardware-based DRM works just fine on hardware that doesn’t have a TPM and will continue to do so.

When should we require that firmware be free?

2024-12-12 Matthew Garrett

Post Syndicated from Matthew Garrett original https://mjg59.dreamwidth.org/70895.html

The distinction between hardware and software has historically been relatively easy to understand – hardware is the physical object that software runs on. This is made more complicated by the existence of programmable logic like FPGAs, but by and large things tend to fall into fairly neat categories if we’re drawing that distinction.

Conversations usually become more complicated when we introduce firmware, but should they? According to Wikipedia, Firmware is software that provides low-level control of computing device hardware, and basically anything that’s generally described as firmware certainly fits into the “software” side of the above hardware/software binary. From a software freedom perspective, this seems like something where the obvious answer to “Should this be free” is “yes”, but it’s worth thinking about why the answer is yes – the goal of free software isn’t freedom for freedom’s sake, but because the freedoms embodied in the Free Software Definition (and by proxy the DFSG) are grounded in real world practicalities.

How do these line up for firmware? Firmware can fit into two main classes – it can be something that’s responsible for initialisation of the hardware (such as, historically, BIOS, which is involved in initialisation and boot and then largely irrelevant for runtime[1]) or it can be something that makes the hardware work at runtime (wifi card firmware being an obvious example). The role of free software in the latter case feels fairly intuitive, since the interface and functionality the hardware offers to the operating system is frequently largely defined by the firmware running on it. Your wifi chipset is, these days, largely a software defined radio, and what you can do with it is determined by what the firmware it’s running allows you to do. Sometimes those restrictions may be required by law, but other times they’re simply because the people writing the firmware aren’t interested in supporting a feature – they may see no reason to allow raw radio packets to be provided to the OS, for instance. We also shouldn’t ignore the fact that sufficiently complicated firmware exposed to untrusted input (as is the case in most wifi scenarios) may contain exploitable vulnerabilities allowing attackers to gain arbitrary code execution on the wifi chipset – and potentially use that as a way to gain control of the host OS (see this writeup for an example). Vendors being in a unique position to update that firmware means users may never receive security updates, leaving them with a choice between discarding hardware that otherwise works perfectly or leaving themselves vulnerable to known security issues.

But even the cases where firmware does nothing other than initialise the hardware cause problems. A lot of hardware has functionality controlled by registers that can be locked during the boot process. Vendor firmware may choose to disable (or, rather, never to enable) functionality that may be beneficial to a user, and then lock out the ability to reconfigure the hardware later. Without any ability to modify that firmware, the user lacks the freedom to choose what functionality their hardware makes available to them. Again, the ability to inspect this firmware and modify it has a distinct benefit to the user.
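As a toy illustration of the lock-at-boot pattern described above, the following sketch models a configuration register that ignores writes once it has been locked. The class and the "feature bit" are hypothetical and do not correspond to any real chipset register.

```python
class LockableRegister:
    """Toy model of a hardware register with a write-once lock (hypothetical)."""

    def __init__(self, value: int = 0) -> None:
        self._value = value
        self._locked = False

    def write(self, value: int) -> bool:
        """Attempt a write; returns False if the register is locked and the write was dropped."""
        if self._locked:
            return False
        self._value = value
        return True

    def lock(self) -> None:
        """Lock the register; in this model only a reset (a new object) clears the lock."""
        self._locked = True

    def read(self) -> int:
        return self._value


# Vendor firmware runs first: it leaves the feature bit clear and locks the register.
feature_control = LockableRegister(value=0)
feature_control.lock()

# Later, the OS tries to enable the feature and the write has no effect.
os_write_succeeded = feature_control.write(1)
```

After this sequence `os_write_succeeded` is `False` and the register still reads `0`: whoever runs first decides, which is why being able to change the firmware matters even if it does nothing after boot.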

So, from a practical perspective, I think there’s a strong argument that users would benefit from most (if not all) firmware being free software, and I don’t think that’s an especially controversial argument. So I think this is less of a philosophical discussion, and more of a strategic one – is spending time focused on ensuring firmware is free worthwhile, and if so what’s an appropriate way of achieving this?

I think there’s two consistent ways to view this. One is to view free firmware as desirable but not necessary. This approach basically argues that code that’s running on hardware that isn’t the main CPU would benefit from being free, in the same way that code running on a remote network service would benefit from being free, but that this is much less important than ensuring that all the code running in the context of the OS on the primary CPU is free. The maximalist position is not to compromise at all – all software on a system, whether it’s running at boot or during runtime, and whether it’s running on the primary CPU or any other component on the board, should be free.

Personally, I lean towards the former and think there’s a reasonably coherent argument here. I think users would benefit from the ability to modify the code running on hardware that their OS talks to, in the same way that I think users would benefit from the ability to modify the code running on hardware the other side of a network link that their browser talks to. I also think that there’s enough that remains to be done in terms of what’s running on the host CPU that it’s not worth having that fight yet. But I think the latter is absolutely intellectually consistent, and while I don’t agree with it from a pragmatic perspective I think things would undeniably be better if we lived in that world.

This feels like a thing you’d expect the Free Software Foundation to have opinions on, and it does! There are two primarily relevant things – the Respects your Freedoms campaign focused on ensuring that certified hardware meets certain requirements (including around firmware), and the Free System Distribution Guidelines, which define a baseline for an OS to be considered free by the FSF (including requirements around firmware).

RYF requires that all software on a piece of hardware be free other than under one specific set of circumstances. If software runs on (a) a secondary processor and (b) within which software installation is not intended after the user obtains the product, then the software does not need to be free. (b) effectively means that the firmware has to be in ROM, since any runtime interface that allows the firmware to be loaded or updated is intended to allow software installation after the user obtains the product.

The Free System Distribution Guidelines require that all non-free firmware be removed from the OS before it can be considered free. The recommended mechanism to achieve this is via linux-libre, a project that produces tooling to remove anything that looks plausibly like a non-free firmware blob from the Linux source code, along with any incitement to the user to load firmware – including even removing suggestions to update CPU microcode in order to mitigate CPU vulnerabilities.

For hardware that requires non-free firmware to be loaded at runtime in order to work, linux-libre doesn’t do anything to work around this – the hardware will simply not work. In this respect, linux-libre reduces the amount of non-free firmware running on a system in the same way that removing the hardware would. This presumably encourages users to purchase RYF compliant hardware.

But does that actually improve things? RYF doesn’t require that a piece of hardware have no non-free firmware, it simply requires that any non-free firmware be hidden from the user. CPU microcode is an instructive example here. At the time of writing, every laptop listed here has an Intel CPU. Every Intel CPU has microcode in ROM, typically an early revision that is known to have many bugs. The expectation is that this microcode is updated in the field by either the firmware or the OS at boot time – the updated version is loaded into RAM on the CPU, and vanishes if power is cut. The combination of RYF and linux-libre doesn’t reduce the amount of non-free code running inside the CPU, it just means that the user (a) is more likely to hit since-fixed bugs (including security ones!), and (b) has less guidance on how to avoid them.

As long as RYF permits hardware that makes use of non-free firmware I think it hurts more than it helps. In many cases users aren’t guided away from non-free firmware – instead it’s hidden away from them, leaving them less aware that their freedom is constrained. Linux-libre goes further, refusing to even inform the user that the non-free firmware that their hardware depends on can be upgraded to improve their security.

Out of sight shouldn’t mean out of mind. If non-free firmware is a threat to user freedom then allowing it to exist in ROM doesn’t do anything to solve that problem. And if it isn’t a threat to user freedom, then what’s the point of requiring linux-libre for a Linux distribution to be considered free by the FSF? We seem to have ended up in the worst case scenario, where nothing is being done to actually replace any of the non-free firmware running on people’s systems and where users may even end up with a reduced awareness that the non-free firmware even exists.

[1] Yes yes SMM

Android privacy improvements break key attestation

2024-12-12 Matthew Garrett

Post Syndicated from Matthew Garrett original https://mjg59.dreamwidth.org/70630.html

Sometimes you want to restrict access to something to a specific set of devices – for instance, you might want your corporate VPN to only be reachable from devices owned by your company. You can’t really trust a device that self attests to its identity, for instance by reporting its MAC address or serial number, for a couple of reasons:

These aren’t fixed – MAC addresses are trivially reprogrammable, and serial numbers are typically stored in reprogrammable flash at their most protected
A malicious device could simply lie about them

If we want a high degree of confidence that the device we’re talking to really is the device it claims to be, we need something that’s much harder to spoof. For devices with a TPM this is the TPM itself. Every TPM has an Endorsement Key (EK) that’s associated with a certificate that chains back to the TPM manufacturer. By verifying that certificate path and having the TPM prove that it’s in possession of the private half of the EK, we know that we’re communicating with a genuine TPM[1].

Android has a broadly equivalent thing called ID Attestation. Android devices can generate a signed attestation that they have certain characteristics and identifiers, and this can be chained back to the manufacturer. Obviously providing signed proof of the device identifier is kind of problematic from a privacy perspective, so the short version[2] is that only apps installed using a corporate account rather than a normal user account are able to do this.

But that’s still not ideal – the device identifiers involved included the IMEI and serial number of the device, and those could potentially be used to correlate devices across privacy boundaries since they’re static[3] identifiers that are the same both inside a corporate work profile and in the normal user profile, and also remain static if you move between different employers and use the same phone[4]. So, since Android 12, ID Attestation includes an “Enterprise Specific ID” or ESID. The ESID is based on a hash of device-specific data plus the enterprise that the corporate work profile is associated with. If a device is enrolled with the same enterprise then this ID will remain static, if it’s enrolled with a different enterprise it’ll change, and it just doesn’t exist outside the work profile at all. The other device identifiers are no longer exposed.
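The properties of the ESID described above can be illustrated with a short sketch. The real derivation is defined by Android and is not reproduced here; this version assumes a keyed hash (HMAC-SHA256) of an enterprise identifier under a per-device secret, and the function name, inputs, and output length are all hypothetical.

```python
import hashlib
import hmac


def enterprise_specific_id(device_secret: bytes, enterprise_id: str) -> str:
    """Hypothetical ESID: stable per (device, enterprise), unlinkable across enterprises."""
    digest = hmac.new(device_secret, enterprise_id.encode("utf-8"), hashlib.sha256)
    # Truncated only to keep the example output short.
    return digest.hexdigest()[:32]


# Made-up inputs for illustration.
phone = b"per-device secret that never leaves the hardware"
id_at_first_employer = enterprise_specific_id(phone, "first-employer.example")
id_at_second_employer = enterprise_specific_id(phone, "second-employer.example")
```

Re-enrolling the same phone with the same enterprise reproduces `id_at_first_employer`, while a different enterprise sees an unrelated value, so two employers can't compare notes and conclude they saw the same device.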

But device ID verification isn’t enough to solve the underlying problem here. When we receive a device ID attestation we know that someone at the far end has possession of a device with that ID, but we don’t know that that device is where the packets are originating. If our VPN simply has an API that asks for an attestation from a trusted device before routing packets, we could pass that on to said trusted device and then simply forward the attestation to the VPN server[5]. We need some way to prove that the device trying to authenticate is actually that device.

The answer to this is key provenance attestation. If we can prove that an encryption key was generated on a trusted device, and that the private half of that key is stored in hardware and can’t be exported, then using that key to establish a connection proves that we’re actually communicating with a trusted device. TPMs are able to do this using the attestation keys generated in the Credential Activation process, giving us proof that a specific keypair was generated on a TPM that we’ve previously established is trusted.

Android again has an equivalent called Key Attestation. This doesn’t quite work the same way as the TPM process – rather than being tied back to the same unique cryptographic identity, Android key attestation chains back through a separate cryptographic certificate chain but contains a statement about the device identity – including the IMEI and serial number. By comparing those to the values in the device ID attestation we know that the key is associated with a trusted device and we can now establish trust in that key.

“But Matthew”, those of you who’ve been paying close attention may be saying, “Didn’t Android 12 remove the IMEI and serial number from the device ID attestation?” And, well, congratulations, you were apparently paying more attention than Google. The key attestation no longer contains enough information to tie back to the device ID attestation, making it impossible to prove that a hardware-backed key is associated with a specific device ID attestation and its enterprise enrollment.
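The matching step and the way it breaks can be sketched as follows. This is a simplified model, not Android's attestation format: each attestation is reduced to a dictionary of identifiers, signature and certificate chain checks are assumed to have already passed, and the field names and values are hypothetical.

```python
def attestations_match(device_ids: dict[str, str], key_ids: dict[str, str]) -> bool:
    """True only if the two attestations share at least one identifier and all shared ones agree."""
    shared = device_ids.keys() & key_ids.keys()
    return bool(shared) and all(device_ids[name] == key_ids[name] for name in shared)


# Older behaviour: both attestations carry the same hardware identifiers.
old_device_attestation = {"imei": "490154203237518", "serial": "ABC123"}
old_key_attestation = {"imei": "490154203237518", "serial": "ABC123"}

# Newer behaviour (modelled loosely): the device ID attestation only exposes an ESID,
# so there is no field in common left to compare against.
new_device_attestation = {"esid": "0f3a9c5d"}
new_key_attestation = {}  # nothing that lines up with the ESID

tied_before = attestations_match(old_device_attestation, old_key_attestation)
tied_after = attestations_match(new_device_attestation, new_key_attestation)
```

Here `tied_before` is `True` and `tied_after` is `False`: both attestations can be perfectly valid in isolation, but without a shared identifier nothing links the hardware-backed key to the enrolled device.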

I don’t think this was any sort of deliberate breakage, and it’s probably more an example of shipping the org chart – my understanding is that device ID attestation and key attestation are implemented by different parts of the Android organisation and the impact of the ESID change (something that appears to be a legitimate improvement in privacy!) on key attestation was probably just not realised. But it’s still a pain.

[1] Those of you paying attention may realise that what we’re doing here is proving the identity of the TPM, not the identity of the device it’s associated with. Typically the TPM identity won’t vary over the lifetime of the device, so having a one-time binding of those two identities (such as when a device is initially being provisioned) is sufficient. There’s actually a spec for distributing Platform Certificates that allows device manufacturers to bind these together during manufacturing, but I last worked on those a few years back and don’t know what the current state of the art there is

[2] Android has a bewildering array of different profile mechanisms, some of which are apparently deprecated, and I can never remember how any of this works, so you’re not getting the long version

[3] Nominally, anyway. Cough.

[4] I wholeheartedly encourage people not to put work accounts on their personal phones, but I am a filthy hypocrite here

[5] Obviously if we have the ability to ask for attestation from a trusted device, we have access to a trusted device. Why not simply use the trusted device? The answer there may be that we’ve compromised one and want to do as little as possible on it in order to reduce the probability of triggering any sort of endpoint detection agent, or it may be because we want to run on a device with different security properties than those enforced on the trusted device.

What the fuck is an SBAT and why does everyone suddenly care

2024-08-22 Matthew Garrett

Post Syndicated from Matthew Garrett original https://mjg59.dreamwidth.org/70348.html

Short version: Secure Boot Advanced Targeting and if that’s enough for you you can skip the rest you’re welcome.

Long version: When UEFI Secure Boot was specified, everyone involved was, well, a touch naive. The basic security model of Secure Boot is that all the code that ends up running in a kernel-level privileged environment should be validated before execution – the firmware verifies the bootloader, the bootloader verifies the kernel, the kernel verifies any additional runtime loaded kernel code, and now we have a trusted environment to impose any other security policy we want. Obviously people might screw up, but the spec included a way to revoke any signed components that turned out not to be trustworthy: simply add the hash of the untrustworthy code to a variable, and then refuse to load anything with that hash even if it’s signed with a trusted key.
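The hash-based revocation described above can be sketched in a few lines. This is a simplification: real firmware computes the image hash in a specific way rather than hashing the raw file, and the function names and sample "binaries" here are invented for illustration.

```python
import hashlib


def image_hash(image: bytes) -> str:
    """Simplified stand-in for the hash the firmware would compute over a boot image."""
    return hashlib.sha256(image).hexdigest()


def may_execute(image: bytes, signed_by_trusted_key: bool, revoked: set[str]) -> bool:
    """Revocation wins: a revoked hash is refused even if the signature is valid."""
    if image_hash(image) in revoked:
        return False
    return signed_by_trusted_key


# Two builds of the same vulnerable source, as shipped by two distributions.
distro_a_build = b"bootloader built by distro A"
distro_b_build = b"bootloader built by distro B"

# Revoking one build does nothing about the other.
revoked_hashes = {image_hash(distro_a_build)}
```

With this list, `may_execute(distro_a_build, True, revoked_hashes)` is `False` but `may_execute(distro_b_build, True, revoked_hashes)` is still `True`. Any byte of difference gives a different hash, so every affected binary needs its own entry in the list.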

Unfortunately, as it turns out, scale. Every Linux distribution that works in the Secure Boot ecosystem generates their own bootloader binaries, and each of them has a different hash. If there’s a vulnerability identified in the source code for said bootloader, there’s a large number of different binaries that need to be revoked. And, well, the storage available to store the variable containing all these hashes is limited. There’s simply not enough space to add a new set of hashes every time it turns out that grub (a bootloader initially written for a simpler time when there was no boot security and which has several separate image parsers and also a font parser and look you know where this is going) has another mechanism for a hostile actor to cause it to execute arbitrary code, so another solution was needed.

And that solution is SBAT. The general concept behind SBAT is pretty straightforward. Every important component in the boot chain declares a security generation that’s incorporated into the signed binary. When a vulnerability is identified and fixed, that generation is incremented. An update can then be pushed that defines a minimum generation – boot components will look at the next item in the chain, compare its name and generation number to the ones stored in a firmware variable, and decide whether or not to execute it based on that. Instead of having to revoke a large number of individual hashes, it becomes possible to push one update that simply says “Any version of grub with a security generation below this number is considered untrustworthy”.
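The generation check described above can be sketched like this. The CSV layout is loosely modelled on the one shim uses (component name first, generation second), but the parsing is simplified and every name, version, and URL in the sample data is made up.

```python
import csv


def parse_sbat(text: str) -> dict[str, int]:
    """Map component name -> security generation from rows of the form name,generation,..."""
    levels: dict[str, int] = {}
    for row in csv.reader(text.strip().splitlines()):
        if len(row) >= 2:
            levels[row[0].strip()] = int(row[1])
    return levels


def is_revoked(binary_sbat: str, minimum_levels: str) -> bool:
    """True if any component in the binary is below the minimum generation for that name."""
    declared = parse_sbat(binary_sbat)
    minimums = parse_sbat(minimum_levels)
    return any(
        name in minimums and generation < minimums[name]
        for name, generation in declared.items()
    )


# Hypothetical metadata embedded in a signed grub binary.
GRUB_SBAT = """sbat,1,SBAT Version,sbat,1,https://example.invalid/sbat
grub,2,Example Upstream,grub,2.06,https://example.invalid/grub
grub.exampledistro,1,Example Distro,grub2,2.06-1,https://example.invalid/distro
"""

# Hypothetical revocation data: any grub below generation 3 is untrusted.
REVOCATIONS = "sbat,1\ngrub,3\n"
```

`is_revoked(GRUB_SBAT, REVOCATIONS)` is `True` because the binary declares grub generation 2 and the minimum is 3. One short line in the revocation data covers every distribution's build of the vulnerable code, which is what makes this fit in limited variable storage.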

So why is this suddenly relevant? SBAT was developed collaboratively between the Linux community and Microsoft, and Microsoft chose to push a Windows update that told systems not to trust versions of grub with a security generation below a certain level. This was because those versions of grub had genuine security vulnerabilities that would allow an attacker to compromise the Windows secure boot chain, and we’ve seen real world examples of malware wanting to do that (Black Lotus did so using a vulnerability in the Windows bootloader, but a vulnerability in grub would be just as viable for this). Viewed purely from a security perspective, this was a legitimate thing to want to do.

(An aside: the “Something has gone seriously wrong” message that’s associated with people having a bad time as a result of this update? That’s a message from shim, not any Microsoft code. Shim pays attention to SBAT updates in order to avoid violating the security assumptions made by other bootloaders on the system, so even though it was Microsoft that pushed the SBAT update, it’s the Linux bootloader that refuses to run old versions of grub as a result. This is absolutely working as intended)

The problem we’ve ended up in is that several Linux distributions had not shipped versions of grub with a newer security generation, and so those versions of grub are assumed to be insecure (it’s worth noting that grub is signed by individual distributions, not Microsoft, so there’s no externally introduced lag here). Microsoft’s stated intention was that Windows Update would only apply the SBAT update to systems that were Windows-only, and any dual-boot setups would instead be left vulnerable to attack until the installed distro updated its grub and shipped an SBAT update itself. Unfortunately, as is now obvious, that didn’t work as intended and at least some dual-boot setups applied the update and that distribution’s Shim refused to boot that distribution’s grub.

What’s the summary? Microsoft (understandably) didn’t want it to be possible to attack Windows by using a vulnerable version of grub that could be tricked into executing arbitrary code and then introduce a bootkit into the Windows kernel during boot. Microsoft did this by pushing a Windows Update that updated the SBAT variable to indicate that known-vulnerable versions of grub shouldn’t be allowed to boot on those systems. The distribution-provided Shim first-stage bootloader read this variable, read the SBAT section from the installed copy of grub, realised these conflicted, and refused to boot grub with the “Something has gone seriously wrong” message. This update was not supposed to apply to dual-boot systems, but did anyway. Basically:

1) Microsoft applied an update to systems where that update shouldn’t have been applied
2) Some Linux distros failed to update their grub code and SBAT security generation when exploitable security vulnerabilities were identified in grub

The outcome is that some people can’t boot their systems. I think there’s plenty of blame here. Microsoft should have done more testing to ensure that dual-boot setups could be identified accurately. But also distributions shipping signed bootloaders should make sure that they’re updating those and updating the security generation to match, because otherwise they’re shipping a vector that can be used to attack other operating systems and that’s kind of a violation of the social contract around all of this.

It’s unfortunate that the victims here are largely end users faced with a system that suddenly refuses to boot the OS they want to boot. That should never happen. I don’t think asking arbitrary end users whether they want secure boot updates is likely to result in good outcomes, and while I vaguely tend towards UEFI Secure Boot not being something that benefits most end users it’s also a thing you really don’t want to discover you want after the fact so I have sympathy for it being default on, so I do sympathise with Microsoft’s choices here, other than the failed attempt to avoid the update on dual boot systems.

Anyway. I was extremely involved in the implementation of this for Linux back in 2012 and wrote the first prototype of Shim (which is now a massively better bootloader maintained by a wider set of people and that I haven’t touched in years), so if you want to blame an individual please do feel free to blame me. This is something that shouldn’t have happened, and unless you’re either Microsoft or a Linux distribution it’s not your fault. I’m sorry.


Client-side filtering of private data is a bad idea

2024-08-19 Matthew Garrett

Post Syndicated from Matthew Garrett original https://mjg59.dreamwidth.org/70061.html

(The issues described in this post have been fixed; I have not exhaustively researched whether any other issues exist)

Feeld is a dating app aimed largely at alternative relationship communities (think “classier Fetlife” for the most part), so unsurprisingly it’s fairly popular in San Francisco. Their website makes the claim:

Can people see what or who I’m looking for?
No. You’re the only person who can see which genders or sexualities you’re looking for. Your curiosity and privacy are always protected.

which is based on you being able to restrict searches to people of specific genders, sexualities, or relationship situations. This sort of claim is one of those things that just sits in the back of my head worrying me, so I checked it out.

First step was to grab a copy of the Android APK (there are multiple sites that scrape them from the Play Store) and run it through apk-mitm – Android apps by default don’t trust any additional certificates in the device certificate store, and also frequently implement certificate pinning. apk-mitm pulls apart the apk, looks for known http libraries, disables pinning, and sets the appropriate manifest options for the app to trust additional certificates. Then I set up mitmproxy, installed the cert on a test phone, and installed the app. Now I was ready to start.

What became immediately clear was that the app was using graphql to query. What was a little more surprising is that it appears to have been implemented such that there’s no server state – when browsing profiles, the client requests a batch of profiles along with a list of profiles that the client has already seen. This has the advantage that the server doesn’t need to keep track of a session, but also means that queries just keep getting larger and larger the more you swipe. I’m not a web developer, I have absolutely no idea what the tradeoffs are here, so I point this out as a point of interest rather than anything else.

Anyway. For people unfamiliar with graphql, it’s basically a way to query a database and define the set of fields you want returned. Let’s take the example of requesting a user’s profile. You’d provide the profile ID in question, and request their bio, age, rough distance, status, photos, and other bits of data that the client should show. So far so good. But what happens if we request other data?

graphql supports introspection to request a copy of the database schema, but this feature is optional and was disabled in this case. Could I find this data anywhere else? Pulling apart the apk revealed that it’s a React Native app, which is effectively a framework for writing native apps in Javascript. Sometimes you’ll be lucky and find the actual Javascript source there, but these days it’s more common to find Hermes blobs. Fortunately hermes-dec exists and does a decent job of recovering something that approximates the original input, and from this I was able to find various lists of database fields.

So, remember that original FAQ statement, that your desires would never be shown to anyone else? One of the fields mentioned in the app was “lookingFor”, a field that wasn’t present in the default profile query. What happens if we perform the incredibly complicated hack of exporting a profile query as a curl statement, add “lookingFor” into the set of requested fields, and run it?

Oops.

So, point 1 is that you can’t simply protect data by having your client not ask for it – private data must never be released. But there was a whole separate class of issue that was even more obvious.

Looking more closely at the profile data returned, I noticed that there were fields there that weren’t being displayed in the UI. Those included things like “ageRange”, the range of ages that the profile owner was interested in, and also whether the profile owner had already “liked” or “disliked” your profile (which means a bunch of the profiles you see may already have turned you down, but the app simply didn’t show that). This isn’t ideal, but what was more concerning was that profiles that were flagged as hidden were still being sent to the app and then just not displayed to the user. Another example of this is that the app supports associating your profile with profiles belonging to partners – if one of those profiles was then hidden, the app would stop showing the partnership, but was still providing the profile ID in the query response and querying that ID would still show the hidden profile contents.

Reporting this was inconvenient. There was no security contact listed on the website or in the app. I ended up finding Feeld’s head of trust and safety on Linkedin, paying for a month of Linkedin Pro, and messaging them that way. I was then directed towards a HackerOne program with a link to terms and conditions that 404ed, and it took a while to convince them I was uninterested in signing up to a program without explicit terms and conditions. Finally I was just asked to email security@, and successfully got in touch. I heard nothing back, but after prompting was told that the issues were fixed – I then looked some more, found another example of the same sort of issue, and eventually that was fixed as well. I’ve now been informed that work has been done to ensure that this entire class of issue has been dealt with, but I haven’t done any significant amount of work to ensure that that’s the case.

You can’t trust clients. You can’t give them information and assume they’ll never show it to anyone. You can’t put private data in a database with no additional acls and just rely on nobody ever asking for it. You also can’t find a single instance of this sort of issue and fix it without verifying that there aren’t other examples of the same class. I’m glad that Feeld engaged with me earnestly and fixed these issues, and I really do hope that this has altered their development model such that it’s not something that comes up again in future.
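
A minimal sketch of what enforcing this on the server could look like, in Python. This is not Feeld’s code and I have no idea how their backend is structured. The profile dictionary layout, the function name, and the split between public and owner-only fields are all assumptions for illustration; only “lookingFor” and “ageRange” are field names that actually appeared in the app.

```python
# Fields any viewer may see (hypothetical set, loosely based on what the UI shows).
PUBLIC_FIELDS = {"bio", "age", "distance", "status", "photos"}
# Fields only the profile owner may see.
OWNER_ONLY_FIELDS = {"lookingFor", "ageRange"}


def resolve_profile(profile, requested_fields, viewer_id):
    """Return only the fields this viewer is allowed to see.

    The decision is made on the server, so it doesn't matter which
    fields the client asks for or chooses to display.
    """
    is_owner = viewer_id == profile["id"]
    # Hidden profiles are never sent to anyone but their owner.
    if profile.get("hidden") and not is_owner:
        return None
    allowed = set(PUBLIC_FIELDS)
    if is_owner:
        allowed |= OWNER_ONLY_FIELDS
    return {
        name: profile[name]
        for name in requested_fields
        if name in allowed and name in profile
    }
```

The point of the design is that a modified client adding “lookingFor” to its query gets nothing extra back, and a hidden profile never leaves the server in the first place.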


SSH agent extensions as an arbitrary RPC mechanism

2024-06-12 Matthew Garrett

Post Syndicated from Matthew Garrett original https://mjg59.dreamwidth.org/69646.html

A while back, I wrote about using the SSH agent protocol to satisfy WebAuthn requests. The main problem with this approach is that it required starting the SSH agent with a special argument and also involved being a little too friendly with the implementation – things worked because I could provide an arbitrary public key and the implementation never validated that, but it would be legitimate for it to start doing so and then break everything. And it also only worked for keys stored on tokens that ssh supports – there was no way to extend this to other keystores on the client (such as the Secure Enclave on Macs, or TPM-backed keys on PCs). I wanted a better solution.

It turns out that it was far easier than I expected. The ssh agent protocol is documented here, and the interesting part is the extension mechanism. Basically, you can declare an extension and then just tunnel whatever you want over it. As before, my go-to was the Go ssh agent package, which conveniently implements both the client and server side of this. Implementing the local agent is trivial – look up SSH_AUTH_SOCK, connect to it, create a new agent client that can communicate with that by calling NewClient, and then implement the ExtendedAgent interface, create a new socket, and call ServeAgent against that. Most of the ExtendedAgent functions should simply call through to the original agent, with the exception of Extension(). Just add a case statement against extensionType, define some reasonably namespaced extension, and you’re done.

Now you need to use this agent. You probably don’t want to use this for arbitrary hosts (agent forwarding should only be enabled for remote systems you trust, not arbitrary machines you connect to – if you enabled agent forwarding for github and github got compromised, github would be able to use any private keys loaded into your agent, and you probably don’t want that). So the right approach is to add a Host entry to the ssh config with a ForwardAgent stanza pointing at the socket you created in your new agent. This way the configured subset of remote hosts will automatically talk to this new custom agent, while forwarding for anything else will still be at the user’s discretion.

For the remote end things are even easier. Look up SSH_AUTH_SOCK and call NewClient as before, and then simply call client.Extension(). Whatever you stick in the contents argument will simply end up being received at the client end. You now have a communication channel between the remote system and the local client, and what you do with that is up to you. I’m using it to allow a remote system to obtain auth tokens from Okta and forward WebAuthn challenges that can either be satisfied via a local WebAuthn token or by passing the query off to Mac TouchID, but there are fundamentally no constraints whatsoever on what can be done here.
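
To show how little there is to the extension mechanism on the wire, here is a Python sketch of the request that client.Extension() ends up sending. It is not the Go package’s code. It assumes the framing from the agent protocol draft: a four-byte length, the SSH_AGENTC_EXTENSION message number (27), the extension name as a length-prefixed string, and then the raw contents. The helper names and the “ping@example.invalid” extension name are hypothetical.

```python
import os
import socket
import struct

SSH_AGENTC_EXTENSION = 27  # message number from the agent protocol draft


def ssh_string(data: bytes) -> bytes:
    """Encode bytes as an SSH string: uint32 length followed by the data."""
    return struct.pack(">I", len(data)) + data


def build_extension_request(extension_type: str, contents: bytes) -> bytes:
    """Frame an extension request exactly as it is written to the agent socket."""
    body = (
        bytes([SSH_AGENTC_EXTENSION])
        + ssh_string(extension_type.encode("utf-8"))
        + contents
    )
    return struct.pack(">I", len(body)) + body


def _read_exact(sock, count):
    """Read exactly count bytes or raise if the agent hangs up."""
    data = b""
    while len(data) < count:
        chunk = sock.recv(count - len(data))
        if not chunk:
            raise ConnectionError("agent closed the connection")
        data += chunk
    return data


def call_extension(extension_type, contents, sock_path=None):
    """Send an extension request to the agent and return the raw reply body."""
    path = sock_path or os.environ["SSH_AUTH_SOCK"]
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        sock.sendall(build_extension_request(extension_type, contents))
        (length,) = struct.unpack(">I", _read_exact(sock, 4))
        return _read_exact(sock, length)
```

Calling call_extension("ping@example.invalid", b"hello") on the remote host would hand b"hello" to whatever the forwarded agent’s Extension() handler does with that name. The reply body is returned unparsed, since its format is entirely up to the extension.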

(If you want to do this on Windows and still have everything work with existing clients you’ll need to take this into account – Windows didn’t really do Unix sockets until recently so everything there is awful)


Digital forgeries are hard

2024-03-14 Matthew Garrett

Post Syndicated from Matthew Garrett original https://mjg59.dreamwidth.org/69507.html

Closing arguments in the trial between various people and Craig Wright over whether he’s Satoshi Nakamoto are wrapping up today, amongst a bewildering array of presented evidence. But one utterly astonishing aspect of this lawsuit is that expert witnesses for both sides agreed that much of the digital evidence provided by Craig Wright was unreliable in one way or another, generally including indications that it wasn’t produced at the point in time it claimed to be. And it’s fascinating reading through the subtle (and, in some cases, not so subtle) ways that that’s revealed.

One of the pieces of evidence entered is screenshots of data from Mind Your Own Business, a business management product that’s been around for some time. Craig Wright relied on screenshots of various entries from this product to support his claims around having controlled a meaningful number of bitcoin before he was publicly linked to being Satoshi. If these were authentic then they’d be strong evidence linking him to the mining of coins before Bitcoin’s public availability. Unfortunately the screenshots themselves weren’t contemporary – the metadata shows them being created in 2020. This wouldn’t fundamentally be a problem (it’s entirely reasonable to create new screenshots of old material), as long as it’s possible to establish that the material shown in the screenshots was created at that point. Sadly, well.

One part of the disclosed information was an email that contained a zip file that contained a raw database in the format used by MYOB. Importing that into the tool allowed an audit record to be extracted – this record showed that the relevant entries had been added to the database in 2020, shortly before the screenshots were created. This was, obviously, not strong evidence that Craig had held Bitcoin in 2009. This evidence was reported, and was responded to with a couple of additional databases that had an audit trail that was consistent with the dates in the records in question. Well, partially. The audit record included session data, showing an administrator logging into the database in 2011 and then, uh, logging out in 2023, which is rather more consistent with someone changing their system clock to 2011 to create an entry, and switching it back to present day before logging out. In addition, the audit log included fields that didn’t exist in versions of the product released before 2016, strongly suggesting that the entries dated 2009-2011 were created in software released after 2016. And even worse, the order of insertions into the database didn’t line up with calendar time – an entry dated before another entry may appear in the database afterwards, indicating that it was created later. But even more obvious? The database schema used for these old entries corresponded to a version of the software released in 2023.
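
The insertion-order check is simple enough to sketch in Python. This is not the method used in the expert reports, and the rows below are invented purely to show the idea: walk the records in the order they were physically inserted, and flag any whose claimed date is earlier than one already seen.

```python
from datetime import date


def out_of_order(entries):
    """Flag rows whose claimed date precedes an earlier-inserted row.

    entries is a list of (row_id, claimed_date) in insertion order.
    """
    flagged = []
    latest_seen = None
    for row_id, claimed in entries:
        if latest_seen is not None and claimed < latest_seen:
            flagged.append(row_id)
        else:
            latest_seen = claimed
    return flagged


# Made-up rows, not taken from the evidence.
rows = [
    (1, date(2009, 3, 1)),
    (2, date(2010, 1, 5)),
    (3, date(2009, 8, 20)),
    (4, date(2011, 2, 2)),
]
print(out_of_order(rows))
```

A flagged row isn’t proof of anything by itself, since real bookkeeping does sometimes involve entering older transactions late. It’s one more inconsistency to weigh alongside the session data and schema version.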

This is all consistent with the idea that these records were created after the fact and backdated to 2009-2011, and that after this evidence was made available further evidence was created and backdated to obfuscate that. In an unusual turn of events, during the trial Craig Wright introduced further evidence in the form of a chain of emails to his former lawyers that indicated he had provided them with login details to his MYOB instance in 2019 – before the metadata associated with the screenshots. The implication isn’t entirely clear, but it suggests that either they had an opportunity to examine this data before the metadata suggests it was created, or that they faked the data? So, well, the obvious thing happened, and his former lawyers were asked whether they received these emails. The chain consisted of three emails, two of which they confirmed they’d received. And they received a third email in the chain, but it was different to the one entered in evidence. And, uh, weirdly, they’d received a copy of the email that was submitted – but they’d received it a few days earlier. In 2024.

And again, the forensic evidence is helpful here! It turns out that the email client used associates a timestamp with any attachments, which in this case included an image in the email footer – and the mysterious time travelling email had a timestamp in 2024, not 2019. This was created by the client, so was consistent with the email having been sent in 2024, not being sent in 2019 and somehow getting stuck somewhere before delivery. The date header indicates 2019, as do encoded timestamps in the MIME headers – consistent with the mail being sent by a computer with the clock set to 2019.

But there’s a very weird difference between the copy of the email that was submitted in evidence and the copy that was located afterwards! The first included a header inserted by gmail that included a 2019 timestamp, while the latter had a 2024 timestamp. Is there a way to determine which of these could be the truth? It turns out there is! The format of that header changed in 2022, and the version in the email is the new version. The version with the 2019 timestamp is anachronistic – the format simply doesn’t match the header that gmail would have introduced in 2019, suggesting that an email sent in 2022 or later was modified to include a timestamp of 2019.

This is by no means the only indication that Craig Wright’s evidence may be misleading (there’s the whole argument that the Bitcoin white paper was written in LaTeX when general consensus is that it’s written in OpenOffice, given that’s what the metadata claims), but it’s a lovely example of a more general issue.

Our technology chains are complicated. So many moving parts end up influencing the content of the data we generate, and those parts develop over time. It’s fantastically difficult to generate an artifact now that precisely corresponds to how it would look in the past, even if we go to the effort of installing an old OS on an old PC and setting the clock appropriately (are you sure you’re going to be able to mimic an entirely period appropriate patch level?). Even the version of the font you use in a document may indicate it’s anachronistic. I’m pretty good at computers and I no longer have any belief I could fake an old document.

(References: this Dropbox, under “Expert reports”, “Patrick Madden”. Initial MYOB data is in “Appendix PM7”, further analysis is in “Appendix PM42”, email analysis is “Sixth Expert Report of Mr Patrick Madden”)


Debugging an odd inability to stream video

2024-02-20 Matthew Garrett

Post Syndicated from Matthew Garrett original https://mjg59.dreamwidth.org/69343.html

We have a cabin out in the forest, and when I say “out in the forest” I mean “in a national forest subject to regulation by the US Forest Service” which means there’s an extremely thick book describing the things we’re allowed to do and (somewhat longer) not allowed to do. It’s also down in the bottom of a valley surrounded by tall trees (the whole “forest” bit). There used to be AT&T copper but all that infrastructure burned down in a big fire back in 2021 and AT&T no longer supply new copper links, and Starlink isn’t viable because of the whole “bottom of a valley surrounded by tall trees” thing along with regulations that prohibit us from putting up a big pole with a dish on top. Thankfully there’s LTE towers nearby, so I’m simply using cellular data. Unfortunately my provider rate limits connections to video streaming services in order to push them down to roughly SD resolution. The easy workaround is just to VPN back to somewhere else, which in my case is just a Wireguard link back to San Francisco.

This worked perfectly for most things, but some streaming services simply wouldn’t work at all. Attempting to load the video would just spin forever. Running tcpdump at the local end of the VPN endpoint showed a connection being established, some packets being exchanged, and then… nothing. The remote service appeared to just stop sending packets. Tcpdumping the remote end of the VPN showed the same thing. It wasn’t until I looked at the traffic on the VPN endpoint’s external interface that things began to become clear.

This probably needs some background. Most network infrastructure has a maximum allowable packet size, which is referred to as the Maximum Transmission Unit or MTU. For ethernet this defaults to 1500 bytes, and these days most links are able to handle packets of at least this size, so it’s pretty typical to just assume that you’ll be able to send a 1500 byte packet. But what’s important to remember is that that doesn’t mean you have 1500 bytes of packet payload – that 1500 bytes includes whatever protocol level headers are on there. For TCP/IP you’re typically looking at spending around 40 bytes on the headers, leaving somewhere around 1460 bytes of usable payload. And if you’re using a VPN, things get annoying. In this case the original packet becomes the payload of a new packet, which means it needs another set of TCP (or UDP) and IP headers, and probably also some VPN header. This still all needs to fit inside the MTU of the link the VPN packet is being sent over, so if the MTU of that is 1500, the effective MTU of the VPN interface has to be lower. For Wireguard, this works out to an effective MTU of 1420 bytes. That means simply sending a 1500 byte packet over a Wireguard (or any other VPN) link won’t work – adding the additional headers gives you a total packet size of over 1500 bytes, and that won’t fit into the underlying link’s MTU of 1500.
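
The arithmetic is easy to check. The Python sketch below assumes the usual minimum header sizes (20 bytes for IPv4, 40 for IPv6, 20 for TCP, 8 for UDP) and 32 bytes of Wireguard overhead per data packet (a 16 byte header plus a 16 byte authentication tag). The constant and function names are mine, not anything from Wireguard itself.

```python
LINK_MTU = 1500

IPV4_HEADER = 20
IPV6_HEADER = 40
TCP_HEADER = 20
UDP_HEADER = 8
WIREGUARD_OVERHEAD = 32  # 16 byte data message header + 16 byte auth tag


def tcp_payload(mtu, ip_header=IPV4_HEADER):
    """Usable TCP payload per packet for a given MTU (no TCP/IP options)."""
    return mtu - ip_header - TCP_HEADER


def tunnel_mtu(link_mtu, outer_ip_header):
    """Largest inner packet that fits once the tunnel headers are added."""
    return link_mtu - outer_ip_header - UDP_HEADER - WIREGUARD_OVERHEAD


print(tcp_payload(LINK_MTU))              # 1460
print(tunnel_mtu(LINK_MTU, IPV4_HEADER))  # 1440
print(tunnel_mtu(LINK_MTU, IPV6_HEADER))  # 1420
```

The 1420 figure is the worst case where the tunnel itself runs over IPv6. Using it everywhere means the tunnel works regardless of which IP version carries the outer packets.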

And yet, things work. But how? Faced with a packet that’s too big to fit into a link, there are two choices – break the packet up into multiple smaller packets (“fragmentation”) or tell whoever’s sending the packet to send smaller packets. Fragmentation seems like the obvious answer, so I’d encourage you to read Valerie Aurora’s article on how fragmentation is more complicated than you think. tl;dr – if you can avoid fragmentation then you’re going to have a better life. You can explicitly indicate that you don’t want your packets to be fragmented by setting the Don’t Fragment bit in your IP header. Then, when your packet hits a link whose MTU it exceeds, the router at that hop sends an ICMP packet back to you saying that the packet is too big and what the actual MTU is, and you resend a smaller packet. This avoids all the hassle of handling fragments in exchange for the cost of a retransmit the first time the MTU is exceeded. It also typically works these days, which wasn’t always the case – people had a nasty habit of dropping the ICMP packets telling the sender that the packet was too big, which broke everything.

What I saw when I tcpdumped on the remote VPN endpoint’s external interface was that the connection was getting established, and then a 1500 byte packet would arrive (this is kind of the behaviour you’d expect for video – the connection handshaking involves a bunch of relatively small packets, and then once you start sending the video stream itself you start sending packets that are as large as possible in order to minimise overhead). This 1500 byte packet wouldn’t fit down the Wireguard link, so the endpoint sent back an ICMP packet to the remote telling it to send smaller packets. The remote should then have sent a new, smaller packet – instead, about a second after sending the first 1500 byte packet, it sent that same 1500 byte packet. This is consistent with it ignoring the ICMP notification and just behaving as if the packet had been dropped.

All the services that were failing were failing in identical ways, and all were using Fastly as their CDN. I complained about this on social media and then somehow ended up in contact with the engineering team responsible for this sort of thing – I sent them a packet dump of the failure, they were able to reproduce it, and it got fixed. Hurray!

(Between me identifying the problem and it getting fixed I was able to work around it. The TCP header includes a Maximum Segment Size (MSS) field, which indicates the maximum size of the payload for this connection. iptables allows you to rewrite this, so on the VPN endpoint I simply rewrote the MSS to be small enough that the packets would fit inside the Wireguard MTU. This isn’t a complete fix since it’s done at the TCP level rather than the IP level – so any large UDP packets would still end up breaking)
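
For reference, the value to clamp to falls straight out of the tunnel MTU. This Python sketch assumes the 1420 byte Wireguard MTU mentioned above and headers with no options; the names are illustrative. With iptables the rewrite itself is normally done with the TCPMSS target, though the exact rule I used isn’t reproduced here.

```python
WIREGUARD_MTU = 1420

IPV4_HEADER = 20
IPV6_HEADER = 40
TCP_HEADER = 20


def clamped_mss(tunnel_mtu, ip_header):
    """Largest TCP segment whose packet still fits inside the tunnel MTU."""
    return tunnel_mtu - ip_header - TCP_HEADER


print(clamped_mss(WIREGUARD_MTU, IPV4_HEADER))  # 1380
print(clamped_mss(WIREGUARD_MTU, IPV6_HEADER))  # 1360
```

Since the MSS is advertised in the TCP handshake, rewriting it there means the remote never tries to send a segment too large for the tunnel, and so never needs to act on the ICMP message it was ignoring.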

I’ve no idea what the underlying issue was, and at the client end the failure was entirely opaque: the remote simply stopped sending me packets. The only reason I was able to debug this at all was because I controlled the other end of the VPN as well, and even then I wouldn’t have been able to do anything about it other than being in the fortuitous situation of someone able to do something about it seeing my post. How many people go through their lives dealing with things just being broken and having no idea why, and how do we fix that?

