I think we can build a better chat protocol than Matrix.
I’ve already kinda started a document on why I think Matrix has some fundamental flaws, so I won’t be going into that too much in this document. (It’s not finished yet.)
Our goal here is to make a generic e2ee message delivery protocol and then build a chat protocol on top of it. Developers should be able to use the protocol to securely transmit arbitrary messages in a particular channel asynchronously between client endpoints. When a channel is created initially, part of its application-level options would be a string like a MIME type indicating how its messages should be understood.
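Just to make that concrete, here's a rough Rust sketch (Rust since that's what the reference implementation would be in anyway) of what those application-level channel options might look like. All the names here are made up for illustration, not a spec:

```rust
/// Options supplied when a channel is first created. The `content_type`
/// string tells clients how the application messages in this channel
/// should be understood, much like a MIME type.
struct ChannelOptions {
    /// e.g. "chat/text+v1", "application/octet-stream", "crdt/automerge"
    content_type: String,
    /// Whether this channel uses the e2ee pathway or the unencrypted
    /// signaling pathway discussed later.
    encrypted: bool,
}

fn main() {
    let opts = ChannelOptions {
        content_type: "chat/text+v1".to_string(),
        encrypted: true,
    };
    println!("channel content type: {}", opts.content_type);
    assert!(opts.encrypted);
}
```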
The chat protocol we build on top of this can then be completely free of limitations required by the underlying messaging. If you want to send random blobs of bytes to each other then you should be able to do it, with zero overhead (above the core e2ee protocol). You could even do some CRDT system in them if you wanted to go crazy.
It should be a proper spiritual successor to IRC in the ways that I think Matrix just isn’t and can’t be.
Yeah, we’re just gonna use MLS, since it’s great. In the reference implementation, we’re going to use the very nice library OpenMLS. It means we don’t have to invent new cryptography, and being based on TreeKEM instead of pairwise Diffie-Hellman exchanges means we can support post-quantum primitives later on.
MLS is an in-progress IETF standard; it provides pretty good forward secrecy guarantees and has already been audited. As long as we don’t massively mess up our usage of it, we shouldn’t have to worry about the cryptography itself. It also means that implementing clients/servers in other languages gets easier, as long as an MLS implementation exists for them.
Listen to this episode of Security Cryptography Whatever for more information on MLS.
It’s also wicked fast; apparently Cisco is already using it to do e2ee video chat. Yeah, really!
This is kinda the hard part.
The MLS protocol abstracts message delivery out to a generic “delivery service” which is assumed to be able to deliver messages reliably. I’m planning, at first, on just mimicking the reasonable federated model that Matrix uses, relying on homeservers. There’s nothing in the MLS spec that prevents this, but it’s a bit tricky.
For some background, the hard part is that MLS enforces a total ordering of epoch updates. We can allow for some amount of reordering of messages within an epoch; that’s easy to do. But any time we do something more involved (adding/removing users, a device updating its leaf key, etc.), we need to negotiate it globally.
I think the best way to do this is to have the homeservers sponsoring users in the channel run an instance of Raft (or some other consensus protocol) whenever an epoch update proposal is submitted, and only once they approve it does the proposal actually count. When a proposal is approved, they would all sign it (possibly natively as part of the consensus protocol they’re already running) along with the MLS epoch index. This code is going to be really annoying to explain, because Raft and other consensus protocols also tend to use the term “epoch”. If there are no conflicting proposals then this should be pretty fast.
Users submitting proposals will naturally have to wait for approval before applying the update on their end, just like everyone else does. If there are conflicting updates, a proposal can fail; the user would then see the other, successful proposal, process it, and retry their own proposal on top of the next epoch.
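Here's a hedged sketch of what that client-side retry loop could look like. The `DeliveryService` trait and all of the type names are hypothetical stand-ins for whatever the homeserver consensus layer ends up exposing; real commits would be actual MLS commit messages produced by something like OpenMLS:

```rust
// Hypothetical types, not a spec: a commit proposed against a specific epoch.
#[derive(Clone)]
struct Commit {
    epoch: u64,
    payload: Vec<u8>,
}

enum SubmitResult {
    /// The sponsoring homeservers reached consensus and signed off.
    Accepted,
    /// A conflicting commit won this epoch; here it is so we can catch up.
    Rejected { winning: Commit },
}

trait DeliveryService {
    fn submit(&mut self, commit: &Commit) -> SubmitResult;
}

/// Keep rebasing our proposed change onto the latest epoch until the
/// homeservers' consensus round accepts it.
fn submit_with_retry<D: DeliveryService>(
    ds: &mut D,
    mut build_commit: impl FnMut(u64) -> Commit,
    mut apply: impl FnMut(&Commit),
    mut epoch: u64,
) {
    loop {
        let commit = build_commit(epoch);
        match ds.submit(&commit) {
            SubmitResult::Accepted => {
                apply(&commit);
                return;
            }
            SubmitResult::Rejected { winning } => {
                // Someone else's proposal was approved for this epoch:
                // process it first, then retry on top of the next epoch.
                apply(&winning);
                epoch = winning.epoch + 1;
            }
        }
    }
}
```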
This opens up the possibility of the protocol getting DoSed by malicious users submitting many repeated proposals. A naive way to deal with this is for their homeserver (or other homeservers) to wait a little longer before signaling acceptance of a spammer’s proposals, giving proposals from other sources a chance to be approved first. This might not make it into the first version.
As suggested by the MLS spec, each device will have its own keypair, and within the protocol each device will be treated as a separate entity. A device is identified by a device identity (DI) pubkey, which is used to sign updates to its device identity record (DIR). These records are self-validating and can be exchanged between homeservers freely.
In the initial versions, we’re going to tie device identities to homeservers, but only loosely, since we want to enable roaming DIs in the future. When an identity is registered with a homeserver, the homeserver will sponsor the DI with a signed attestation that it recognizes it, included in an update record. The record should also include a local name for the user, to enable user:homeserver-style references. This is similar to a petname system.
To link devices together as a single user, their DIRs will include signed data referencing each other. If both sides match, other clients should treat them as the same user. Homeservers should keep indexes of this data so they can provide the current DIRs for any user they sponsor.
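A rough sketch of the record shape I have in mind; the field names, the fixed-size key/signature placeholders, and the `same_user` helper are all hypothetical, and a real implementation would obviously verify the actual signatures:

```rust
// Placeholder key and signature types; real ones would come from the
// chosen signature scheme.
type PubKey = [u8; 32];
type Signature = [u8; 64];

struct DeviceIdentityRecord {
    /// The device identity (DI) public key that signs every update to this record.
    di_pubkey: PubKey,
    /// Local name at the sponsoring homeserver, enabling user:homeserver references.
    local_name: String,
    sponsoring_homeserver: String,
    /// Signed attestation from the homeserver that it recognizes this DI.
    homeserver_attestation: Signature,
    /// DIs this device claims belong to the same user. The claim only counts
    /// if the referenced record points back at this one.
    linked_dis: Vec<PubKey>,
    /// Signature by `di_pubkey` over the rest of the record, making it
    /// self-validating so homeservers can exchange it freely.
    self_signature: Signature,
}

/// Two DIRs should be treated as the same user only if the link is mutual.
fn same_user(a: &DeviceIdentityRecord, b: &DeviceIdentityRecord) -> bool {
    a.linked_dis.contains(&b.di_pubkey) && b.linked_dis.contains(&a.di_pubkey)
}
```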
In the first version, homeservers will be identified by hostnames, so spam can be prevented in a similar manner to how it already is in Matrix, where other homeservers can simply throw away any DIs that are sponsored by homeservers they want to block.
In subsequent versions I’d want to support roaming identities that use some other form of authentication, and also homeservers that are themselves identified by only a pubkey or some other clever scheme.
Matrix’s tree event structure is clever, but it only really makes sense if homeservers are loosely coupled and we want message history to self-heal after network partitions. One mistake they made was reusing it to try to implement their e2e encryption system. See the other doc for more analysis.
We’re more cleanly separating our layers. For the “text chat” room type we can be very expressive with our message types. I’m not sure how to best go about it, but I don’t want to directly copy Matrix’s message types because I don’t like how edits and redactions are implemented.
Messages should be able to be grouped together into a pack (like git does) and sent en masse to fill in old history, for example on a user’s other devices. To aid in this, devices would probably attach signatures made with chat-level DIR keys. To reduce pack size, groups of messages sent at similar times should be hashed together and signed as a merkle tree (or merkle mountain range), with individual signatures thrown out after a while.
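Sketching that out, with `DefaultHasher` standing in for a real cryptographic hash and a placeholder signature type, so this is just the shape of the idea:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

type Signature = [u8; 64];

struct PackedMessage {
    sender_di: [u8; 32],
    ciphertext: Vec<u8>,
    /// Present on fresh messages; dropped once the pack's merkle root is signed.
    signature: Option<Signature>,
}

struct MessagePack {
    messages: Vec<PackedMessage>,
    /// Merkle root over the message hashes, signed with a chat-level DIR key.
    merkle_root: u64,
    root_signature: Signature,
}

fn leaf_hash(msg: &PackedMessage) -> u64 {
    let mut h = DefaultHasher::new();
    msg.sender_di.hash(&mut h);
    msg.ciphertext.hash(&mut h);
    h.finish()
}

/// Fold pairs of hashes together until a single root remains.
fn merkle_root(messages: &[PackedMessage]) -> u64 {
    let mut level: Vec<u64> = messages.iter().map(leaf_hash).collect();
    while level.len() > 1 {
        level = level
            .chunks(2)
            .map(|pair| {
                let mut h = DefaultHasher::new();
                pair.hash(&mut h);
                h.finish()
            })
            .collect();
    }
    level.first().copied().unwrap_or(0)
}
```

The point is that once the root is signed, the per-message `signature` fields can be dropped without losing the ability to authenticate the batch.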
We’d also include a separate signaling pathway for unencrypted channels. They’d provide the same “interface” to the application-layer data, so user interfaces wouldn’t know the difference.
Messages would still be signed by their sending DIs in order to authenticate that messages are actually sent by who they say they are, but maybe this isn’t always required and we could leave it out when we’re especially trying to conserve bandwidth.
Homeservers should also keep copies of this data and serve it on request instead of relying on other users to fulfill history requests. To implement some administrative requests they’d also have to parse admin command messages to delete messages, etc.
Platforms like Discord and Slack and protocols like Matrix have a notion of groups of rooms that they call guilds (or, wrongly, “servers”), teams, or communities. These groupings have some permission structure to add/remove users, delete messages, etc. I’m going to call these “spaces” for now.
There are kinda two ways to implement this in this protocol: either the whole space shares one big MLS group across all of its channels, or each channel gets its own MLS group.
I think we might want to use a mix of both. The first one could have performance issues with a very large group backing very many high-volume channels. MLS has been tested to support several thousand users with decent performance, but I’m not sure how practical that would end up being on mobile devices where we want to be energy-conscious. It’s also possible a group might want to have a large public chat and some smaller private chats, and we would want those to be separate. But on the other hand, if there are very many channels with all the same users, it becomes annoying if a device needs to sign a whole load of key updates that are all (security-wise) identical.
I can imagine a few use cases that would benefit from a hybrid model, like that large-public-chat-plus-smaller-private-chats setup.
I’m not a UI engineer but we’d have to think of some clever way to show how different rooms in the same space have different security domains so a user can know if some messages are potentially compromised or not.
The server shouldn’t have to care about the different e2ee channels within a space, since user access control should manage that, but there might be some benefit to integrating the two more closely. This would require more design thought, as we may want to place limits on which within-space invites users are able to make, in order to better enforce security policies.
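To illustrate the hybrid model, here's a hypothetical sketch of how channels in a space could map onto MLS groups; none of these names are settled, it's just to show the two security domains side by side:

```rust
type MlsGroupId = [u8; 16];

#[derive(Clone, Copy)]
enum ChannelGroup {
    /// Shares the space-wide MLS group: one key schedule and fewer key
    /// updates for devices, but every space member can decrypt.
    SharedWithSpace,
    /// Dedicated MLS group: a separate security domain, e.g. for a smaller
    /// private channel inside a large public space.
    Dedicated(MlsGroupId),
}

struct SpaceChannel {
    name: String,
    group: ChannelGroup,
}

struct Space {
    space_group: MlsGroupId,
    channels: Vec<SpaceChannel>,
}

impl Space {
    /// Which MLS group a client needs for this channel; this is also what a
    /// UI would inspect to show the channel's security domain to the user.
    fn group_for(&self, channel: &SpaceChannel) -> MlsGroupId {
        match channel.group {
            ChannelGroup::SharedWithSpace => self.space_group,
            ChannelGroup::Dedicated(id) => id,
        }
    }
}
```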
Clients should maintain some kind of inbox abstraction to receive messages. For the first version the inbox is a queue maintained by their homeserver, but there could be many sources for it, like Nostr. This inbox system shouldn’t (and can’t) use the full e2ee MLS scheme; instead it would just be encrypted using a static key, which is good enough since it’s only used for bootstrapping. This may also be how clients first receive room invites.
For spaces, there likely has to be some (encrypted?) registry data describing the structure of rooms and channels in a space. Otherwise I can’t imagine how a user would learn about it.
I suppose you could also use this for one-off 1:1 messages, or homeserver broadcast notification messages.
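A sketch of what might come out of that inbox queue once the static-key envelope is decrypted; the names and variants here are guesses, not a wire format:

```rust
/// What a device might find after decrypting an item from its inbox queue.
/// The queue itself is held by the homeserver and encrypted to a static
/// per-device key rather than the full MLS scheme.
enum InboxItem {
    /// A room invite (e.g. carrying an MLS Welcome) used to join a channel.
    RoomInvite { channel_id: String, welcome: Vec<u8> },
    /// Registry data describing the structure of rooms/channels in a space.
    SpaceRegistryUpdate { space_id: String, registry: Vec<u8> },
    /// A one-off 1:1 message or a homeserver broadcast notification.
    DirectMessage { from: String, body: Vec<u8> },
}
```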
File attachments like images are complicated. They can be very large and we don’t want to force users to download them just to read a message. I think a neat way to support this would be to partly handle it outside of the core e2ee protocol, where channels/spaces can have a file repository like TeamSpeak used to have, with some expiration set (30 days?). Uploads would get encrypted with single-use keys and those keys would be transmitted in-protocol. Homeservers could gossip these files to each other for replication and throw them away after the expiration. If a user wants a file after it expires, they could send a message asking for it to be reuploaded. Small thumbnails could also be embedded into the room messages.
We could also integrate BitTorrent/IPFS support into this somehow for some out-of-band archival, maybe.
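The in-protocol part of an attachment could then be a small reference message along these lines (hypothetical names; the 30-day figure is just the placeholder from above):

```rust
struct AttachmentRef {
    /// Where the encrypted blob can be fetched (homeservers may gossip and
    /// replicate it until it expires).
    repository_url: String,
    /// Single-use symmetric key, transmitted only inside the MLS-protected
    /// channel, so homeservers never see plaintext.
    content_key: [u8; 32],
    /// Hash of the encrypted blob so clients can verify what they fetched.
    content_hash: [u8; 32],
    /// Seconds until homeservers are allowed to throw the blob away
    /// (e.g. ~30 days); after that a client can ask for a re-upload.
    expires_in_secs: u64,
    /// Small thumbnail embedded directly in the room message.
    thumbnail: Option<Vec<u8>>,
}
```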
We can tune MLS to allow larger-range reordering of application messages and dropping messages to support live speech. These voice channels would have to pre-negotiate epoch state to enable fast joining.
This is definitely not going to be in the first version. I also don’t want to implement a VoIP system, so maybe we could just use Mumble as a library.