DHTs like Kademlia construct a distributed key-value storage system. Typically this is content-addressed, where the key is a hash of the value. Examples of this are BitTorrent (infohashes) and IPFS (CIDs). But a more difficult problem I haven’t seen a solution to is indexing this network. Let me clarify…
Let’s say that messages have some structure to them. For any message m, we can construct a list of “tags” the message has. Examples of tags would be the hashtags on a tweet, the author of a document, etc. It doesn’t matter what they are, as long as it’s trivial for anyone to extract the tags any message has.
How would we build a system that allows a user to query for messages with some tag t? And how would they know whether they’ve found them all? I’ll answer that second one now: in unstructured p2p systems it’s impossible to know (since anyone can make a message and not tell anyone about it), but we can maybe give them a rough view of the messages that have been exchanged and what they could be missing.
I’ve been thinking about this problem for some time and I think I’ve developed a partial solution. I’ll discuss these ideas and point out the areas that I’m unsure of, which I hope someone else can figure out.
The simplest possible way to solve this problem is to have some special entities that save all messages and do the indexing locally. Users can just ask them for all the messages with a tag. This would work, and it’d be very fast. But obviously, the issue is that these entities would have to remember every message ever sent, and they’d be able to withhold the information that certain messages have a tag. It’s also a privacy leak.
But we can partially address the withholding issue. If this set of special entities was large enough, then users could ask multiple of them in parallel and just do the set union of the responses. If there’s very many of them, they could also fan out their queries to partially mitigate the privacy leak.
But that’s still not very good. This also assumes they’re trustworthy and won’t misbehave. They can also just lie and say “here’s a bunch of IDs of messages that have this tag”, but when you look them up they don’t actually have the tag. A naive solution to that problem is to require that they sign their responses and start removing them as soon as they misbehave, but that’s a fragile solution, so it kinda sucks.
I have to briefly explain a key concept in how Kademlia (and other DHTs) works. Kademlia’s key innovation is the “XOR metric”, used to compute a “distance” between two points in a space. The key space is just hashes, so entries in the space are evenly distributed. Peers compute this distance between themselves and other peers in order to figure out how to structure their local routing tables. Peers also compute the distance between themselves and a message to determine whether they should store it or not. If it’s below a certain distance, they’ll keep it. The threshold distance is chosen based on how large a database they want to store. Since messages are evenly distributed in the key space (by nature of being hashes), this works pretty well, and two nodes with the same threshold distance store roughly the same amount of data.
So this gives the start of an idea. We can continue it, so that nodes compute their distances to each tag in the key space. Then they’d store the message IDs of all the messages they see that have a matching tag, and they can gossip with other peers that store the tag to find other messages they have.
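To make that concrete, here’s a minimal Python sketch of the XOR metric and this tag extension; the threshold constant and function names are made up for illustration:

```python
# Minimal sketch of Kademlia's XOR metric plus the per-tag extension above.
# The threshold value and names are illustrative, not from any real implementation.
import hashlib

def key(data: bytes) -> int:
    """Map anything into the key space by hashing it."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")

def distance(a: int, b: int) -> int:
    """Kademlia's XOR metric: the 'distance' between two points in the key space."""
    return a ^ b

# A peer picks a threshold based on how large a database it wants to store.
STORE_THRESHOLD = 2 ** 250  # hypothetical value

def should_store_message(peer_id: int, message: bytes) -> bool:
    return distance(peer_id, key(message)) < STORE_THRESHOLD

def should_index_tag(peer_id: int, tag: bytes) -> bool:
    # The extension: a peer also tracks message IDs for tags that hash close to it,
    # and gossips with the other peers responsible for the same tag.
    return distance(peer_id, key(tag)) < STORE_THRESHOLD
```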
The obvious issue with this scheme (and a bunch of variations on it that I thought about) is what I am calling a “lumpy key space”. Since multiple messages are expected to have the same tag, some tags will naturally have far more messages than others. Since we don’t want the user to make per-tag changes to their configuration, or require that nodes know that per-tag information in order to do routing/queries properly, this is an important problem to solve. We should be able to treat all tags equally.
I considered several schemes in this family where we compute distances between the peer, the message, and the tag in different ways, and partition the key space in different ways. I could not figure out a way around this lumpy key space problem. I’ll refrain from enumerating them all. Any solution that seemed like it might work had some flaw.
From the start, we have been trying to build a distributed indexing scheme, and treating the whole key space as a flat structure is part of our problem. Database indexes work by maintaining B-trees or some other kind of structure that references entries. So a new insight I eventually had was that a proper solution should involve building this index data structure explicitly and then distributing the nodes of that tree (the index for the whole system) out across the key space.
As a sketch of a design, we have peers generate index nodes as needed and share them across the network. We can progressively update the index as new messages with tags are found. Index nodes reference child nodes, and these can be stored like any other content-addressed message, which Kademlia already does well. This is sorta like that first idea, which basically asks “what if we just ask people what nodes they’ve seen?”.
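Here’s a tiny sketch of what I mean by index nodes being content-addressed just like anything else; the JSON encoding is only a placeholder, not a proposed wire format:

```python
# Index nodes are content-addressed: their ID is the hash of their encoding,
# and branch nodes reference children by those IDs.
import hashlib
import json

def node_id(node: dict) -> bytes:
    encoded = json.dumps(node, sort_keys=True).encode()  # placeholder canonical encoding
    return hashlib.sha256(encoded).digest()

leaf = {"kind": "leaf", "tag": "sometag", "message_id": "ab12cd34"}
branch = {"kind": "branch", "children": [node_id(leaf).hex()]}

dht = {}  # stand-in for the DHT's key-value store
dht[node_id(leaf)] = leaf
dht[node_id(branch)] = branch  # fetching a subtree is just ordinary lookups
```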
The above sketch of a design has a major flaw. How do we stop someone from making fake index nodes that point to malformed/non-existent nodes or including messages that don’t actually have the tags they say they have?
The key ingredient here is a SNARK, or “succinct non-interactive argument of knowledge”. These have been used extensively in other kinds of systems, but I have not seen them used in asynchronous distributed p2p networks like this. There are a couple of ways to define them, but for our purposes, we can say that they let us define some function f(w, x): bool and produce a “proof” (a blob of bytes) which attests that, for some witness value w, the function f returns true on the public input x, but without revealing w.
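To pin down what I’m assuming from the proving system, here’s the rough shape of the API in Python; this is an abstract sketch, not the interface of any real SNARK library:

```python
# Abstract sketch of the assumed SNARK interface. A real system would plug in a
# concrete proving scheme (ideally one that supports recursion, see below).
from dataclasses import dataclass

@dataclass
class Proof:
    blob: bytes  # opaque bytes attesting that f(w, x) == True for some witness w

class ProvingSystem:
    def prove(self, f, w, x) -> Proof:
        """Produce a proof that f(w, x) returns True, without revealing w."""
        raise NotImplementedError  # hypothetical; stands in for a real prover

    def verify(self, f, x, proof: Proof) -> bool:
        """Check a proof against only the public input x."""
        raise NotImplementedError  # hypothetical; stands in for a real verifier
```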
We can use this to ensure that our index nodes are well-formed by using recursive SNARK proofs (which is a thing you can do). An index node would come with an attached proof, which attests to its correctness, and transitively attests to the correctness of every node underneath it in its subtree.
The leaf node proofs would vary depending on the contents of the message. These would somehow attest that the message with ID m referenced by the leaf node contains a particular tag t. The first layer of branch nodes above the leaves would accept these proofs instead of the recursive branch node proofs.
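Written as plain Python checks (in the real design this logic would live inside the SNARK circuits), the statements being proven look roughly like this; extract_tags is a hypothetical stand-in for whatever tag extraction rule the system agrees on:

```python
# The statements the proofs attest to, written as ordinary checks for clarity.
import hashlib

def extract_tags(message: bytes) -> set:
    # Hypothetical rule: hashtag-like words are the message's tags.
    return {word for word in message.split() if word.startswith(b"#")}

def leaf_statement(message: bytes, message_id: bytes, tag: bytes) -> bool:
    # Leaf proof: "the message with this ID really does carry this tag".
    return hashlib.sha256(message).digest() == message_id and tag in extract_tags(message)

def branch_statement(children: list, child_proofs: list, verify_proof) -> bool:
    # Branch proof: "every immediate child's proof verifies", which transitively
    # covers the correctness of the whole subtree under this node.
    return all(verify_proof(c, p) for c, p in zip(children, child_proofs))
```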
The index takes the form of a big tree. The tree’s degree will probably be something like 256 or even 64K, since high fan-out keeps the tree shallow, which helps with queries. The tree is sorta two-tiered. The first tier is keyed by tag, and below that, under each tag’s node, is another tier with an entry for each message that has that tag.
We need to refer to nodes in two different ways. Nodes naturally have an ID, which is their hash, and that’s what we use to query them. But we also have to track the prefix of a node, i.e. the path down the tree from the root to the node. This is important to remember, since we’re often likely to have multiple “conflicting” nodes with the same prefix.
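A small sketch of tracking both kinds of reference at once; the layout here is just illustrative:

```python
# Track nodes both by ID (content hash, used for lookups) and by prefix
# (path from the root, used to notice conflicting versions of the same slot).
from collections import defaultdict

nodes_by_prefix = defaultdict(set)  # prefix -> set of node IDs seen at that prefix

def observe_node(prefix: tuple, node_id: bytes) -> None:
    nodes_by_prefix[prefix].add(node_id)

def conflicting_ids(prefix: tuple) -> set:
    # More than one ID at the same prefix means there's rezippering to do later.
    return nodes_by_prefix[prefix]
```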
When constructing a branch node, we generate a proof that attests to the correctness of the node itself, but also the correctness of each of the proofs of the immediate children that the node references. When a peer receives index nodes, they verify the attached proofs and reject them if the proofs are invalid.
When we are generating the index nodes for a new message, we update relevant nodes in our local index, regenerate the proofs for those nodes, and publish those new nodes with proofs when we publish the message to other peers.
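Roughly, the publish path could look like the following; local_index, prove_leaf, prove_branch, and publish_to_peers are hypothetical stand-ins for the local tree, the provers, and the gossip layer:

```python
# Sketch of publishing a new message: regenerate the index nodes and proofs
# along its path, then ship them together with the message itself.

def publish_message(message, tags, local_index, prove_leaf, prove_branch, publish_to_peers):
    for tag in tags:
        # Nodes from the leaf for this (tag, message) pair up to the root.
        path = local_index.path_for(tag, message)
        path[0].proof = prove_leaf(message, tag)  # leaf: "this message has this tag"
        for branch in path[1:]:
            # Each ancestor's proof must be regenerated to cover the updated child.
            branch.proof = prove_branch(branch.children)
        publish_to_peers(message, path)  # message plus the new nodes and proofs
```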
So to make queries for messages with a tag, a user will traverse down the tree to find the tag’s subtree root, and then progressively search through the key space starting from either end. This should be done in parallel with some kind of mixed depth-and-breadth search, so that a few flaky nodes don’t cause our lookups to hang. It should also be highly amenable to batching: as we explore, we’re likely to discover that some remote peer we’re already assembling a request for is expected to have nodes in completely unrelated subtrees, and we can get responses for all of them back in a single request.
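Here’s a simplified sketch of that traversal with per-peer batching; route_to_peer and fetch_nodes are hypothetical, and a real version would issue these requests in parallel and tolerate flaky or missing peers:

```python
# Walk a tag's subtree by expanding a frontier of unexplored child references,
# batching all the IDs destined for the same remote peer into one request.

def query_tag(tag_root_id, route_to_peer, fetch_nodes):
    message_ids = set()
    frontier = {tag_root_id}
    while frontier:
        by_peer = {}
        for node_id in frontier:
            by_peer.setdefault(route_to_peer(node_id), set()).add(node_id)
        frontier = set()
        for peer, ids in by_peer.items():
            for node in fetch_nodes(peer, ids):  # proofs are verified on receipt
                if node["kind"] == "leaf":
                    message_ids.add(node["message_id"])
                else:
                    frontier.update(node["children"])
    return message_ids
```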
Nowhere here do we address the possibility of a party repeatedly generating index nodes referencing subtrees which they don’t actually share, or which reference messages that haven’t been published. I believe we can sidestep this with some rules around when we decide to accept new nodes in the tree, and during “rezippering”, see below.
We can avoid some work by doing what a lot of indexes do when a subtree has only a single leaf: truncate the rest of the path. We can change how we define the proof to allow for directly generating branch nodes for these single-leaf subtrees. I’m handwaving this on purpose because it doesn’t matter that much for the idea.
We expect that there will be some “churning” operation we continually have to do to merge conflicting nodes for the same prefix. Not everyone has to do this; only one party does, but we can tolerate more than one doing it concurrently for a subtree.
In addition to the references to each child, branch nodes should also contain the total number of leaves in their subtree. We can compute this from the immediate children if they’re branch nodes. This is important, since it lets peers roughly measure how “good” a node is when deciding what to keep around.
We have to “rezip” the children under the conflicting nodes, going bottom-up and regenerating proofs wherever we have conflicts. This is highly amenable to batching, and when we pass around a bundle of updated nodes, peers can easily determine that “yes, these are better nodes than the ones I have, they don’t discard any information, and I can replace my nodes with this smaller set”. This procedure also helps blunt withholding of nodes, since references to those non-existent nodes just sit along the edges of the tree. But I’m still not quite sure this works. It feels like we should always be able to unzip sets of conflicting nodes in order to rezip them.
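Here’s a rough sketch of one rezippering step: merge the conflicting nodes at a prefix by taking the union of their children and regenerating the proof. prove_branch is again a hypothetical stand-in for the recursive prover:

```python
# Merge conflicting nodes that share a prefix. The union of children means no
# information is discarded, so peers can safely replace their old nodes.

def rezip(conflicting_nodes, prove_branch):
    merged_children = {}
    for node in conflicting_nodes:
        for child in node["children"]:
            merged_children[child["id"]] = child
    children = list(merged_children.values())
    merged = {
        "children": children,
        # Leaf counts are summed from the immediate children, so peers can
        # roughly judge how "good" this node is compared to what they hold.
        "leaf_count": sum(c.get("leaf_count", 1) for c in children),
    }
    merged["proof"] = prove_branch(children)
    return merged
```

This would be applied bottom-up, deepest conflicting prefixes first, so each regenerated proof only ever references already-merged children.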
It’s expected that peers will have several instances of nodes near the top of the tree which have not been merged yet. When making a query, they will have to start queries using the data in all of these conflicting roots / near-root nodes. However, we expect these paths to converge a few layers down when we reach a stable subtree, so the overlapping work only happens near the top of the tree.
The nodes closer to the root are more important than nodes farther down. We also expect there to be more conflicting instances of them, since they’re being updated more often. Everyone is expected to store root nodes, since that’s where every lookup starts.
So we alter our condition for deciding which index nodes to store, making it more closely resemble the routing table rules in Kademlia: the higher up in the tree a node is, the farther away (in the XOR metric) we’re still willing to store it. I am not exactly sure how this will work.
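As a guess at what that altered condition could look like, here’s one plausible shape where the acceptance radius shrinks as we go deeper; the exact curve is one of the things I’m unsure about:

```python
# Hypothetical storage rule: everyone stores the root, and the deeper a node
# sits in the tree, the closer (in the XOR metric) it must be to the peer.

def should_store_index_node(peer_id: int, node_id: int, depth: int,
                            base_threshold: int) -> bool:
    if depth == 0:
        return True  # root nodes: stored by everyone, since every lookup starts there
    threshold = base_threshold >> depth  # halve the acceptance radius per level
    return (peer_id ^ node_id) < threshold
```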
If you store a message (due to it being in your message radius), you are also expected to store the index nodes and proofs up the tree for it. This does increase the effective storage cost per message, which could be bad for relatively small messages, but there could be optimizations here.
Let’s say we actually want to support tags with specific values, i.e. key-value tags. This is fine; we just add another tier to the index tree. This could slow down lookups, but we can apply the same path suffix truncation trick.
We might also want to namespace the tags, to avoid conflicts between different applications whose tag data might overlap. One way to do this is to just require some namespacing convention for how tags are defined, or we can add another tier. Alternatively, namespaces could be the hash of some compact sandboxed program that extracts the tags from a message. The programs could be stored content-addressed as usual.
Combined, that makes 4 tiers: namespace -> key -> value -> message
With 32 byte hashes, that means 128 byte paths. That’s pretty long, but if we use 20 byte hashes, that means 80 byte paths, which feels a bit better.
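For concreteness, here’s how building the full path could look, with the length arithmetic checked at the end; the truncated-SHA-256 hashing is just for illustration:

```python
# Build the 4-tier path: namespace -> key -> value -> message.
import hashlib

def h(data: bytes, size: int = 32) -> bytes:
    return hashlib.sha256(data).digest()[:size]  # illustrative truncation

def index_path(namespace: bytes, key: bytes, value: bytes, message: bytes,
               size: int = 32) -> bytes:
    return h(namespace, size) + h(key, size) + h(value, size) + h(message, size)

assert len(index_path(b"ns", b"author", b"alice", b"hello world")) == 128
assert len(index_path(b"ns", b"author", b"alice", b"hello world", size=20)) == 80
```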
I am not completely sure about the ideal procedure for querying a tag’s messages. There’s a tradeoff in the tree’s degree between the cost of updating a proof and the time to query nodes, and I’m not sure where the right balance is.
I am not sure how the path truncation trick will impact updating proofs when we have to demote a leaf due to adding nearby cousins to the tree.
The rules for how to accept “better” nodes (i.e. with more cumulative leaves) need to be better characterized in order to deal with child withholding. I think we could get away with just always going bottom-up when we share new nodes that aren’t upgrades of existing known nodes.
some other stuff probably