Project Ideas

Updated 2024-07-21.

This doc is a list of ideas I’ve had that I’m pretty sure I’ll never have the time to work on. But I wanted to list them out so that someone else who’s looking for a project to pick up can work on them.

If you start working on one of these projects in a way that I like and get it to a sufficiently mature state, send me a message and I might update the entry with a link to it. Only FOSS projects accepted.

YouTube mirroring

PeerTube is pretty cool. It’s a video hosting software package designed so that popular videos can be shared between viewers to reduce load on the upstream server, and so that videos on one PeerTube instance can be mirrored to other instances, with all of the viewers across the network sharing with each other. It does this with BitTorrent and a port of it that operates over WebRTC, called WebTorrent. There are also options that let an instance administrator specify specific instances, or users on other instances, whose videos should always be mirrored.

So I was thinking that someone ought to develop something that does mirroring like this, but from YouTube. Different people would configure their instances to download from channels they watch, then maybe the streams would be signed, and then they could be mirrored to others’ instances. Since it’s hard to authenticate the videos the way BitTorrent does, this would take the form of a “friend-to-friend” network where you have a modest level of trust in the parties you’re peering with. That would be necessary since, if it took off, it would probably invite frivolous litigation and would get YouTube pissed off. This kinda resembles a private tracker or a private Usenet newsgroup, where you have to be invited into a close circle but then have free access to the content.

This would be especially useful in the coming years, when it seems like YouTube might really take steps to kill adblockers. Someone on HN mentioned that if they keep it up too much, people will regress to downloading videos from channels/topics they’re interested in ahead of time so they can be streamed as needed.

Auditable messageboard platform

Because so many people think that blockchains are the answer to auditability, someone oughta show how they’re not.

The point is that there’s a ton of use cases where you really just care that some data was posted publicly around a certain time. There’s no settlement operation or anything like that, like maybe what you’d use a blockchain for. This includes all of the “supply chain” applications people are ascribing to blockchains that they absolutely should not be.

I’m imagining a core data model where you have users (or “actors”, maybe) and collections. Users post messages to a collection, and the server could share all the messages that have been posted to the collection. Users could also attach cryptographic signatures to their messages, if they wanted to have higher authenticity.

The server would continually maintain commitments to every message each user has published, in a Merkle tree or some other convenient accumulator, as well as a commitment to an index of the messages posted to each collection. Perhaps we could add a “tag” construct that can be attached to messages and indexed by the server separately.
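As a rough illustration (a minimal Python sketch; the Message fields and function names are placeholders I made up, not a real design), the commitment to a collection could be as simple as a Merkle root over the hashes of the messages posted so far:

    import hashlib
    from dataclasses import dataclass

    @dataclass
    class Message:
        actor: str               # who posted it
        collection: str          # which collection it was posted to
        body: bytes              # opaque payload; the application defines its meaning
        signature: bytes = b""   # optional, for higher authenticity

    def leaf_hash(msg: Message) -> bytes:
        h = hashlib.sha256()
        for field in (msg.actor.encode(), msg.collection.encode(), msg.body, msg.signature):
            h.update(len(field).to_bytes(8, "big"))   # length-prefix each field to avoid ambiguity
            h.update(field)
        return h.digest()

    def merkle_root(leaves: list[bytes]) -> bytes:
        # Fold a list of leaf hashes up into a single commitment.
        if not leaves:
            return hashlib.sha256(b"").digest()
        level = list(leaves)
        while len(level) > 1:
            if len(level) % 2:                        # duplicate the last node on odd-sized levels
                level.append(level[-1])
            level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                     for i in range(0, len(level), 2)]
        return level[0]

A real implementation would want an append-only tree so that old inclusion proofs stay valid as new messages arrive, rather than recomputing from scratch, but the point is that the commitment machinery is tiny compared to a blockchain.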

On top of these core constructs, you could build anything. The simplest thing is a messageboard, where you create threads (collections), post a reference to each thread in a forum (another collection), and users post to a thread by adding messages to its collection. The supply chain example is another one that ought to fit into the data model pretty easily. The server can enforce well-formedness constraints for the high-level application it’s configured for, rejecting messages that don’t semantically match the data model of the collections they’re meant for, the same as any web platform would. Optimizations to present a nice UI quickly can and should be made around the core constructs.

Generality is the key: the core foundation libraries should be very generic and make it easy to program whatever use case you’d want on top of them.

If you wanted to implement some basic voting system, you could extend this to support messages with anonymous senders: you could have the server enforce that users specify cryptographic keys, then each question would be its own collection, and the messages would be votes that include a ring signature over all of the users allowed to vote on a particular question. Similarly, the server should be able to sign that it acknowledged a message when it’s submitted, so that if the message isn’t present in a later dump/commitment, an alarm can be raised that the server might be misbehaving. This could be triggered automatically.
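For the acknowledgment part, here’s a minimal sketch of what a signed receipt from the server could look like (using Ed25519 from the cryptography package; the receipt format is just something I made up):

    import hashlib, json, time
    from cryptography.hazmat.primitives.asymmetric import ed25519

    server_key = ed25519.Ed25519PrivateKey.generate()   # long-lived key, published out of band

    def acknowledge(message_bytes: bytes) -> dict:
        # Return a signed receipt the submitter can hold onto.
        receipt = {
            "msg_hash": hashlib.sha256(message_bytes).hexdigest(),
            "received_at": int(time.time()),
        }
        payload = json.dumps(receipt, sort_keys=True).encode()
        receipt["signature"] = server_key.sign(payload).hex()
        return receipt

If a message with a valid receipt never shows up under a later published commitment, the receipt is evidence that the server lost it or is lying, and a client could raise that alarm automatically.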

Periodically the server would take the master commitment to the whole system and post it in a public place where it can be externally observed. The purest place to do this would probably be OpenTimestamps, but it would also make sense to toot it on Mastodon, put it on a Lemmy board (or Twitter or Reddit), email it to a public mailing list, or any of countless other things.
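For the Mastodon option, publishing the commitment is one authenticated HTTP call; a sketch, assuming you’ve registered an application on some instance and have an access token for a bot account (both placeholders here):

    import requests

    INSTANCE = "https://mastodon.example"   # placeholder instance URL
    ACCESS_TOKEN = "..."                    # placeholder token

    def publish_commitment(root_hash_hex: str) -> None:
        # POST /api/v1/statuses is Mastodon's standard endpoint for creating a post.
        resp = requests.post(
            f"{INSTANCE}/api/v1/statuses",
            headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
            data={"status": f"commitment {root_hash_hex}"},
        )
        resp.raise_for_status()

For the OpenTimestamps route, I believe the stock client’s ots stamp command already covers the purer version of the same thing.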

Then bulk database dumps could be provided along with the updated commitment as a torrent, so anyone can mirror and cross-reference the state. Anyone could take the database dump and reconstruct the state of the high-level application from it.

Enhanceificated YouTube mirroring

I was made aware of Town Crier a few weeks ago.

The TLDR of Town Crier is that it uses what’s called a “trusted execution environment” that’s embedded in your processor by the manufacturer. This “TEE” can give an attestation, verifiable by a third party, that some program was executed correctly and confidentially. I don’t normally like protocols that use TEEs, since they’re actually really bad at ensuring those guarantees, often in catastrophic ways, but the full consequences depend on the context in which they’re used. However, for this scenario, they might be a good tool.

Town Crier essentially just runs a TLS session inside the TEE, allowing someone to make cryptographic statements about the response from an HTTPS server. You could plausibly leverage this to attest that you received a particular video stream from YouTube’s servers and package it up to be distributed on a p2p network like BitTorrent. This would sidestep some of the trust issues with the YouTube mirroring idea I described above, and it seems like it would integrate better with Invidious, perhaps? The goal would be to make the system scale better than it would under merely the “friend-to-friend” trust model of the other idea. And since Town Crier lets you hide parts of the queries, someone who pays for a Premium subscription (and can therefore bypass some of the limits) could act as a single ingress point distributing content to a wider range of consumers without exposing their account details.

In the event that someone does try to poison the network by compromising their TEE to forge attestations and publish fake video streams, it seems like it would be reasonably possible to build a reputation system that lets people report incorrectly published video streams and blacklist the offending publisher locally, identified by their device key. It’s not completely solid, but there’s relatively little incentive to attack this kind of system. Some properties could also be sanity-checked directly by querying YouTube without triggering the blocks, which only seem to trigger when you actually play a video.

Another extension of this would be to track upload credits, like private trackers do. Fetching streams requires a nontrivial amount of bandwidth, so there’s some scarcity in the amount of content that can be “dredged” up from YouTube. You could probably make this work somehow.

As an alternative to Town Crier, a newer technique could employ DECO and perhaps some web-of-trust model.

SMS relay server

I really don’t like my banks’ websites. I don’t like using them. They also don’t provide an API meant for an end user to access their own data; their API offerings are aimed at companies that want to build services around them, so the banks can get an additional revenue stream from selling access to the API. I just want to be able to access my balances and maybe payment history programmatically, so I can check all my balances across all my accounts everywhere. I’ve tried to do it the right way with OFXDirectConnect and the other things GnuCash supports, but it didn’t work.

So the alternative is to become a scrapoor. Just use Selenium to log into each bank’s website and walk through the login prompt. The problem is that banks use insecure SMS 2FA, so we have to be able to receive the SMS and pull out the code to complete the login flow. Normally you’d use something like Twilio for this, but the bank insists on sending the code to your personal phone number. The solution here should be pretty obvious: have a relay app on your phone that gets woken up by Android whenever an SMS comes in and submits it to a tiny server you’re running, which we can poll every few seconds while we’re waiting for the text to come in. As long as the phone is running, it should be able to pass off the incoming texts so our main scraper script can receive the messages.

It seems like you’d use Android’s broadcast system to receive incoming SMS intents.

And then the relay server could be stupidly simple: just a FastAPI service that writes messages to files and lets you filter by string matching.
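A sketch of roughly what I have in mind (endpoint paths, field names, and the on-disk layout are all placeholders): the phone POSTs each incoming text, and the scraper polls with a substring filter until the 2FA code shows up.

    # relay.py -- run with: uvicorn relay:app
    import json, time
    from pathlib import Path
    from fastapi import FastAPI

    app = FastAPI()
    MSG_DIR = Path("messages")
    MSG_DIR.mkdir(exist_ok=True)

    @app.post("/sms")
    def receive_sms(payload: dict):
        # Called by the phone app whenever a text comes in.
        path = MSG_DIR / f"{time.time_ns()}.json"
        path.write_text(json.dumps(payload))
        return {"stored": path.name}

    @app.get("/sms")
    def list_sms(contains: str = ""):
        # Polled by the scraper; returns messages whose body contains the filter string.
        messages = [json.loads(p.read_text()) for p in sorted(MSG_DIR.glob("*.json"))]
        if contains:
            messages = [m for m in messages if contains in m.get("body", "")]
        return messages

The scraper side would then just loop on GET /sms?contains=... every few seconds while Selenium sits on the 2FA prompt.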

As I was about to finish writing this section, I was checking around on the F-Droid store and found out that the thing I wanted basically already exists. It’s called SMS to URL Forwarder, and it does basically exactly what I want from the Android app part of this. So really I’d just want the relay server part of this. (source code)

Wifi Dead Drop

There was this art(?) project someone developed around this idea of a USB dead drop. The idea was to cement a flash drive into a brick wall so that people could plug their computers into it to share files.

In 2024, we know that it’s kinda a bad idea to plug random flash drives into your laptop, since they might be rubber duckies or something. But we have way cheaper wifi-enabled devices now!

So the idea I had was to recreate the dead drop as a wifi hotspot. You’d have a cheap and power-efficient embedded device like an ESP32 that would broadcast a wifi network. Instead of providing internet connectivity, it would just expose an HTTP server, presented to the user the way a captive login portal would be. Users could upload and download files through this interface, and perhaps the person who set it up could have a passcode they remember so they can delete mean files.
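A minimal sketch of the ESP32 side in MicroPython, just the open access point and a bare-bones HTTP listener (the upload/download UI and captive-portal behavior would be layered on top, and the exact config keyword names vary a bit between MicroPython ports):

    import network
    import socket

    # Bring up an open access point instead of joining an existing network.
    ap = network.WLAN(network.AP_IF)
    ap.active(True)
    ap.config(essid="dead-drop", authmode=network.AUTH_OPEN)

    # Serve a trivial landing page to anyone who connects.
    server = socket.socket()
    server.bind(("0.0.0.0", 80))
    server.listen(1)
    while True:
        conn, _addr = server.accept()
        _request = conn.recv(1024)   # ignore the request details for now
        conn.send(b"HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n")
        conn.send(b"<html><body><h1>Dead drop</h1><p>Upload/download UI goes here.</p></body></html>")
        conn.close()

To get phones to pop the captive-portal sheet automatically, you’d also want to answer every DNS query with the ESP32’s own address.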

I did some research, and the ESP32 is definitely powerful enough to support this, and there are versions of it with built-in wireless network interfaces. A really cool way to deploy it would be to build a small enclosed package with a solar panel that could be placed in some inconspicuous spot on top of a structure, able to broadcast the hotspot to nearby people. From what I found, even a small solar panel would provide enough power to keep an ESP32 running, at least if it stays in a low-power state most of the time when it isn’t actively exchanging data, so keeping a battery as a buffer would give plenty of overhead.

Alternatively, you could package an ESP32 into a small housing and plug it in somewhere inconspicuous, like behind a vending machine, possibly with a power passthrough to avoid it being seen as something that’s not supposed to be there. I also did some research in this direction, and it seems that it could be built for ~$20.

Perhaps an extended version of this would include bluetooth file exchange, or maybe support AirDrop to advertise itself to Apple devices.

GZipped Tarball Indexer

For those unfamiliar, files with the extension .tar.gz are archive files traditionally used on UNIXy systems to package up stuff and pass it around. A .tar on its own packages up a bunch of files with their metadata into a single file, but typically a layer of GZip compression is added around it, which is nice because compressor state is maintained across all of the files in the archive, resulting in somewhat smaller output. This is in contrast to the .zip format, which compresses each file individually: the overall size is slightly larger, but you can extract individual files efficiently because the index of the file structure is stored outside of the compression.

Sometimes you do want to pick out a specific file from a .tar.gz without having to decompress the whole thing onto disk. This is possible if you know the exact file path, but you might not always know it, and it still requires decompressing the whole tar up to the file’s position.

So what would be nice is a program that makes an in-memory decompression pass over a .tar.gz, builds an index of where each file’s data lives (essentially, pulling out all of the .tar file headers), records checkpoints of the GZip decompressor state periodically, and writes all of that out to a separate file. Then, when we want a specific file, we can consult the index, look up the checkpoint just before where the data actually sits in the archive, and extract just the parts we want, with minimal extra decompression work.
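The index-building half is easy to sketch with Python’s standard tarfile module; the checkpointing half is the fiddlier part (you’d snapshot the decompressor state, e.g. with zlib.decompressobj().copy(), or lean on an existing tool like indexed_gzip or gztool). Note that offset_data is a long-standing but undocumented TarInfo attribute, so treat this as an assumption:

    import json
    import tarfile

    def build_index(archive_path: str, index_path: str) -> None:
        # Single streaming pass: record where each member's data sits in the uncompressed stream.
        entries = []
        with tarfile.open(archive_path, mode="r:gz") as tar:
            for member in tar:
                if member.isfile():
                    entries.append({
                        "name": member.name,
                        "offset": member.offset_data,   # start of the file's data, uncompressed offset
                        "size": member.size,
                    })
        with open(index_path, "w") as f:
            json.dump(entries, f, indent=2)

With periodic decompressor checkpoints stored alongside this, extracting one file means seeking to the nearest checkpoint before its offset and decompressing forward from there instead of from byte zero.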

This does require doing a first pass, but it’s enough for some use cases.

Smart archiver

I’ve done a lot of work with Minecraft servers. Minecraft stores world data in a format called Anvil, with extension .mca. An Anvil file is essentially a container of chunks, each stored as compressed NBT, a tagged tree structure format similar to msgpack.

Each file stores a 512x512-block “region” (a block is roughly a meter) out of the infinite grid making up Minecraft worlds, so for a world that’s been around for a while, you can have hundreds of these Anvil files in the world data directory.

In a related vein, all Minecraft code libraries are distributed as Jar files, which are just zips with some metadata. This isn’t just the game and its core libraries; it’s also all of the (again, usually dozens but sometimes hundreds of) mods and plugins the server operator installs along with it.

And then the server logs on Spigot servers are also gzipped, at least one file per day, sometimes multiple files if it’s restarted.

Individually, compressing each of these files makes plenty of sense. It keeps things small and well-contained. But if you want to make a backup of a whole server installation (all of these are usually packaged into a single directory structure), then you’re wasting a lot of effort trying to compress already-compressed data. If you’re making a zip, it’ll probably notice that a file is already compressed and store it without another layer of compression, but if you’re making an archive you probably want to prioritize space efficiency over ease of access.

So what would be nice is a smart compressor that can identify files like this, decompress them individually, and recompress them all together with a much stronger compressor like Zstd, so we can take advantage of the shared patterns across all of these files (especially the Anvils). It’d have to maintain some bookkeeping, like checksums of the original and unpacked files, for sanity checking.
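Here’s a rough sketch of the packing side in Python, handling only the plain .gz case (Jars and Anvil region files would need their own unpacking logic) and using the third-party zstandard package; the .unpacked marker and MANIFEST.json name are just conventions I made up for the sketch:

    import gzip, hashlib, io, json, os, tarfile
    import zstandard  # third-party: pip install zstandard

    def pack(src_dir: str, out_path: str) -> None:
        manifest = {}   # checksums of the original files, for sanity checking on restore
        cctx = zstandard.ZstdCompressor(level=19)
        with open(out_path, "wb") as raw, cctx.stream_writer(raw) as zout, \
             tarfile.open(fileobj=zout, mode="w|") as tar:
            for root, _dirs, files in os.walk(src_dir):
                for name in files:
                    path = os.path.join(root, name)
                    arcname = os.path.relpath(path, src_dir)
                    with open(path, "rb") as f:
                        data = f.read()
                    manifest[arcname] = hashlib.sha256(data).hexdigest()
                    if name.endswith(".gz"):
                        data = gzip.decompress(data)   # strip the inner compression layer
                        arcname += ".unpacked"         # marker so restore knows to re-gzip it
                    info = tarfile.TarInfo(arcname)
                    info.size = len(data)
                    tar.addfile(info, io.BytesIO(data))
            # Store the manifest inside the archive itself.
            blob = json.dumps(manifest, indent=2).encode()
            info = tarfile.TarInfo("MANIFEST.json")
            info.size = len(blob)
            tar.addfile(info, io.BytesIO(blob))

Restoring is the mirror image: decompress the zstd stream, re-gzip anything marked .unpacked, and check against the manifest. Re-gzipping won’t necessarily reproduce byte-identical files unless the original compression parameters are recorded too, which is exactly the kind of bookkeeping I mean.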

I realized after having this idea that it’s useful not just for Minecraft servers but for any heterogeneous directory structure we might want to compress; it’s just much easier to explain the reasoning with that context. In practice you’d want to support all kinds of formats that we can transparently decompress, to avoid layering compression.

