I read this BoingBoing post: Tahoe-LAFS: a P2P filesystem that lets you use the cloud without trusting it.
Can we trust it? Can we really trust it? Reading the architecture document for Tahoe-LAFS, I found this text:
In general, anyone who already knows the contents of a file will be in a strong position to determine who else is uploading or downloading it.
…
Also note that the file size and (when convergence is being used) a keyed hash of the plaintext are not protected.
Listen up, folks. A file storage service that reveals file sizes is one that you cannot blindly trust.
Consider the following scenario: You find yourself with a popular pirated movie file, whose size is 2143658709 bytes, and you decide to “securely” share it. The people spying on you can see that you’ve uploaded an encrypted file of this size to the cloud. What are the odds that you happened to coincidentally come across a file that’s exactly the same size? Slim. Sufficiently slim for a subpoena, perhaps, and bam, you could get busted by the copyright cops.
It gets worse. Suppose you’ve extracted a zip file of some illegal content. Even porn is illegal in some countries. The set of file sizes in that collection is, say, some set of values, and we don’t really care what they are. Call it {x1, …, xn}. The evil government watches you one day, and sees that you’ve uploaded a bunch of files, and, in its scan for illegal porn collections, notes that {x1, …, xn} is a subset of the sizes of the files you’ve uploaded. They’ve caught you, with an astronomically small chance of a false positive.
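To make the attack concrete, here is a minimal sketch of the check an observer could run. The function name and the sample sizes are made up for illustration; the only assumption is that the observer can see the exact size of each uploaded file.

```python
from collections import Counter

def contains_signature(observed_sizes, signature_sizes):
    """Return True if every size in the known collection appears among the
    observed uploads at least as often as it appears in the collection."""
    observed = Counter(observed_sizes)
    needed = Counter(signature_sizes)
    return all(observed[size] >= count for size, count in needed.items())

# Hypothetical data: sizes of a known collection, and sizes seen on the wire.
known_collection = [48213, 1530077, 912, 77120455]
uploads = [1024, 48213, 5_000_000, 912, 1530077, 333, 77120455]

print(contains_signature(uploads, known_collection))       # True
print(contains_signature(uploads[:-1], known_collection))  # False
```

Note that the observer never has to decrypt anything. The sizes alone are the fingerprint.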
The fact that you can’t hide those file sizes in the noise of other files is somewhat counter-intuitive. For example, suppose you have an illegal file collection with 100 text files. All their file sizes are between one and a million bytes in length. You upload that collection amid 99900 other files with randomly chosen lengths between one and a million. By the way, that’s an absurdly large number of filler files to be uploading. So now you’ve uploaded a hundred thousand files. Your small subset is safely obfuscated within all those other files, right? Wrong.
What’s the chance that a given set of 100000 numbers, chosen from one to a million, contains a certain subset of 100 numbers? It’s approximately 1/10^100. That’s right: one over a googol. Each number between one and a million has a 10% chance of being included in the set of 100000 random file sizes. Actually, it’s worse than that, since some file sizes will be repeated. I think the expected number of distinct values is about 95163. That makes it more like 1/10.5^100.
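If you want to check the arithmetic, here is a small sketch. It assumes the 100000 sizes are drawn uniformly and independently from one to a million, that the 100 target sizes are distinct, and that each target’s presence can be treated as independent of the others (an approximation). The function names are mine.

```python
import math

N = 1_000_000    # possible file sizes, 1..N
DRAWS = 100_000  # total files uploaded, sizes drawn uniformly at random
K = 100          # distinct sizes in the known collection

def p_size_present(n=N, draws=DRAWS):
    """Chance that one particular size appears at least once among the draws."""
    return 1 - (1 - 1 / n) ** draws

def expected_distinct(n=N, draws=DRAWS):
    """Expected number of distinct sizes among the draws."""
    return n * p_size_present(n, draws)

def log10_p_all_present(k=K, n=N, draws=DRAWS):
    """log10 of the chance that all k sizes appear, treating them as independent."""
    return k * math.log10(p_size_present(n, draws))

print(round(expected_distinct()))  # expected number of distinct sizes
print(1 / p_size_present())        # the base of the 1/base^100 figure
print(log10_p_all_present())       # exponent of ten for the overall chance
```

The first number printed should round to 95163, the second comes out a little above 10.5, and the last is the power of ten for the overall false-positive chance, which lands somewhat below -100.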
The lesson here is that, if your online upload service knows your file sizes, it cannot be blindly trusted. Such services are only useful for keeping secret the files that you’ve generated yourself. They’re not useful for secretly storing files whose existence is known to the authorities. When you go about designing the ultimate encrypted peer-to-peer file cloud, please store data opaquely in fixed-size blocks that never get truncated. You can figure out how to do this efficiently. You can have a layer of indirection between uploaded chunks and individual files. And don’t even think about allowing block sizes to be different discrete sizes, such as powers of two. You know you’ll mess it up. Be sufficiently paranoid.
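Here is a rough sketch of what fixed-size blocks with a layer of indirection might look like. Everything in it is hypothetical: the names, the 4096-byte block size, and the in-memory index. Encryption is left out entirely; a real design would encrypt the blocks and store the index inside them as well.

```python
import os

BLOCK_SIZE = 4096  # hypothetical; the point is that there is exactly one size

def pack(files):
    """Pack a dict of name -> bytes into fixed-size blocks.

    Returns (blocks, index), where index maps each name to its
    (offset, length) within the concatenated stream.
    """
    stream = bytearray()
    index = {}
    for name, data in files.items():
        index[name] = (len(stream), len(data))
        stream += data
    # Pad the tail with random bytes so the last block is never truncated.
    stream += os.urandom(-len(stream) % BLOCK_SIZE)
    blocks = [bytes(stream[i:i + BLOCK_SIZE])
              for i in range(0, len(stream), BLOCK_SIZE)]
    return blocks, index

def unpack(blocks, index, name):
    """Recover one file from the blocks using the index."""
    offset, length = index[name]
    return b"".join(blocks)[offset:offset + length]

files = {"a.txt": b"x" * 5000, "b.txt": b"y" * 100}
blocks, index = pack(files)
print(len(blocks), {len(b) for b in blocks})
print(unpack(blocks, index, "b.txt") == files["b.txt"])
```

An observer of these blocks sees only how many there are, not where one file ends and the next begins. The total still leaks, which is one more reason to upload in large batches.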
Now let’s be even more paranoid. Upload files as rarely as possible. Don’t upload them eagerly. For example, if somebody’s slowly placing a set of illegal movies into a shared directory, eager scanning of a directory and eager uploading of individual files would reveal enough information about individual file sizes to provide an identifiable signature for the entire set. Somebody watching network activity live would notice pauses after each group of chunks that corresponds to a file. This would tell them the length of the file, rounded up to a multiple of whatever your chunk size is. You’d need to add absurd amounts of random padding to individual files to sufficiently obscure this information. Upload files in groups. It’s safer and less expensive that way. And while you’re at it, don’t just read-a-file-from-disk-and-upload-it, read-a-file-from-disk-and-upload-it, and so on. Are you sure the pauses between files are too small to observe on the network? What if the shared directory you’re uploading is really on a network file system? Are you sure? Read files concurrently and interleave their uploads. Be sufficiently paranoid.
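As a sketch of the interleaving idea, here is one way to emit chunks from several files round-robin, so that no run of consecutive chunks corresponds to a single file. The names are hypothetical, and the sketch leaves out things a real uploader would need: padding the final chunk of each file to full size, encrypting the file tag along with the chunk, and probably randomizing the order.

```python
def tagged_chunks(file_id, data, size):
    """Yield (file_id, chunk) pairs for one file, in order."""
    for i in range(0, len(data), size):
        yield file_id, data[i:i + size]

def interleave(files, size):
    """Yield chunks from all files round-robin until every file is exhausted."""
    streams = [tagged_chunks(fid, data, size) for fid, data in files.items()]
    while streams:
        still_open = []
        for stream in streams:
            item = next(stream, None)
            if item is not None:
                yield item
                still_open.append(stream)
        streams = still_open

files = {"a": b"AAAAAA", "b": b"BB", "c": b"CCCC"}
order = [fid for fid, _ in interleave(files, 2)]
print(order)  # ['a', 'b', 'c', 'a', 'c', 'a']
```

The receiver can reassemble each file by concatenating the chunks that carry its tag, since chunks for any one file still arrive in order.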
Follow-ups:
Estimating the number of distinct numbers — on how I got 95163.
Update:
I know, I’ve ignored the statement, “In general, anyone who already knows the contents of a file will be in a strong position to determine who else is uploading or downloading it.” That’s also a reason you can’t blindly trust it, and it obviates the filesize problem. I just wanted to talk about filesizes.