Archive for February, 2010

Helping People

Sunday, February 28th, 2010

Look at me.  I have a blog.  Look, my opinions are on it!  Read my opinions.  Revel in how witty I am.

I have opinions on a lot of things, and you should listen to them.  Hear me talk about programming.  Here, me talk about programming languages.  Revel in how witty I am.  Hear me talk about how you should live your life.  Hear me talk about the latest phenomena in modern internet society.  These are my opinions, and they are mine.

By all means, write a counter-posting, offering a point-by-point rebuttal to my points.  But make a link, pointing back to mine.  Don’t have a blog?  Well then, nobody cares about you.  You don’t exist.  You don’t matter.  You are nothing.  Get a blog.  And make a link to mine.

I have insights into the world that you could only dream of understanding.  Make a link to my posting on the latest news aggregators.

Post comments about my opinion on the news aggregators.  Is this a new username you’ve chosen?  You should think twice before having opinions.  Build up some cred by not having an opinion about anything.  Post factual rebuttals about mistakes of fact, but don’t even think about attacking the general conclusion.  That way, more people will listen to you, and so you will be better able to spread your intelligent insights into the populace.

HELPING PEOPLE

is the purpose of having a blog.  Your blog must be productive and help people understand the crazy world that we live in!

My blog is not “better” than yours, for how could one design a valid comparison function between blogs?  But yours will be improved when you make a link to mine.

I’m going to make more postings, acting like I have new ideas.  You will read them and be informed.  My posting is not self-aggrandic—it’s a humble quest to increase the probability that wayward net travelers come across something of substance, so that their lives, and mental health, and the accurateness of their opinions, and those of people around them, will be enhanced substantially by listening to what I have to say about programming.  Make a link.

My metadiscussion about blogging is more profound than any non-metadiscussion that you may have.  I’m now talking about metadiscussion, which makes this level-two metadiscussion.  Here is where a normal peon would make a joke about maxing out at level-seventy metadiscussion, but I do not limit my sights to topical references.  It is inevitable that the first seventy levels of metadiscussion be followed by a seventy-first, and so on.  Here we are, talking about enumerating the levels of metadiscussion.  We could now use ordinal numbers to identify the subsequent meta-levels, but in mentioning such an idea, we have surpassed all the ordinal numbers.

I hope you find it worth your time to strive to have a worldview similar to mine.

Writing Numbers

Saturday, February 27th, 2010

In my previous post, I discussed some stuff, and I wrote some numbers down.

1,000,000

95,163

100,000

91,000

90,909.090909090…

95,162.58

Before I inserted the commas into the numbers, I found that it was easy to make mistakes by writing 100000 instead of 1000000.  Generally, that was hard to read.  However, 91000 isn’t hard to read, and neither are 95162.58 or 95163.  So in the previous post I inserted the numbers.

The point of this post is that while previously, I always refrained from putting period separators into numbers, I have now changed my attitude, slightly.

I would also consider using thin spaces as the digit grouping separator, as suggested in Jukka K. Korpela’s epic, Math in HTML (and CSS).  However, that would take too much effort.

It would be nice to have a variant of Markdown that took sections bounded by dollar signs, as seen in LaTeX, and converted them to HTML and CSS in the ways that document recommends.

Estimating the number of distinct numbers

Friday, February 26th, 2010

In a previous post, I estimated that if you picked a hundred thousand random numbers from one to a million, independently, i.e. “with replacement,” as some call it, you could expect to get 95,163 different values. Maybe sometimes you’ll get more or less (since the outcome is random), but you’ll probably get something close to that number. That’s an approximate guess — I’m not sure if that’s really the closest integer to the true average value — and that’s because I’m an amateur when it comes to math. We might as well chronicle how you’d get to such a value.

This post is going to be about how to come up with approximations for things, when you don’t know the “real” way to get the right answer.

So, if you pick 100,000 numbers from one to a million, and if you pick them independently, how many distinct values do you expect? Well, you’d expect something like 100,000 distinct values in total, with an adjustment for duplicates. So for each value you picked, you’d think that they’d have a 10% chance of being a duplicate of another, since there’s about 99,999 other values for them to collide with, out of a million, and 99,999/1,000,000 ≈ 0.1. So we’d expect 10% of the values to be duplicates, and so we’d expect 90,000 distinct values to be produced.

But wait! We based our estimate of 90,000 on our assumption that our estimate would be about 100,000. Now we have a more accurate estimate, so we could make a new estimate based on that. Since 89,999/1,000,000 ≈ 0.09, we’d expect 9% of the values to be duplicates, and so we’d expect 91,000 distinct values to be produced.

With our new estimate, we could perform an another iteration to produce a more accurate estimate of the number we’re trying to calculate. Or we could set up an equation and solve it. Let’s call our estimate x. If so, we think the probability of a distinct value would be x/1,000,000. That means (x/1,000,000)*100,000 will give the number of picks that are duplicates. The number of distinct values (x) would be equal to the number of picks (100,000) minus the number of duplicates ((x/1,000,000)*100,000), giving us the equation x = 100,000 − (x/1,000,000)*100,000. Simplifying, we have x = 100,000 − x/10, which simplifies to 11x/10 = 100,000, and then x = 1,000,000/11 = 90,909.090909090…

So we’ve estimated that the number of distinct values is 90,909. Is that right? Let’s run a few simulations in Python:

>>> [len(set([random.randint(1, 1000000) for i in xrange(100000)])) for j in xrange(5)]
[95101, 95311, 95268, 95270, 95105]

It looks like we’re way off. You could say we’re off by a factor of two — we expected about twice as many duplicates as we got.

What went wrong? Let’s ask this: What did we assume? We assumed that each pick’s probability of getting a duplicate was equal to the number of total distinct values we’ll end up getting, divided by a million. Is that right, though? (Based on our simulation, apparently not.) We can answer why not by imagining ourselves picking a bunch of random numbers. What’s the chance of the first pick being a duplicate? Zero. What about the 50,000th pick? It would be approximately 50,000/1,000,000, i.e 5%. Then, on our 100,000th pick, the chance of it being a duplicate would be approximately 100,000/1,000,000, i.e. 10%. So the probability goes from 0% up to about 10% in a roughly linear fashion. The average is then 5%. So we can expect 5% of our picks to collide with existing picks, and so we predict 95,000 distinct values.

That’s closer. Of course, with that new estimate, instead of having the proportion go from 0% to 10%, we now can expect the proportion to go from 0% to 9.5%. And so we expect the average to be 4.75%. Taking 4.75% off of 100,000, we get 95,250. We could iterate this again, but let’s just solve an equation, like we did before. The solution to x = 100,000 – x/20 is 2,000,000/21, approximately 95238.

That seems like a reasonable guess. It’s just an estimate, though. Is it too high or is it too low? Right now we’re assuming the rate of getting duplicate elements grows linearly. Of course, it doesn’t. We pick the first 50,000 picks with a lower rate of getting duplicates than the second 50,000 picks, which means our total number of distinct elements will be greater than 95,238/2. We’ve already accounted for that fact — we expect 3/4 of our duplicate elements to be generated from between the 50,001st and 100,000th pick. But we haven’t accounted for the fact that the rate of getting duplicate elements is actually higher than we’d have expected! This means our estimate of 95,238 is an overestimate.

What shall we do then? Let’s calculate the expected number of distinct values after the first 50,000 picks, and then based on the number we get, calculate the expected number of distinct values for the next 50,000 picks.

First we use the same equation as before, but with 50,000 in place of 100,000. Let’s let w be the number of distinct elements after 50,000 picks. Simplifying w = 50,000 – (w/1,000,000)/2*50,000, we get w = 50,000 – w/40, i.e w = 2,000,000/41 ≈ 48,780.5.

Then, let x be the number of distinct elements after the next 50,000 picks. We’re expecting the average collision rate to be the average of x/1,000,000 and 4.87805%. So we have x = 48,780.5 + 50,000 – ((x/1,000,000 + 0.0487805)/2)*50,000. Simplifying that, we have x = 98,780.5 – (x+48,780.5)/40, i.e. 41x/40 = 98,780.5 – 48,780.5/40, i.e. x = (40*98,780.5 – 48,780.5)/41 ≈ 95,181.5.

That should be a better estimate. Let’s do some simulations. Let’s take some averages of 100 simulations.

>>> sum([len(set([g.randint(1, 1000000) for i in xrange(100000)])) for j in xrange(100)])/100.0
95160.339999999997
>>> sum([len(set([g.randint(1, 1000000) for i in xrange(100000)])) for j in xrange(100)])/100.0
95152.179999999993

It seems we’re still overestimating the number of distinct values. After all, each of our iterations produced an overestimate.

I suppose we could break up our approximation into 4 blocks and get a more precise value. Here’s an alternate choice. Set up a differential equation and solve it!

If the number of distinct values we have at a given time is y, then the chance of getting a collision on our next choice is y/1,000,000, and the chance of getting a non-collision is (1,000,000 – y)/1,000,000.

So, that’s our expected rate of change. And so we have the differential equation

y’ = (1,000,000 – y)/1,000,000 = 1 – y/1,000,000.

I’ve completely forgotten how to solve equatons like this, so let’s note that y = K*exp(-t/1,000,000) is the solution family to y’ = -y/1,000,000, and now we can try to tweak that a bit to get a solution to y’ = 1 – y/1,000,000 Just by common sense, we expect a function that decays to 1,000,000, so we can guess y = 1,000,000 + K*exp(-t/1,000,000). Given that there are 0 distinct values after 0 picks, we have the initial condition of y(0) = 0, and so we can solve for K, and thus y = 1,000,000 – 1,000,000*exp(-t/1,000,000).

So, if things go at our “expected rate of change,” we’ll expect them to track along that differential equation. There will be some random noise, but let’s see where this goes.

Plugging in t=100,000, we get 1,000,000 – 1,000,000*exp(-1/10) ≈ 95,162.58.

So that’s how I came up with 95,163, and it seemed close enough to the experimental evidence, so it was good enough for me.

Is it really right, though? Is our method of using a differential equation right? Well, it’s right in the sense that it gives a close answer. It’s a good idea — better than the weird approximations we were using earlier. But the truth is that we’re approximating a discrete process, making 100,000 independent picks, with a continuous function. It’s kind of like the reverse of finding a numerical approximation to the differential equation.

For a more revelatory example, suppose we try using this equation to give us an estimate for the number of distinct values after 1 pick. We see that 1,000,000.0 – 1,000,000*exp(-1/1,000,000) ≈ 0.99999949999619275. Instead of getting the right answer, which is 1, we get an approximation based on the idea that the growth rate of the function falls from 1 to 999,999/1,000,000 on the interval. It’s off by about half a millionth.

What about after two picks? Our formula gives 1.999997999984771. But the right answer is 1.999999. We’re off by one millionth, or two half-millionths — one for the first pick, and one for the next.

Generally speaking, on each timestep, we can expect the continuous approximation to take a half of a millionth off the growth rate, due to the continuous nature of the curve. To compensate for this effect, we might add an extra half-millionth factor to the final estimation. Multiplying 95,162.58 by (1 + 1/2,000,000), we get 95,162.63. That’s not enough to care about.

What if we wanted a really exact answer, though? We might say, okay, we’re going to solve this using Euler’s method, just like we’re supposed to, and we’ll update the expected value each time. We can do that.

>>> y = 0.0
>>> for i in xrange(100000):
...     y += (1000000.0 - y)/1000000.0
...
>>> y
95162.627205942103

Is that the exact, right answer, though? Is that the true expected value? Or have we found another approximation based on incorrect assumptions?

Remember that this is a random process, and that the future rate of change depends on the outcome of current random numbers.

Let’s play a game. Roll a die, and square the value. You might say (and you’d be wrong if you did) that the expected value of a dice roll is 3.5, so the expected value of its square must be 3.52, which is 12.25. You’d be wrong, though, because the expected value is 15.16666… That’s because the numbers around 3.5 spread out nonuniformly when you send them through the squaring operation. If you sent them through a doubling, instead of a squaring, their distances would separate uniformly, and the expected value would remain correct. With squaring, they separate nonuniformly, and the expected value doesn’t.

We can’t just run the expected value through a for loop and get the right expected value at the end. We could probably do some more precise reasoning to adjust for this error. But this blog entry is long enough.

I Saw the Ocean Yesterday

Thursday, February 25th, 2010

The Scripps Institution of Oceanography has some nice architecture.  Here’s a cactus photo:

Cacti Infrared

The streetlights seemed yellow, but apparently they emitted a lot of infrared.

Hide your filesizes, people.

Wednesday, February 24th, 2010

I read this BoingBoing post: Tahoe-LAFS: a P2P filesystem that lets you use the cloud without trusting it.

Can we trust it? Can we really trust it? Reading the architecture document for Tahoe-LAFS, I found this text:

In general, anyone who already knows the contents of a file will be in a strong position to determine who else is uploading or downloading it.

Also note that the file size and (when convergence is being used) a keyed hash of the plaintext are not protected.

Listen up, folks. A file storage service that reveals file sizes is one that you cannot blindly trust.

Consider the following scenario: You find yourself with a popular pirated movie file, whose size is 2143658709 bytes, and you decide to “securely” share it. The people spying on you can see that you’ve uploaded an encrypted file of this size to the cloud. What are the odds that you happened to coincidentally come across a file that’s exactly the same size? Slim. Sufficiently slim for a subpoena, perhaps, and bam, you could get busted by the copyright cops.

It gets worse. Suppose you’ve extracted a zip file of some illegal content. Even porn is illegal in some countries. The set of file sizes in that collection is, say, some set of values, and we don’t really care what they are. Call it {x1, …, xn}. The evil government watches you one day, and sees that you’ve uploaded a bunch of files, and, in its scan for illegal porn collections, notes that {x1, …, xn} is a subset of the files you’ve uploaded. They’ve caught you. With an astronomically small chance of a false positive.

The fact that you can’t hide those file sizes in the noise of other files is somewhat counter-intuitive. For example, suppose you have an illegal file collection with 100 text files. All their file sizes are between one and a million bytes in length. You upload that collection amid 99900 other files with randomly chosen lengths between one and a million. By the way, that’s an absurdly high amount of fluff files to be uploading. So now you’ve uploaded a hundred thousand files. Your small subset is safely obfuscated within all those other files, right? Wrong.

What’s the chance that a given set of 100000 numbers, chosen from one to a million, contains a certain subset of 100 numbers? It’s approximately 1/10100. That’s right — one over a googol. Each number between one and a million has a 10% chance of being included in the set of 100000 random file sizes. Actually, it’s worse than that, since some file sizes will be repeated. I think the expected number of distinct values is about 95163. That makes it more like 1/10.5100.

The lesson here is that, if your online upload service knows your file sizes, it cannot be blindly trusted. They’re only useful for keeping secret files that you’ve generated. They’re not useful for secretly storing files whose existence is known to the authorities. When you go about designing the ultimate encrypted peer-to-peer file cloud, please store data opaquely in fixed size blocks, that never get truncated. You can figure out how to do this efficiently. You can have a layer of indirection between uploaded chunks and individual files. And don’t even think about allowing block sizes to be different discrete sizes, such as power of two. You know you’ll mess it up. Be sufficiently paranoid.

Now let’s be even more paranoid. Upload files as rarely as possible. Don’t upload them eagerly. For example, if somebody’s slowly placing a set of illegal movies into a shared directory, eager scanning of a directory and eager uploading of individual files would reveal enough information about individual file sizes to provide an identifiable signature for the entire set. Somebody watching network activity live would notice pauses after each group of chunks that corresponds to a file. This would tell them information about the length of the file, rounded to the nearest whatever-your-chunk-size-is. You’d need to add absurd amounts of random padding to individual files to sufficiently obscure this information. Upload files in groups. It’s safer and less expensive that way. And while you’re at it, don’t just read-a-file-from-disk-and-upload-it, read-a-file-from-disk-and-upload-it, and so on. Are you sure the pauses between files are too small to observe on the network? What if the shared directory you’re uploading is really on a network file system? Are you sure? Read files concurrently and upload them interleavedly. Be sufficiently paranoid.

Follow-ups:

Estimating the number of distinct numbers — on how I got 95163.

Update:

I know, I’ve ignored the statement, “In general, anyone who already knows the contents of a file will be in a strong position to determine who else is uploading or downloading it.” That’s also a reason you can’t blindly trust it, and it obviates the filesize problem. I just wanted to talk about filesizes.

No Post Today 1

Tuesday, February 23rd, 2010

No post today, I’m working on something.

P.S. I’ve decided to create a Null category, for content-free posts like this one.

A Fun Math Problem

Monday, February 22nd, 2010

What is the infimum of n2|sin(n)|, over positive integers n?

Is it zero, or is it greater than zero?

Ok, it’s time to spoil the answer.

With some lack of rigor, I think the probability is 1 that the minimum value exists and is greater than zero.

Suppose two people play a number-picking game, where two players pick numbers, and then compare them against one another, infinitely many times. Suppose the first person picks a (real) number uniformly on [0,1] and the second person picks 1/2k for his kth choice. What’s the expected number of times the first person will pick a number less than the second person does, for the rest of time? The answer is 1 — there’s a 1/2 chance of that happening for k=1, a 1/4 chance for k=2, and so on, so the expected number of times is 1/2 + 1/4 + 1/8 + …, which equals 1.

If we let r(1), r(2), r(3), … be the sequence of random numbers the first person picks, we would therefore expect there to be one value of k for which r(k)/(1/2k) is less than 1. Maybe there are two, or three such values of k. The probability that there are infinitely many such values is zero.

Now consider |sin(n)|. For small values of ε, and random values of x the probability that |sin(x)| < ε is 2ε/π. That’s because the slope of |sin(_)| is 1 or -1 when it touches zero, and it touches zero at multiples of π. Now let’s ask ourselves: How many times is n2|sin(n)| less than 1? Well, that’s the same as playing a number guessing game where player one chooses |sin(n)| and player two chooses 1/n2. And |sin(n)| is effectively a random number function. What’s the probability of player one choosing a value smaller than player two? Approximately 2(1/n2)/π. Or (2/π)/n2. If we sum these probabilities, we get (2/π)(1/12 + 1/22 + 1/32 + …) which equals (2/π)(π2/6) which equals π/3. So the expected number of times n2|sin(n)| is less than 1 is about 1.04. This is not accurate — for example, we assumed small values of ε when computing 2ε/π, and early on in the sequence, we have large values of ε, but we would still expect the expected number of times n2|sin(n)| is less than 1 to be finite.

What does this mean? It means that, assuming the fractional parts of multiples of π are random-looking, n2|sin(n)| is less than 1 only for finitely many positive integers n. Which means that of this finite number, there is a minimum value. One that is of course greater than zero. So the infimum is not zero.

I’d say there’s a good chance that sin(1) is the only value in the sequence n2|sin(n)| that is less than 1. If we check the first ten million values in the sequence, we find that sin(1) is the only one so far.

Blogging Categories

Sunday, February 21st, 2010

The following describes the categorization practices of this blog.

Entries are categorized in two dimensions: the actual category, and the boring/cool descriptor.  The Cool category is for things that would be worthy of inclusion into http://sn.printf.net/, even if it weren’t exactly on topic.  The Boring category is for things that would not be.  (Also, The Cool category doesn’t actually mean the post is cool, or that I’m cool.)  The “actual category” describes the content of the post.

Generally speaking, the Boring posts are more for my edification than yours.  The Cool posts are too, but less vastly so.

Attention Span

Saturday, February 20th, 2010

It’s very hard to maintain your attention span on the internet.

I’ve lately turned my addiction to all things internet into a combination of addiction and allergy.  I get bored, automatically type reddit.com into the address bar, then read a few pages, then say “augh, I’m wasting my time,” then leave, and then get bored again, and the cycle continues.  It’s a horrible way not to get anything done.  I’ve decided to force myself into a different form of procrastination.  I’ll procrastinate by reading.  And then I’ll feel bad about reading, and go back to work.  But it won’t feel so bad.  Another way to procrastinate is by writing blog entries.  Or by padding them out.

So.  That’s the attention span situation for me, right now.  Thanks for listening, internet.

The iPad and Tiling Window Managers

Friday, February 19th, 2010

Here’s a prediction for you.  The iPad will eventually be able to run multiple applications at the same time.  Also, it will run them in a tiling window manager.

Does anything else need to be said?  Is this not obvious?  A tiling window manager would be the ideal way for apps on the iPad to be run concurrently.  Turn the iPad sideways, open two apps, have the iPad tell them they’re oriented vertically, and you’re done!  This would be the conservative way to do things, that doesn’t let users muck things up by having overlapping windows.  So that’s the way I’d expect Apple to go with this.

Here’s a less hopeful prediction: The iPad will eventually let you run two apps side by side, but not in any other configuration.

But some day, the iPad or some other kind of tablet will have a tiling window manager.  You just watch.