Estimating the number of distinct numbers

In a previous post, I estimated that if you picked a hundred thousand random numbers from one to a million, independently, i.e. “with replacement,” as some call it, you could expect to get 95,163 different values. Maybe sometimes you’ll get more or less (since the outcome is random), but you’ll probably get something close to that number. That’s an approximate guess — I’m not sure if that’s really the closest integer to the true average value — and that’s because I’m an amateur when it comes to math. We might as well chronicle how you’d get to such a value.

This post is going to be about how to come up with approximations for things, when you don’t know the “real” way to get the right answer.

So, if you pick 100,000 numbers from one to a million, and if you pick them independently, how many distinct values do you expect? Well, you’d expect something like 100,000 distinct values in total, with an adjustment for duplicates. So for each value you picked, you’d think that they’d have a 10% chance of being a duplicate of another, since there’s about 99,999 other values for them to collide with, out of a million, and 99,999/1,000,000 ≈ 0.1. So we’d expect 10% of the values to be duplicates, and so we’d expect 90,000 distinct values to be produced.

But wait! We based our estimate of 90,000 on our assumption that our estimate would be about 100,000. Now we have a more accurate estimate, so we could make a new estimate based on that. Since 89,999/1,000,000 ≈ 0.09, we’d expect 9% of the values to be duplicates, and so we’d expect 91,000 distinct values to be produced.

With our new estimate, we could perform an another iteration to produce a more accurate estimate of the number we’re trying to calculate. Or we could set up an equation and solve it. Let’s call our estimate x. If so, we think the probability of a distinct value would be x/1,000,000. That means (x/1,000,000)*100,000 will give the number of picks that are duplicates. The number of distinct values (x) would be equal to the number of picks (100,000) minus the number of duplicates ((x/1,000,000)*100,000), giving us the equation x = 100,000 − (x/1,000,000)*100,000. Simplifying, we have x = 100,000 − x/10, which simplifies to 11x/10 = 100,000, and then x = 1,000,000/11 = 90,909.090909090…

So we’ve estimated that the number of distinct values is 90,909. Is that right? Let’s run a few simulations in Python:

>>> [len(set([random.randint(1, 1000000) for i in xrange(100000)])) for j in xrange(5)]
[95101, 95311, 95268, 95270, 95105]

It looks like we’re way off. You could say we’re off by a factor of two — we expected about twice as many duplicates as we got.

What went wrong? Let’s ask this: What did we assume? We assumed that each pick’s probability of getting a duplicate was equal to the number of total distinct values we’ll end up getting, divided by a million. Is that right, though? (Based on our simulation, apparently not.) We can answer why not by imagining ourselves picking a bunch of random numbers. What’s the chance of the first pick being a duplicate? Zero. What about the 50,000th pick? It would be approximately 50,000/1,000,000, i.e 5%. Then, on our 100,000th pick, the chance of it being a duplicate would be approximately 100,000/1,000,000, i.e. 10%. So the probability goes from 0% up to about 10% in a roughly linear fashion. The average is then 5%. So we can expect 5% of our picks to collide with existing picks, and so we predict 95,000 distinct values.

That’s closer. Of course, with that new estimate, instead of having the proportion go from 0% to 10%, we now can expect the proportion to go from 0% to 9.5%. And so we expect the average to be 4.75%. Taking 4.75% off of 100,000, we get 95,250. We could iterate this again, but let’s just solve an equation, like we did before. The solution to x = 100,000 – x/20 is 2,000,000/21, approximately 95238.

That seems like a reasonable guess. It’s just an estimate, though. Is it too high or is it too low? Right now we’re assuming the rate of getting duplicate elements grows linearly. Of course, it doesn’t. We pick the first 50,000 picks with a lower rate of getting duplicates than the second 50,000 picks, which means our total number of distinct elements will be greater than 95,238/2. We’ve already accounted for that fact — we expect 3/4 of our duplicate elements to be generated from between the 50,001st and 100,000th pick. But we haven’t accounted for the fact that the rate of getting duplicate elements is actually higher than we’d have expected! This means our estimate of 95,238 is an overestimate.

What shall we do then? Let’s calculate the expected number of distinct values after the first 50,000 picks, and then based on the number we get, calculate the expected number of distinct values for the next 50,000 picks.

First we use the same equation as before, but with 50,000 in place of 100,000. Let’s let w be the number of distinct elements after 50,000 picks. Simplifying w = 50,000 – (w/1,000,000)/2*50,000, we get w = 50,000 – w/40, i.e w = 2,000,000/41 ≈ 48,780.5.

Then, let x be the number of distinct elements after the next 50,000 picks. We’re expecting the average collision rate to be the average of x/1,000,000 and 4.87805%. So we have x = 48,780.5 + 50,000 – ((x/1,000,000 + 0.0487805)/2)*50,000. Simplifying that, we have x = 98,780.5 – (x+48,780.5)/40, i.e. 41x/40 = 98,780.5 – 48,780.5/40, i.e. x = (40*98,780.5 – 48,780.5)/41 ≈ 95,181.5.

That should be a better estimate. Let’s do some simulations. Let’s take some averages of 100 simulations.

>>> sum([len(set([g.randint(1, 1000000) for i in xrange(100000)])) for j in xrange(100)])/100.0
95160.339999999997
>>> sum([len(set([g.randint(1, 1000000) for i in xrange(100000)])) for j in xrange(100)])/100.0
95152.179999999993

It seems we’re still overestimating the number of distinct values. After all, each of our iterations produced an overestimate.

I suppose we could break up our approximation into 4 blocks and get a more precise value. Here’s an alternate choice. Set up a differential equation and solve it!

If the number of distinct values we have at a given time is y, then the chance of getting a collision on our next choice is y/1,000,000, and the chance of getting a non-collision is (1,000,000 – y)/1,000,000.

So, that’s our expected rate of change. And so we have the differential equation

y’ = (1,000,000 – y)/1,000,000 = 1 – y/1,000,000.

I’ve completely forgotten how to solve equatons like this, so let’s note that y = K*exp(-t/1,000,000) is the solution family to y’ = -y/1,000,000, and now we can try to tweak that a bit to get a solution to y’ = 1 – y/1,000,000 Just by common sense, we expect a function that decays to 1,000,000, so we can guess y = 1,000,000 + K*exp(-t/1,000,000). Given that there are 0 distinct values after 0 picks, we have the initial condition of y(0) = 0, and so we can solve for K, and thus y = 1,000,000 – 1,000,000*exp(-t/1,000,000).

So, if things go at our “expected rate of change,” we’ll expect them to track along that differential equation. There will be some random noise, but let’s see where this goes.

Plugging in t=100,000, we get 1,000,000 – 1,000,000*exp(-1/10) ≈ 95,162.58.

So that’s how I came up with 95,163, and it seemed close enough to the experimental evidence, so it was good enough for me.

Is it really right, though? Is our method of using a differential equation right? Well, it’s right in the sense that it gives a close answer. It’s a good idea — better than the weird approximations we were using earlier. But the truth is that we’re approximating a discrete process, making 100,000 independent picks, with a continuous function. It’s kind of like the reverse of finding a numerical approximation to the differential equation.

For a more revelatory example, suppose we try using this equation to give us an estimate for the number of distinct values after 1 pick. We see that 1,000,000.0 – 1,000,000*exp(-1/1,000,000) ≈ 0.99999949999619275. Instead of getting the right answer, which is 1, we get an approximation based on the idea that the growth rate of the function falls from 1 to 999,999/1,000,000 on the interval. It’s off by about half a millionth.

What about after two picks? Our formula gives 1.999997999984771. But the right answer is 1.999999. We’re off by one millionth, or two half-millionths — one for the first pick, and one for the next.

Generally speaking, on each timestep, we can expect the continuous approximation to take a half of a millionth off the growth rate, due to the continuous nature of the curve. To compensate for this effect, we might add an extra half-millionth factor to the final estimation. Multiplying 95,162.58 by (1 + 1/2,000,000), we get 95,162.63. That’s not enough to care about.

What if we wanted a really exact answer, though? We might say, okay, we’re going to solve this using Euler’s method, just like we’re supposed to, and we’ll update the expected value each time. We can do that.

>>> y = 0.0
>>> for i in xrange(100000):
...     y += (1000000.0 - y)/1000000.0
...
>>> y
95162.627205942103

Is that the exact, right answer, though? Is that the true expected value? Or have we found another approximation based on incorrect assumptions?

Remember that this is a random process, and that the future rate of change depends on the outcome of current random numbers.

Let’s play a game. Roll a die, and square the value. You might say (and you’d be wrong if you did) that the expected value of a dice roll is 3.5, so the expected value of its square must be 3.52, which is 12.25. You’d be wrong, though, because the expected value is 15.16666… That’s because the numbers around 3.5 spread out nonuniformly when you send them through the squaring operation. If you sent them through a doubling, instead of a squaring, their distances would separate uniformly, and the expected value would remain correct. With squaring, they separate nonuniformly, and the expected value doesn’t.

We can’t just run the expected value through a for loop and get the right expected value at the end. We could probably do some more precise reasoning to adjust for this error. But this blog entry is long enough.

Comments are closed.