Visualizing Duplicates and Triplicates

This is a followup to the previous post, Estimating the number of distinct numbers.  In that post, we ask how many distinct values we’ll get when making 100,000 independent picks from the integers between 1 and 1,000,000.  The answer we got was approximate, non-rigorous, and 95,163.

I began by documenting my first attempt to solve the problem, which ended up going sorely wrong.  We let x be the number of distinct values we’d end up getting.  Based on the reasoning that x/1,000,000 is the probability of a given pick being a duplicate, we solved the equation

x = 100,000 − (x/1,000,000)*100,000

which gives the solution

x = 90,909.090909090909090…

This ended up being wrong.  But the reasoning sounded so right!  What went wrong there?  We subtracted the number of duplicates from the number of picks, and wouldn’t that give us the answer?

No.  It wouldn’t.  Let’s look back.

We correctly surmised that (x/1,000,000) would be the expected proportion of picks that were duplicates, since there were x distinct values, and the probability of any pick hitting a set of size x, out of a set of size 1,000,000, is x/1,000,000.

We then went and applied that proportion to an equation describing the number of distinct numbers picked.

That was wrong.  There’s the picks, and there’s the numbers that get picked, and they are two different things.  We took a proportion describing the domain, and applied it to the range.

Another way of looking at the problem is by looking at the actual way the proportions relate:  When a number is picked twice, 2 of the picks get counted as “picks that ended up being a duplicate number,” while 1 distinct value gets created.  Our equation above assumed that 0 distinct values would get created, instead of crediting pairs of duplicate picks for the one distinct value that they produce.  Later, when we divided the proportion by 2 (for a different reason), we ended up correcting for this error.  The correction would have been exact, if it were not for the presence of triplicates.  Here’s a drawing.

As we pick numbers, we start at the bottom and follow the curve upward, counting each new pick we get.  Each horizontal row represents some number that has been picked, and as the curve doubles back, that means the number got picked again.  Since we have duplicates we’d “turn around” before reaching 100K and go in the opposite direction, covering a few rows again — they represent numbers that got picked twice.  The value A describes the number of picks we’d get that end up being duplicates, and B describes the number of distinct values we get.  C describes the number of values that end up being picked more than once.  C ends up being approximately half of A.  It would be exactly half of A, if there were no triplicates.  But we have to consider those, and the curve curls back up to represent triplicates.  There should probably be another curl, a fraction of a pixel in length, back down, to represent quadruplicates.  We’ll approximately calculate the actual number of triplicates we expect, and quadruplicates, and such, tomorrow, since I’m sick of blogging today.

Comments are closed.