binary trees, recurrence relations, and the binomial distribution: jpfed

jpfed

binary trees, recurrence relations, and the binomial distribution

Aug 15, 2007 23:35

So... I have made progress towards solving a problem today.

Imagine a node. We label this node as being of "Generation 1". Now, with probability p, this node may generate a new node (the "left child") with generation 2. Independently with probability p, it may generate a new node (the "right child") also with generation 2.

We start in generation 1, and generate all generation 2 nodes. If there are any generation 2 nodes, we generate the generation 3 nodes. If there are any of those, we generate the generation 4 nodes, etc. In doing this, we have randomly generated what is called a "binary tree".

How deep will the tree be? For any given value of p, there are an infinite number of answers to this question. For nonzero values of p, any depth of tree is possible (but some will be exceedingly unlikely). The function that takes in "a value of p" and "a tree depth" and gives you back "what the probability is that the generated tree will be that depth" is one example of a discrete probability distribution.

But what function actually describes the probability distribution that results from our randomly- generated trees? I whipped up a computer simulation today that tentatively suggests that this function is the geometric distribution with parameter 1-2*p (or perhaps (1-p)^2; there's a fuzziness to empirical work like this). Surely, we would expect this to match the first generation or two, but it should get more hairy the further you go... or would it? I'm going to have this thing run a gajillion trials and record them until I can write some software to fit a curve to the data.

One little thing to note: this problem can be described in a very interesting way seemingly unrelated to binary trees.

It would be natural for a programmer to get distracted by the specific structure of the generated binary trees, but the structure is almost irrelevant. As the tree grows, all that matters is the number of nodes in the latest generation. This number can be modeled by a strange sort of recurrence relation:

If N(g) is the number of nodes in generation g, then N(g+1) is a random variable distributed as Binomial(2*N(g),p).

My computer simulation actually does its work using this simplification. Now, my current method for generating Binomial random numbers is kind of slow, but once I implement a very fast method, I'll be able to look at what happens when p is in the neighborhood of 0.5 (or greater).