# 6805: 1094 Activities 15

Write your name and answers on this sheet and hand it in at the end.

## Checking the sum and product rules, and their consequences

Goal: Check using a very simple example that the Bayesian rules (on slide 1) are consistent with standard probabilities based on frequencies.

**Table 1**

|       | Blue | Brown | Total |
|-------|------|-------|-------|
| Tall  | 1    | 17    | 18    |
| Short | 37   | 20    | 57    |
| Total | 38   | 37    | 75    |

**Table 2**

|       | Blue | Brown | Total |
|-------|------|-------|-------|
| Tall  |      |       |       |
| Short |      |       |       |
| Total |      |       |       |
1. Table 1 shows the number of blue- or brown-eyed and tall or short individuals in a population of 75. Fill in the blanks in Table 2 with probabilities (in decimals, not fractions) based on the usual "frequentist" interpretation of probability (which would say that the probability of randomly drawing an ace from a deck of cards is 4/52 = 1/13). Circle the row and/or column that illustrates the sum rule on slide 1.

2. What is pr(short,blue)? Is this a joint or conditional probability? What is pr(blue)? From the product rule, what is pr(short|blue)? Can you read this result directly from the table?

3. Apply Bayes' theorem to find pr(blue|short) from your answers to the last part.

4. What rule does the second row (the one starting with "short") illustrate? Write it out in "pr()" notation.

5. Are the probabilities of being tall and having brown eyes mutually independent? Why or why not?
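The bookkeeping above can be checked numerically. Below is a minimal Python sketch (variable names are my own) that builds the joint probabilities from the Table 1 counts and verifies the sum rule, product rule, and Bayes' theorem:

```python
# Check the sum rule, product rule, and Bayes' theorem using the
# counts from Table 1 (a sketch; names are illustrative, not from the handout).
counts = {
    ("tall", "blue"): 1, ("tall", "brown"): 17,
    ("short", "blue"): 37, ("short", "brown"): 20,
}
N = sum(counts.values())  # 75 individuals in total

# Joint probabilities from frequencies: pr(height, eyes) = count / N
pr = {key: n / N for key, n in counts.items()}

# Marginals via the sum rule: pr(eyes) = sum over heights of pr(height, eyes)
pr_blue = pr[("tall", "blue")] + pr[("short", "blue")]
pr_short = pr[("short", "blue")] + pr[("short", "brown")]

# Product rule: pr(short, blue) = pr(short | blue) * pr(blue)
pr_short_given_blue = pr[("short", "blue")] / pr_blue

# Bayes' theorem: pr(blue | short) = pr(short | blue) * pr(blue) / pr(short)
pr_blue_given_short = pr_short_given_blue * pr_blue / pr_short

print(round(pr_short_given_blue, 4))  # 37/38, also readable directly from the table
print(round(pr_blue_given_short, 4))  # 37/57
```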

## Solving a standard medical example problem by applying the Bayesian rules of probability

Goal: Use the Bayesian rules to solve a familiar problem.

Suppose there is an unknown disease (UD) and there is a test for it.

1. The false positive rate is 2.3%. ("False positive" means the test says you have UD, but you don't.)
2. The false negative rate is 1.4%. ("False negative" means you have UD, but the test says you don't.)
Assume that 1 in 10,000 people have the disease. You are given the test and get a positive result. Your ultimate goal is to find the probability that you actually have the disease. We'll do it using the Bayesian rules.

We'll use the notation:

• H = "you have UD"
• H̄ = "you do not have UD"
• D = "you test positive for UD"
• D̄ = "you test negative for UD"
1. Before doing a calculation (or thinking too hard :), does your intuition tell you the probability you have the disease is high or low?

2. In the "pr()" notation, what is your ultimate goal?

3. Express the false positive rate in "pr()" notation.

4. Express the false negative rate in "pr()" notation. By applying the sum rule, what do you also know? (If you get stuck answering the question, do the next part first.)

5. Should pr(D|H) + pr(D̄|H) = 1? Should pr(D|H) + pr(D|H̄) = 1? (Hint: does the sum rule apply on the left or right of the |?)

6. Apply Bayes' theorem to your expression for the ultimate goal (don't put in numbers yet). Why is this a useful thing to do here?

7. Let's find the other results we need. What is pr(H)? What is pr(H̄)?

8. Finally, we need pr(D). Apply marginalization first, and then the product rule twice to get an expression for pr(D) in terms of quantities we know.

9. Now plug in numbers into Bayes' theorem and calculate the result. What do you get?
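As a check on the hand calculation, here is a minimal Python sketch of the whole chain (my notation: `Hbar` = "no UD", so pr(D|H̄) is the false positive rate and pr(D̄|H) the false negative rate):

```python
# Sketch of the medical-test calculation with the numbers from the handout.
pr_H = 1 / 10_000          # prior: 1 in 10,000 people have the disease
pr_Hbar = 1 - pr_H
pr_D_given_Hbar = 0.023    # false positive rate: pr(D | Hbar)
pr_Dbar_given_H = 0.014    # false negative rate: pr(Dbar | H)
pr_D_given_H = 1 - pr_Dbar_given_H   # sum rule, applied to the left of the |

# Marginalization plus the product rule:
#   pr(D) = pr(D|H) pr(H) + pr(D|Hbar) pr(Hbar)
pr_D = pr_D_given_H * pr_H + pr_D_given_Hbar * pr_Hbar

# Bayes' theorem: pr(H|D) = pr(D|H) pr(H) / pr(D)
pr_H_given_D = pr_D_given_H * pr_H / pr_D
print(f"{pr_H_given_D:.2%}")  # well under 1%, despite the positive test
```

The small prior dominates: even a fairly accurate test yields a posterior probability of disease below half a percent.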

## Estimating the location of a radioactive source (the lighthouse problem)

Goal: Explore a classic problem from Gull (by way of Sivia's book).

In the figure, a radioactive source that emits gamma rays randomly in time but uniformly in angle is placed at (x0, y0). The gamma rays are detected on the x-axis and these positions are saved, xk, k = 1,...,N. Given these positions, the problem is to estimate the location of the source. We'll assume we know that y0 = 1 (in whatever length units we are using), so our goal is to estimate x0. The angle θ is between the γ ray and the y-axis in the figure.
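The geometry can be simulated directly: a ray leaving at angle θ from the y-axis hits the detector at x = x0 + y0 tan θ. Here is a Python sketch (the notebook itself is Mathematica; the true position (1, 1) is my own illustrative choice):

```python
import math
import random

# Sketch: simulate detected positions for a source at (x0, y0) = (1, 1).
# Each gamma ray leaves at an angle theta, uniform on (-pi/2, pi/2),
# and strikes the x-axis at x = x0 + y0 * tan(theta).
random.seed(0)
x0_true, y0 = 1.0, 1.0

def detect(n):
    """Return n detected positions x_k on the x-axis."""
    return [x0_true + y0 * math.tan(random.uniform(-math.pi / 2, math.pi / 2))
            for _ in range(n)]

xs = detect(1000)
# Heavy tails show up as occasional enormous |x_k| values:
print(max(abs(x) for x in xs))
```

Angles near ±π/2 send rays almost parallel to the detector, which is why a handful of detections land extremely far from the source.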

1. Claim: in the pr() notation, our goal is to find the posterior PDF pr(x0 | {xk}, y0). How would you translate this posterior to words?

2. By Bayes' theorem, how is this posterior related to pr({xk} | x0, y0), pr(x0 | y0), and pr({xk} | y0)?

3. Claim: because the denominator PDF is independent of x0, it is just a normalization factor for
pr(x0 | {xk}, y0), so we don't need to calculate it explicitly. Do you understand this? What good is an unnormalized posterior pr(x0 | {xk}, y0)?

4. If we take for the prior PDF pr(x0 | y0) that pr(x0 | y0) = pr(x0) = 1/|x0,max − x0,min| for x0,min < x0 < x0,max and zero elsewhere, what are we assuming? Why is this more plausible than letting x0 be anything? Why do we assume a constant PDF? Is this PDF normalized?

5. If we assume that the xk's are mutually independent, then how is pr({xk} | x0, y0) simplified? Is this a justifiable assumption?

6. Show that

   pr(xk | x0, y0) = (1/π) · y0 / (y0² + (xk − x0)²),

   given that the angular distribution of θk is uniform from −π/2 to +π/2, so pr(θk | x0, y0) = 1/π, and also that pr(θk | x0, y0) dθk = pr(xk | x0, y0) dxk.

7. Ok, now we're ready to see what the estimates for x0 look like. Open the Mathematica notebook Bayesian_games_part1.nb.
1. For this notebook we assume y0 = 1 is known. We are trying to estimate x0, whose true value is 1 (we don't use that in the notebook). Run this section.
2. Look up "CauchyDistribution" in the Mathematica Help to verify it is the same function derived above. Run the "Generate a set of random x points" section several times to see the fluctuations in the distribution. What can you say about the tails of this distribution compared to your experience with Gaussian distributions?

3. Run the section on the "Posterior for a single x measurement" several times to see how much it can change. Would taking the maximum or the mean of these distributions give us a useful estimate of x0?

4. In this section the posterior for x0 is calculated and plotted for different numbers of data. The prior is taken to be a uniform PDF from -4 to 4 (we really don't believe it is bigger than that, but otherwise we don't know what it is). For each Nmax, besides plotting the posterior for x0, we calculate the maximum of the posterior, the mean of the posterior ⟨x0⟩, and the mean x̄ of the set of Nmax points. Run this section several times and record the results in the table:
| Nmax | Run 1: ⟨x0⟩ | Run 1: x̄ | Run 2: ⟨x0⟩ | Run 2: x̄ | Run 3: ⟨x0⟩ | Run 3: x̄ |
|------|-------------|-----------|-------------|-----------|-------------|-----------|
| 1    |             |           |             |           |             |           |
| 2    |             |           |             |           |             |           |
| 4    |             |           |             |           |             |           |
| 16   |             |           |             |           |             |           |
| 64   |             |           |             |           |             |           |
| 256  |             |           |             |           |             |           |

What are your observations about the posterior for x0 as a function of Nmax, and which mean is the better estimate?
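For comparison with the notebook, the posterior can be evaluated on a grid in plain Python. This is a sketch under my own assumptions (true x0 = 1, a 0.01-wide grid on the prior range (−4, 4)), not a reproduction of the notebook:

```python
import math
import random

# Sketch: grid evaluation of the posterior for x0 with a uniform prior
# on (-4, 4). Data are Cauchy-distributed detector positions, true x0 = 1.
random.seed(1)
x0_true, y0 = 1.0, 1.0
data = [x0_true + y0 * math.tan(random.uniform(-math.pi / 2, math.pi / 2))
        for _ in range(256)]

grid = [-4 + 8 * i / 800 for i in range(801)]

def log_like(x0):
    """Log-likelihood: sum_k log pr(x_k | x0, y0), dropping x0-independent terms."""
    return sum(-math.log(y0**2 + (xk - x0)**2) for xk in data)

logL = [log_like(x0) for x0 in grid]
m = max(logL)                                 # subtract the max for stability
post = [math.exp(l - m) for l in logL]        # unnormalized posterior
norm = sum(post)
post = [p / norm for p in post]               # normalize on the grid

post_mean = sum(x * p for x, p in zip(grid, post))   # posterior mean <x0>
sample_mean = sum(data) / len(data)                  # mean of the data points
print(post_mean, sample_mean)
```

With heavy-tailed Cauchy data the posterior mean stays close to the true x0, while the mean of the data points can be dragged far away by a single extreme detection.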

## [Extra] Maximum entropy and prior PDFs

Goal: Derive the prior probability distribution function (PDF) on slide 12.

1. On slide 12 of Bayesian_statistics_basics.pdf, an expression for the entropy corresponding to the probability distribution function pr(x) is given. The idea of the maximum entropy approach to priors is that one should assume only what is known, and no more. This is achieved by maximizing the entropy subject to appropriate constraints. Note that this determines the entire function pr(x), as opposed to a single value of a function. This maximization is carried out using Lagrange multipliers. Are you familiar with using Lagrange multipliers? How about taking functional derivatives?

2. Here's a quick Lagrange multiplier problem to remind you. Find the extrema of F(x,y) = x²y − ln(x) subject to 8x + 3y = a. To do this, we take the partial derivatives of F(x,y) − λ(8x + 3y − a) with respect to x, y, and λ. This gives three equations in three unknowns, which we solve to find the (x,y) points of the extrema.
1. Find the three equations.

2. Use the Mathematica Solve command to find solutions when a=0. The only real solution (which is a relative minimum) is x = −1/2, y = 4/3. What did you get?

3. Assuming m(a) is a constant, find the functional derivative of Q with respect to pr(a | M,R) and set it to zero.

4. Find the (ordinary) partial derivatives with respect to λ0 and λ1, and set them each to zero.

5. Eliminate the λi dependence to verify the result given on slide 12 for pr(a | M,R).
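As a quick check outside Mathematica, the stated solution to the warm-up Lagrange problem can be verified by plugging it into the three stationarity equations (a Python sketch; it checks only that the equations are satisfied, not the character of the extremum):

```python
# Numerical check of the Lagrange multiplier example:
# F(x, y) = x^2 * y - ln(x) with constraint 8x + 3y = a, at a = 0.
# Setting the partials of F - lam*(8x + 3y - a) to zero gives:
#   d/dx:   2*x*y - 1/x - 8*lam = 0
#   d/dy:   x^2 - 3*lam = 0
#   d/dlam: 8*x + 3*y - a = 0
# (The equations themselves are well defined for x < 0, even though
#  ln(x) is not; only stationarity is being checked here.)
x, y = -0.5, 4.0 / 3.0
lam = x**2 / 3            # from the second equation
a = 0.0

eq1 = 2 * x * y - 1 / x - 8 * lam
eq2 = x**2 - 3 * lam
eq3 = 8 * x + 3 * y - a
print(eq1, eq2, eq3)  # all numerically zero
```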