# Recitation 1

Date posted: 11-Feb-2016. Category: Documents.


### Transcript of Recitation 1

Recitation 1: Probability Review

Parts of the slides are from previous years' recitations and lecture notes.

Basic Concepts

A sample space S is the set of all possible outcomes of a conceptual or physical, repeatable experiment. (S can be finite or infinite.)

E.g., S may be the set of all possible outcomes of a die roll: $S = \{1, 2, 3, 4, 5, 6\}$.

An event A is any subset of S. E.g., A = the event that the die roll is < 3.

Probability

A probability P(A) is a function that maps an event A onto the interval [0, 1]. P(A) is also called the probability measure or probability mass of A.

[Figure: Venn diagram separating the worlds in which A is true (inside the oval) from the worlds in which A is false; P(A) is the area of the oval.]

Kolmogorov Axioms

- All probabilities are between 0 and 1: $0 \le P(A) \le 1$
- $P(S) = 1$
- $P(\emptyset) = 0$
- The probability of a disjunction is given by $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
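The axioms can be checked by brute force on the die example. The sketch below assumes a fair die (the slides do not say so) and introduces a hypothetical second event B, "the roll is even", purely for illustration.

```python
from fractions import Fraction

# Sample space of one die roll. A fair die is an assumption made here.
S = {1, 2, 3, 4, 5, 6}


def prob(event):
    """P(event) under the uniform measure on S; an event is a subset of S."""
    return Fraction(len(event & S), len(S))


A = {s for s in S if s < 3}        # the event from the slides: the roll is < 3
B = {s for s in S if s % 2 == 0}   # hypothetical second event: the roll is even

# P(S) = 1 and P(empty set) = 0
assert prob(S) == 1 and prob(set()) == 0
# Disjunction rule: P(A or B) = P(A) + P(B) - P(A and B)
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)

print(prob(A), prob(B), prob(A | B))  # 1/3 1/2 2/3
```

Note that `prob` takes a set of outcomes, not a single outcome: probability is assigned to events.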

[Figure: Venn diagram of events A and B and their overlap.]

Random Variable

A random variable is a function that associates a unique number with every outcome of an experiment.

- Discrete r.v.:
  - The outcome of a die roll: D = {1, 2, 3, 4, 5, 6}
  - Binary event and indicator variable: seeing a "6" on a toss ⇒ X = 1, otherwise X = 0. This describes the true or false outcome of a random event.
- Continuous r.v.:
  - The outcome of observing the measured location of an aircraft

[Figure: the random variable X maps each outcome ω in the sample space S to a number X(ω).]

Probability distributions

For each value that the r.v. X can take, assign a number in [0, 1]. Suppose X takes values $v_1, \dots, v_n$. Then $P(X = v_1) + \dots + P(X = v_n) = 1$.

Intuitively, the probability of X taking value $v_i$ is the frequency of getting the outcome represented by $v_i$.

Bernoulli distribution: Ber(p)

Binomial distribution: Bin(n,p)

Suppose a coin with head prob. p is tossed n times. What is the probability of getting k heads?

How many ways can you get k heads in a sequence of k heads and n-k tails?
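Both questions have standard answers: there are $\binom{n}{k}$ such sequences, each with probability $p^k (1-p)^{n-k}$, so $P(k) = \binom{n}{k} p^k (1-p)^{n-k}$. A minimal sketch follows; the values of n and p are illustrative assumptions, not from the slides.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(exactly k heads in n independent tosses), each landing heads with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, 0.3  # illustrative values only
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

print(comb(n, 3))  # number of sequences with 3 heads and 7 tails: 120
print(sum(pmf))    # the probabilities over k = 0..n sum to 1, up to rounding
```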

Discrete Distributions

Continuous Prob. Distribution

A continuous random variable X is defined on a continuous sample space: an interval on the real line, a region in a high-dimensional space, etc.

- X usually corresponds to a real-valued measurement of some property, e.g., length, position, …
- It is meaningless to talk about the probability of the random variable assuming a particular value: P(x) = 0
- Instead, we talk about the probability of the random variable assuming a value within a given interval, or half interval, etc.

Probability Density

If the prob. of x falling into [x, x+dx] is given by p(x)dx for dx → 0, then p(x) is called the probability density over x.

The probability of the random variable assuming a value within some given interval from x1 to x2 is equivalent to the area under the graph of the probability density function between x1 and x2.
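The "area under the density" statement can be checked numerically. The sketch below uses a standard normal density as the example and a simple midpoint rule; the helper names `normal_pdf` and `area` are illustrative, not from the slides.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def area(pdf, x1, x2, steps=10_000):
    """Midpoint-rule approximation of the integral of pdf over [x1, x2]."""
    dx = (x2 - x1) / steps
    return sum(pdf(x1 + (i + 0.5) * dx) for i in range(steps)) * dx


p_interval = area(normal_pdf, -1.0, 1.0)   # P(-1 <= X <= 1), numerically
p_exact = math.erf(1 / math.sqrt(2))       # closed form for the standard normal
p_point = area(normal_pdf, 0.5, 0.5)       # a single point has zero width, hence zero mass

print(p_interval, p_exact, p_point)
```

The last value illustrates why the probability of a continuous r.v. taking one particular value is zero even though the density there is positive.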

Probability mass: Gaussian Distribution

Uniform Density Function

Normal (Gaussian) Density Function

The distribution is symmetric, and is often illustrated as a bell-shaped curve. Two parameters, μ (mean) and σ (standard deviation), determine the location and shape of the distribution. The highest point on the normal curve is at the mean, which is also the median and mode.

Continuous Distributions

Expectation (the centre of mass, mean, first moment):

Sample mean:

Variance (the spread):

Sample variance:

Statistical Characterizations
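For a discrete r.v. the standard definitions are $E[X] = \sum_i v_i P(X = v_i)$ and $\mathrm{Var}(X) = E[(X - E[X])^2]$; the sample versions replace the distribution by observed data. The sketch below compares the two on a fair die, which is an assumed example. The sample variance here divides by N; dividing by N-1 is also common, and the slides' own choice is not recoverable.

```python
import random

values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6  # fair die, an assumption for illustration

# Population quantities computed from the distribution.
mean = sum(v * p for v, p in zip(values, probs))
var = sum((v - mean) ** 2 * p for v, p in zip(values, probs))

# Sample quantities computed from simulated rolls.
random.seed(0)
samples = random.choices(values, weights=probs, k=100_000)
sample_mean = sum(samples) / len(samples)
sample_var = sum((x - sample_mean) ** 2 for x in samples) / len(samples)  # 1/N version

print(mean, var)                # 3.5 and 35/12
print(sample_mean, sample_var)  # close to the values above
```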

Conditional Probability

P(X|Y) = fraction of worlds in which X is true given Y is true

- H = "having a headache"
- F = "coming down with Flu"
- P(H) = 1/10
- P(F) = 1/40
- P(H|F) = 1/2

P(H|F) = fraction of flu worlds in which you also have a headache = $P(H \cap F) / P(F)$

Definition: $P(X \mid Y) = \dfrac{P(X \cap Y)}{P(Y)}$

Corollary: The Chain Rule

$P(X \cap Y) = P(X \mid Y)\,P(Y)$


The Bayes Rule

What we have just done leads to the following general expression:

$P(Y \mid X) = \dfrac{P(X \mid Y)\,P(Y)}{P(X)}$

This is Bayes' rule.

Probabilistic Inference

- H = "having a headache"
- F = "coming down with Flu"
- P(H) = 1/10
- P(F) = 1/40
- P(H|F) = 1/2

The Problem:

P(F|H) = ?
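The problem can be worked directly from the three given numbers with the chain rule and Bayes' rule. The sketch below uses exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

p_h = Fraction(1, 10)          # P(H): having a headache
p_f = Fraction(1, 40)          # P(F): coming down with flu
p_h_given_f = Fraction(1, 2)   # P(H | F)

p_h_and_f = p_h_given_f * p_f  # chain rule: P(H and F) = P(H | F) P(F)
p_f_given_h = p_h_and_f / p_h  # Bayes' rule: P(F | H) = P(H | F) P(F) / P(H)

print(p_h_and_f)    # 1/80
print(p_f_given_h)  # 1/8
```

So a headache raises the probability of flu from 1/40 to 1/8, still far below P(H|F) = 1/2.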


Joint Probability

A joint probability distribution for a set of RVs gives the probability of every atomic event (sample point).

P(Flu, DrinkBeer) = a 2 × 2 matrix of values:

Every question about a domain can be answered by the joint distribution, as we will see later.

|    | B     | ¬B   |
|----|-------|------|
| F  | 0.005 | 0.02 |
| ¬F | 0.195 | 0.78 |

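Inference with the Joint: Compute Marginals

[Table: joint distribution P(F, B, H) over the eight truth assignments of Flu, DrinkBeer, and Headache, with probabilities 0.4, 0.1, 0.17, 0.2, 0.05, 0.05, 0.015, 0.015.]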


Inference with the Joint: Compute Conditionals


General idea: compute distribution on query variable by fixing evidence variables and summing over hidden variables
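The general idea can be sketched with the eight probabilities from the joint table. One loud assumption: the negation marks were lost from the table, so the assignment of probabilities to truth values below (binary counting order, starting from all-false) is a guess. Only the eight numbers themselves come from the slides, and the function names are illustrative.

```python
# Joint P(F, B, H); keys are (flu, beer, headache) truth values.
# ASSUMPTION: rows are in binary counting order from all-false to all-true.
joint = {
    (False, False, False): 0.4,
    (False, False, True): 0.1,
    (False, True, False): 0.17,
    (False, True, True): 0.2,
    (True, False, False): 0.05,
    (True, False, True): 0.05,
    (True, True, False): 0.015,
    (True, True, True): 0.015,
}
NAMES = ("F", "B", "H")


def marginal(**fixed):
    """Sum the joint over every variable not fixed, e.g. marginal(F=True, H=True)."""
    return sum(
        p
        for world, p in joint.items()
        if all(world[NAMES.index(name)] == value for name, value in fixed.items())
    )


def conditional(query, evidence):
    """P(query | evidence): fix the evidence, sum out hidden variables, normalize."""
    return marginal(**query, **evidence) / marginal(**evidence)


p_f = marginal(F=True)                                # marginal of Flu
p_fh = marginal(F=True, H=True)                       # Flu and Headache, Beer summed out
p_f_given_h = conditional({"F": True}, {"H": True})   # Flu given Headache

print(p_f, p_fh, p_f_given_h)
```

Under the assumed ordering this gives P(F) = 0.13, P(F, H) = 0.065, and P(F|H) = 0.065/0.365, up to floating-point rounding.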


Conditional Independence

Random variables X and Y are said to be independent if:

$P(X \cap Y) = P(X)\,P(Y)$

Alternatively, this can be written as P(X | Y) = P(X) and P(Y | X) = P(Y).

Intuitively, this means that telling you that Y happened does not make X more or less likely.

Note: This does not mean X and Y are disjoint!
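The product test can be applied to the 2 × 2 P(Flu, DrinkBeer) table from the Joint Probability slide. The sketch assumes the first row and column of that table are the "true" cases. The independence check itself does not depend on that labeling.

```python
from math import isclose

# P(Flu, DrinkBeer); keys are (flu, beer).
# ASSUMPTION: the first row/column of the slide's table is the "true" case.
joint = {
    (True, True): 0.005,
    (True, False): 0.02,
    (False, True): 0.195,
    (False, False): 0.78,
}

# Marginals, obtained by summing out the other variable.
p_flu = {f: sum(p for (ff, _), p in joint.items() if ff == f) for f in (True, False)}
p_beer = {b: sum(p for (_, bb), p in joint.items() if bb == b) for b in (True, False)}

# Independent iff every joint entry equals the product of its marginals.
independent = all(
    isclose(joint[f, b], p_flu[f] * p_beer[b], rel_tol=1e-9)
    for f in (True, False)
    for b in (True, False)
)

print(p_flu[True], p_beer[True], independent)
```

Flu and DrinkBeer come out independent here, yet P(Flu, DrinkBeer) = 0.005 > 0, so the two events are not disjoint.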

Rules of Independence --- by examples

P(Virus | DrinkBeer) = P(Virus) iff Virus is independent of DrinkBeer

P(Flu | Virus, DrinkBeer) = P(Flu | Virus) iff Flu is independent of DrinkBeer, given Virus

P(Headache | Flu, Virus, DrinkBeer) = P(Headache | Flu, DrinkBeer) iff Headache is independent of Virus, given Flu and DrinkBeer

Marginal and Conditional Independence

Recall that for events E (i.e., X = x) and H (say, Y = y), the conditional probability of E given H, written as P(E|H), is

P(E and H)/P(H) (= the probability that both E and H are true, given that H is true)

E and H are (statistically) independent if

P(E) = P(E|H) (i.e., the prob. that E is true doesn't depend on whether H is true); or equivalently P(E and H) = P(E)P(H).

E and F are conditionally independent given H if P(E|H,F) = P(E|H), or equivalently P(E,F|H) = P(E|H)P(F|H).

Why knowledge of Independence is useful

- Lower complexity (time, space, search, …)
- Motivates efficient inference for all kinds of queries
- Structured knowledge about the domain
  - easy to learn (both from experts and from data)
  - easy to grow

Density Estimation

A Density Estimator learns a mapping from a set of attributes to a probability.

Often known as parameter estimation if the distribution form is specified (e.g., Binomial, Gaussian, …).

Four important issues:

- Nature of the data (iid, correlated, …)
- Objective function (MLE, MAP, …)
- Algorithm (simple algebra, gradient methods, EM, …)
- Evaluation scheme (likelihood on test data, predictability, consistency, …)

Parameter Learning from iid data

Goal: estimate distribution parameters θ from a dataset of N independent, identically distributed (iid), fully observed training cases

$D = \{x_1, \dots, x_N\}$

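Maximum likelihood estimation (MLE)

One of the most common estimators.

With the iid and full-observability assumptions, write L(θ) as the likelihood of the data:

$L(\theta) = P(x_1, x_2, \dots, x_N; \theta) = \prod_{i=1}^{N} P(x_i; \theta)$

Pick the setting of parameters most likely to have generated the data we saw:

$\theta^{*} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \log L(\theta)$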

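Example 1: Bernoulli model

Data: We observed N iid coin tosses: $D = \{1, 0, 1, \dots, 0\}$

Representation:

Binary r.v.: $x_i \in \{0, 1\}$

Model: $P(x) = \theta^{x} (1-\theta)^{1-x}$, i.e., $P(x = 1) = \theta$ and $P(x = 0) = 1 - \theta$

How to write the likelihood of a single observation $x_i$?

$P(x_i) = \theta^{x_i} (1-\theta)^{1-x_i}$

The likelihood of the dataset $D = \{x_1, \dots, x_N\}$:

$P(x_1, \dots, x_N \mid \theta) = \prod_{i=1}^{N} \theta^{x_i} (1-\theta)^{1-x_i} = \theta^{\sum_i x_i} (1-\theta)^{N - \sum_i x_i}$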

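MLE

Objective function:

$\ell(\theta; D) = \log P(D \mid \theta) = n_h \log\theta + (N - n_h) \log(1-\theta)$, where $n_h = \sum_i x_i$ is the number of heads.

We need to maximize this w.r.t. θ.

Take derivatives w.r.t. θ:

$\dfrac{\partial \ell}{\partial \theta} = \dfrac{n_h}{\theta} - \dfrac{N - n_h}{1-\theta} = 0 \;\Rightarrow\; \hat\theta_{MLE} = \dfrac{n_h}{N}$, or $\hat\theta_{MLE} = \dfrac{1}{N} \sum_i x_i$

Frequency as sample mean.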
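The closed-form estimate can be sanity-checked against a brute-force search over the Bernoulli log-likelihood. The coin-toss data below are hypothetical, and the grid search is only a check, not part of the derivation.

```python
import math


def log_likelihood(theta, data):
    """Bernoulli log-likelihood of 0/1 data for head probability theta in (0, 1)."""
    n_h = sum(data)
    n_t = len(data) - n_h
    return n_h * math.log(theta) + n_t * math.log(1 - theta)


data = [1, 0, 1, 1, 0, 1, 0, 1]     # hypothetical tosses, 1 = head

theta_mle = sum(data) / len(data)   # closed form: frequency of heads = sample mean

# Brute-force check: the best theta on a fine grid should match the closed form.
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda t: log_likelihood(t, data))

print(theta_mle, theta_grid)  # 0.625 0.625
```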

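Overfitting

Recall that for the Bernoulli distribution, we have

$\hat\theta_{MLE} = \dfrac{n_h}{n_h + n_t}$, where $n_h$ and $n_t$ are the numbers of heads and tails observed.

What if we tossed too few times so that we saw zero heads? We have $\hat\theta_{MLE} = 0$, and we will predict that the probability of seeing a head next is zero!

The rescue: add a count n' to the observed counts, where n' is known as the pseudo- (imaginary) count.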
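A minimal sketch of the pseudo-count idea follows. The slide's exact formula is not recoverable, so the version here, which adds the same imaginary count to both outcomes, is an assumption. The function name and the counts are hypothetical.

```python
def estimate_head_prob(n_head, n_tail, pseudo=0.0):
    """Head probability with `pseudo` imaginary tosses added to each outcome.

    pseudo=0 gives the plain MLE. Adding the same count to heads and tails
    is an assumption of this sketch, not necessarily the slide's formula.
    """
    return (n_head + pseudo) / (n_head + n_tail + 2 * pseudo)


mle = estimate_head_prob(0, 3)               # three tosses, zero heads
smoothed = estimate_head_prob(0, 3, pseudo=1.0)

print(mle)       # 0.0: heads predicted to be impossible
print(smoothed)  # 0.2: small but nonzero
```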

But can we make this more formal?

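Example 2: univariate normal

Data: We observed N iid real samples: $D = \{-0.1, 10, 1, -5.2, \dots, 3\}$

Model: $P(x) = (2\pi\sigma^2)^{-1/2} \exp\{-(x-\mu)^2 / 2\sigma^2\}$

Log likelihood:

$\ell(\mu, \sigma^2; D) = \log P(D \mid \mu, \sigma^2) = -\dfrac{N}{2} \log(2\pi\sigma^2) - \dfrac{1}{2\sigma^2} \sum_{i=1}^{N} (x_i - \mu)^2$

MLE: take derivative and set to zero:

$\dfrac{\partial \ell}{\partial \mu} = \dfrac{1}{\sigma^2} \sum_i (x_i - \mu) = 0 \;\Rightarrow\; \mu_{MLE} = \dfrac{1}{N} \sum_i x_i$

$\dfrac{\partial \ell}{\partial \sigma^2} = -\dfrac{N}{2\sigma^2} + \dfrac{1}{2\sigma^4} \sum_i (x_i - \mu)^2 = 0 \;\Rightarrow\; \sigma^2_{MLE} = \dfrac{1}{N} \sum_i (x_i - \mu_{MLE})^2$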
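The Gaussian MLE is the sample mean and the 1/N sample variance. The sketch below treats the five values visible in the slide as the whole sample, purely for illustration; the elided values are not guessed.

```python
import math


def log_likelihood(mu, var, xs):
    """Log-likelihood of iid data xs under a normal with mean mu and variance var."""
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var) for x in xs)


# Only the values visible in the slide; treated as the full sample for illustration.
data = [-0.1, 10, 1, -5.2, 3]
N = len(data)

mu_mle = sum(data) / N                              # sample mean
var_mle = sum((x - mu_mle) ** 2 for x in data) / N  # 1/N variance (the MLE, not 1/(N-1))

print(mu_mle, var_mle)
print(log_likelihood(mu_mle, var_mle, data))
```

Moving either parameter away from these values lowers the log-likelihood, which is what setting the derivatives to zero expresses.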

The Bayesian Theory

The Bayesian theory (e.g., for data D and model M):

P(M|D) = P(D|M)P(M)/P(D)

The posterior equals the likelihood times the prior, up to a constant.

This allows us to capture uncertainty about the model in a principled way

Hierarchical Bayesian Models

- θ are the parameters for the likelihood p(x|θ)
- α are the parameters for the prior p(θ|α)
- We can have hyper-hyper-parameters, etc.
- We stop when the choice of hyper-parameters makes no difference to the marginal likelihood; typically make hyper-parameters constants.

Where do we get the prior?

- Intelligent guesses
- Empirical Bayes (Type-II maximum likelihood): computing point estimates of α, $\hat\alpha = \arg\max_{\alpha} p(D \mid \alpha)$

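Bayesian estimation for Bernoulli

Beta distribution:

$P(\theta; \alpha, \beta) = \dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta^{\alpha-1} (1-\theta)^{\beta-1}$

Posterior distribution of θ:

$P(\theta \mid x_1, \dots, x_N) = \dfrac{p(x_1, \dots, x_N \mid \theta)\, p(\theta)}{p(x_1, \dots, x_N)} \propto \theta^{n_h} (1-\theta)^{n_t} \times \theta^{\alpha-1} (1-\theta)^{\beta-1} = \theta^{n_h+\alpha-1} (1-\theta)^{n_t+\beta-1}$

where $n_h$ and $n_t$ are the numbers of heads and tails in the data.

Notice the isomorphism of the posterior to the prior; such a prior is called a conjugate prior.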

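Bayesian estimation for Bernoulli, cont'd

Posterior distribution of θ:

$P(\theta \mid x_1, \dots, x_N) \propto \theta^{n_h+\alpha-1} (1-\theta)^{n_t+\beta-1}$, with $n_h$ heads and $n_t$ tails among the $N$ tosses.

Maximum a posteriori (MAP) estimation:

$\hat\theta_{MAP} = \arg\max_{\theta} \log P(\theta \mid x_1, \dots, x_N) = \dfrac{n_h + \alpha - 1}{N + \alpha + \beta - 2}$

Posterior mean estimation:

$\hat\theta_{Bayes} = \int \theta\, p(\theta \mid D)\, d\theta = \dfrac{n_h + \alpha}{N + \alpha + \beta}$

Prior strength: A = α + β. A can be interpreted as the size of an imaginary data set from which we obtain the pseudo-counts. Beta parameters can be understood as pseudo-counts.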
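The pseudo-count reading of the Beta parameters can be seen in a few lines. In the sketch below the counts and the Beta(2, 2) prior are hypothetical, and the function names are illustrative. The mode and mean formulas are the standard ones for a Beta distribution.

```python
def beta_posterior(n_h, n_t, alpha, beta):
    """Beta posterior parameters after n_h heads and n_t tails, Beta(alpha, beta) prior."""
    return alpha + n_h, beta + n_t


def map_estimate(a, b):
    """Mode of Beta(a, b); valid when a > 1 and b > 1."""
    return (a - 1) / (a + b - 2)


def posterior_mean(a, b):
    """Mean of Beta(a, b)."""
    return a / (a + b)


# Hypothetical data: three tosses, zero heads. Hypothetical prior: Beta(2, 2).
a_post, b_post = beta_posterior(n_h=0, n_t=3, alpha=2.0, beta=2.0)

theta_map = map_estimate(a_post, b_post)     # (2 - 1) / (2 + 5 - 2) = 0.2
theta_mean = posterior_mean(a_post, b_post)  # 2 / 7

print(theta_map, theta_mean)
```

Even with zero observed heads, both estimates stay above zero because the prior contributes imaginary heads.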

Bayesian estimation for normal distribution

Normal Prior:

Joint probability:

Posterior:

Sample mean
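The slide's formulas are missing, so the sketch below falls back on the standard conjugate result for a normal likelihood with known variance and a normal prior on the mean. The symbols `sigma2` (known data variance), `mu0` and `tau2` (prior mean and variance) are assumed names, and the data are hypothetical.

```python
def normal_posterior(xs, sigma2, mu0, tau2):
    """Posterior mean and variance of mu for x_i ~ N(mu, sigma2) with prior mu ~ N(mu0, tau2).

    sigma2 is assumed known. The posterior precision is the sum of the data
    precision n / sigma2 and the prior precision 1 / tau2.
    """
    n = len(xs)
    xbar = sum(xs) / n  # sample mean
    precision = n / sigma2 + 1 / tau2
    post_mean = (n / sigma2 * xbar + mu0 / tau2) / precision
    return post_mean, 1 / precision


# Hypothetical data and prior, chosen so the arithmetic is easy to follow.
post_mean, post_var = normal_posterior([1.0, 2.0, 3.0], sigma2=1.0, mu0=0.0, tau2=1.0)

print(post_mean, post_var)  # 1.5 0.25
```

The posterior mean is a precision-weighted average of the sample mean (2.0) and the prior mean (0.0), so it moves toward the sample mean as more data arrive.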

We talk about the probability of something, but what is the precise mathematical definition of probability? What is probability? Is it a number, between zero and one, or something else?

A rigorous definition involves measure theory, but for now we are satisfied to just say: a probability P(A) is a function that maps an event A onto the interval [0, 1].

One thing that is easy to confuse here is: probability is defined on event space, instead of sample