# Bayesian Investing Presentation

2011 June 6

*I wish I had learned Bayesian probability earlier in my experience as an investor. It has helped me understand why the quantitative methods ordinarily used tend to underestimate risk. It has also explained much better why we tend to overvalue performance information, and how to correct for that to really get at skill. The following is the substance of a PowerPoint presentation I gave in 2006 to a local Boston quantitative investing interest group, QWAFAFEW. Despite, or perhaps because of, its tongue-in-cheek name, this group has been the source of a valuable exchange of ideas in the Boston investment community.*

**Bayesian & Qualitative Approaches to Quantitative Investing**

Jarrod Wilcox

QWAFAFEW

December 12, 2006

### BAYESIAN INFERENCE FOR QUANTITATIVE INVESTORS

#### In the super-competitive investment arena:

- Changing environments limit relevant data.
- Scientific consensus is not necessarily rewarding.

#### We can still benefit from:

- Scientific reasoning
- Reduced impact of emotion & cognitive error
- Optimal learning from data
- Private Bayesian priors.

### FAMILIAR INVESTMENT APPLICATIONS

#### Bayesian alpha adjustment

- Black-Litterman: shrinks excess return point estimates toward CAPM-based prior.

#### Bayesian portfolio risk estimates:

- Ledoit & Wolf: shrinks covariance point estimates toward empirically-based priors.
- Michaud: incorporates uncertainty around covariance point estimates.

### TIMELINE

#### Dawn of probabilistic reasoning:

- Bernoulli, Bayes, Laplace (1700’s to 1800’s), games of chance, probability as odds.

#### Classical statistics and probability:

- Fisher, Neyman, Pearson etc. (early 1900’s), probability as frequency.
- Extensions to multi-step processes, example Feller (1900’s)
- Kolmogorov, rigorous axioms.

#### Bayesian rebellion against frequentists:

- Polya, Cox, Jeffreys, Jaynes, Savage, Raiffa & Schlaifer: decisions where data is limited (mid 1900’s)

#### The new Bayesians:

- “Empirical Bayes”, hierarchical estimation (example: Stein-James shrinkage), distributed predictions (mid-late 1900’s)
- Iterative techniques: Markov Chain Monte Carlo simulation (1990’s to present).

### PROBABILITY: THE LOGIC OF SCIENCE

#### Probability axioms about A (assertions) and D (data):

- A iff P(A)=1, (not A) iff P(A)=0
- P(A1 or A2) = P(A1) + P(A2) – P(A1 and A2)
- P(A1 and A2) = P(A1) * P(A2 | A1)

#### From which Bayes Rule logically follows:

- P(A | D) = P(A) * P(D | A) / P(D)
- Posterior probability = prior * likelihood / normalization
- Normalization constant P(D) is the sum of P(Ai) * P(D | Ai) over all mutually exclusive Ai.

#### We always have a prior P(A), even if it is uninformative.
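As a minimal sketch of the rule above, the function below computes posteriors over a set of mutually exclusive assertions, with P(D) obtained as the sum of prior times likelihood. The function name `bayes_posterior` and the "skilled" versus "lucky" numbers are hypothetical illustrations, not figures from the presentation.

```python
from fractions import Fraction


def bayes_posterior(priors, likelihoods):
    """Return P(A_i | D) for mutually exclusive, exhaustive assertions A_i.

    priors[a]      = P(a)
    likelihoods[a] = P(D | a)
    """
    joint = {a: priors[a] * likelihoods[a] for a in priors}  # P(a) * P(D | a)
    p_data = sum(joint.values())                             # normalization P(D)
    return {a: j / p_data for a, j in joint.items()}


# Hypothetical numbers: a manager is "skilled" with prior 1/10, and a skilled
# manager beats the benchmark in a given year with probability 3/5 versus 1/2.
priors = {"skilled": Fraction(1, 10), "lucky": Fraction(9, 10)}
likelihoods = {"skilled": Fraction(3, 5), "lucky": Fraction(1, 2)}
posterior = bayes_posterior(priors, likelihoods)
print(posterior["skilled"])  # 2/17
```

With these made-up inputs, one good year moves the probability of skill only from 1/10 to 2/17, which is the kind of correction to performance information the introduction refers to.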

### SIMPLE DISCRETE EXAMPLE

#### Situation:

- Two fair dice are rolled, and the sum of their faces is 8. What is the probability that a 2 and a 6 are present?

#### Posterior = Prior * Likelihood / Normalization

#### P(A|D) = P(A) * P(D|A) / P(D)

- P(A) = 1/36 + 1/36 = 2/36
- P(D|A) = 1
- P(D) = 2/36 + 2/36 + 1/36 = 5/36

#### P(A|D) = 2/5
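The arithmetic can be checked by brute-force enumeration of the 36 equally likely rolls. This is only a verification sketch; the function name `dice_posterior` is an illustrative choice.

```python
from fractions import Fraction
from itertools import product


def dice_posterior():
    """P(a 2 and a 6 are present | the two faces sum to 8), by enumeration."""
    rolls = list(product(range(1, 7), repeat=2))       # 36 ordered outcomes
    a = [r for r in rolls if set(r) == {2, 6}]         # A: (2,6) or (6,2)
    d = [r for r in rolls if sum(r) == 8]              # D: the sum is 8
    a_and_d = [r for r in a if sum(r) == 8]

    prior = Fraction(len(a), len(rolls))               # P(A)   = 2/36
    likelihood = Fraction(len(a_and_d), len(a))        # P(D|A) = 1
    p_data = Fraction(len(d), len(rolls))              # P(D)   = 5/36
    return prior * likelihood / p_data


print(dice_posterior())  # 2/5
```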

### MAKING A QUALITATIVE DECISION MORE QUANTITATIVE

**Should I invest in Canadian farmland?**

- Higher future use of foodstock for energy, China food consumption
- Global warming impact?

#### Can probability of a big payoff be posed?

- Probability of being right when the market is wrong given record of my past similar cosmic predictions.
- Confidence that in this case I am right and market is wrong.

#### Would scenario analysis be helpful?

### SINGLE-PARAMETER CONTINUOUS DISTRIBUTION

**Binomial-generated probability Θ of cumulative “wins” W versus “losses” L:**

- Use the beta distribution with α=W+1, β=L+1.

- Convenient conjugate property: posterior has same form as prior. The prior can be interpreted as additional win-loss data.
- Beta(α1, β1) * Beta(α2, β2) ∝ Beta(α1+α2−1, β1+β2−1): multiplying the densities adds the win and loss counts.
- Beta(1,1) is an uninformative prior.
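A minimal sketch of the conjugate update follows, assuming the α=W+1, β=L+1 convention above. The win-loss counts are hypothetical, and the helper names are illustrative.

```python
def beta_update(alpha, beta, wins, losses):
    """Conjugate update: observed wins and losses are added to the parameters."""
    return alpha + wins, beta + losses


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def beta_sd(alpha, beta):
    """Standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return (alpha * beta / (total ** 2 * (total + 1))) ** 0.5


# Hypothetical record: 7 wins and 3 losses, starting from the flat Beta(1, 1).
a, b = beta_update(1, 1, wins=7, losses=3)            # Beta(8, 4)

# Learning the same record in two batches ends at the same posterior,
# which is why a prior can be read as earlier win-loss data.
a2, b2 = beta_update(*beta_update(1, 1, 4, 1), 3, 2)  # also Beta(8, 4)

print(a, b, round(beta_mean(a, b), 4), round(beta_sd(a, b), 4))
```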

### EXAMPLE

**What is the probability of value strategies outperforming growth strategies next month?**

- Data: monthly returns for the S&P 500 value and growth sub-indices as maintained by BARRA from Jan 75 through Dec 03.
- Assumption: An uninformed prior at the beginning, no knowledge of structure such as autocorrelation.

### WHAT IF MONTHS WERE EXCHANGEABLE? ARE VALUE MONTHS MORE FREQUENT?

### WHEN PRIORS MEET LIKELIHOOD

The mean of the posterior density is a weighted average of the means of the prior and likelihood densities.

The weights are proportional to the relative precision of the two estimates.
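For the beta model this weighting can be written out exactly. In the sketch below, the pseudo-observation count α+β of the prior and the number of observed months W+L play the role of the precisions. The prior Beta(6, 6) and the 30-18 record are hypothetical values chosen for the illustration.

```python
def posterior_mean_weighted(alpha0, beta0, wins, losses):
    """Posterior mean of a beta model as a weighted average.

    Returns (posterior_mean, weight_on_prior). The weights are the relative
    sizes of the prior pseudo-sample and the observed sample.
    """
    n_prior = alpha0 + beta0            # size of the prior "pseudo-sample"
    n_data = wins + losses              # size of the observed sample
    w_prior = n_prior / (n_prior + n_data)
    prior_mean = alpha0 / n_prior
    data_frequency = wins / n_data
    return w_prior * prior_mean + (1 - w_prior) * data_frequency, w_prior


# Hypothetical: prior Beta(6, 6), then 30 winning months out of 48.
mean, w_prior = posterior_mean_weighted(6, 6, wins=30, losses=18)
direct = (6 + 30) / (6 + 6 + 30 + 18)   # mean of the Beta(36, 24) posterior
print(mean, w_prior, direct)
```

The weighted form and the direct posterior mean agree, and the weight on the prior falls toward zero as months accumulate.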

Note that in this single-parameter model, more data *always* leads to less dispersion of probability.

### CONJUGATE BETA LEARNING MODEL

#### Simple, flexible, intuitive.

#### Assumes

- Minimal knowledge of IID process
- No scale parameter, location uncertainty is always reduced by data.

#### Potential applications

- Semi-qualitative decisions. Will the Fed raise rates? Will the correlation between stock and bond returns be positive?
- “Non-parametric” identification of potential return forecasting signals. News coding.
- Working backward to discover implicit priors (possible because of unique solution). Are you crazy?

### MULTI-PARAMETER DENSITY

#### (Location, Dispersion, Shape)

#### Factor the conditional probabilities.

- Example: P(μ,σ | D) = P(μ | σ,D) * P(σ | D)
- Where P(σ | D) = ∫ P(μ,σ | D) dμ

#### Some simple cases have been worked out in closed form.

- Example: Normally-distributed process with unknown mean and variance.

#### Others require Monte Carlo simulation.

### PERFORMANCE PROJECTION

#### By how much should a large-cap value manager beat the S&P500?

- We want to know both location and scale.
- Assume IID log normal process Jan 75 – Dec 03.
- Excess kurtosis, predictable variation ignored.

- For convenience, we will use a conjugate prior:
  - σ^{2} distributed as scaled inverse chi-squared
  - Mean distributed as N(μ, σ^{2}/n).
- Priors for mean (0%), standard deviation (2%) and inverse chi-squared degrees of freedom (12) for σ^{2}.

### UPDATING THE DISPERSION

### UPDATING THE MEAN
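For the conjugate normal / scaled inverse chi-squared prior described under PERFORMANCE PROJECTION, the textbook closed-form updates for dispersion and mean (in the parameterization used in Bayesian Data Analysis) can be sketched as follows. The prior mean, standard deviation, and degrees of freedom come from the slide. The prior sample size `kappa0` and the four monthly returns are assumptions made only for this illustration.

```python
from statistics import mean, variance


def nix_update(mu0, kappa0, nu0, sigma0_sq, data):
    """Update a normal / scaled-inverse-chi-squared prior with IID normal data.

    mu0, kappa0    : prior mean and its equivalent number of observations
    nu0, sigma0_sq : prior degrees of freedom and scale for the variance
    Returns (mu_n, kappa_n, nu_n, sigma_n_sq).
    """
    n = len(data)
    ybar = mean(data)
    s_sq = variance(data) if n > 1 else 0.0   # sample variance (n - 1 divisor)

    kappa_n = kappa0 + n
    nu_n = nu0 + n
    # Mean: precision-weighted average of the prior mean and the sample mean.
    mu_n = (kappa0 * mu0 + n * ybar) / kappa_n
    # Dispersion: prior scale, sample variance, and a term for prior/data conflict.
    sigma_n_sq = (nu0 * sigma0_sq
                  + (n - 1) * s_sq
                  + kappa0 * n / kappa_n * (ybar - mu0) ** 2) / nu_n
    return mu_n, kappa_n, nu_n, sigma_n_sq


# Prior from the slide: mean 0%, standard deviation 2%, 12 degrees of freedom.
# kappa0 = 12 and the returns below are hypothetical.
returns = [0.01, -0.02, 0.03, 0.00]
mu_n, kappa_n, nu_n, sigma_n_sq = nix_update(0.0, 12, 12, 0.02 ** 2, returns)
print(mu_n, kappa_n, nu_n, sigma_n_sq ** 0.5)
```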

### TO PREDICT NEW DATA

#### Point estimate:

- D_{predicted} = mean of μ distribution

#### Full Bayesian estimate:

- Distribution of D_{pred} ~ ∫∫ P(D_{pred} | μ,σ,D) P(μ,σ | D) dμ dσ
- Here, repeat sequential draws of σ, μ|σ and N(μ,σ) until a forecast distribution is formed.

#### In naïve Markowitz portfolio optimization of many assets, point covariance estimate projections are inadequate.
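The sequential-draw recipe for the full Bayesian estimate can be sketched with the standard library alone. The sketch assumes the normal / scaled inverse chi-squared form, with the posterior summarized by a mean, an equivalent sample size, degrees of freedom, and a variance scale. The parameter values in the usage lines are placeholders (the slide's prior values plus an assumed `kappa_n`), not fitted results.

```python
import math
import random
from statistics import mean, stdev


def draw_predictive(mu_n, kappa_n, nu_n, sigma_n_sq, n_draws=20000, seed=0):
    """Simulate the forecast distribution by drawing sigma, then mu|sigma, then data."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # sigma^2 ~ scaled inverse chi-squared(nu_n, sigma_n_sq);
        # a chi-squared with nu dof is Gamma(shape=nu/2, scale=2).
        chi_sq = rng.gammavariate(nu_n / 2.0, 2.0)
        sigma_sq = nu_n * sigma_n_sq / chi_sq
        # mu | sigma^2 ~ N(mu_n, sigma^2 / kappa_n)
        mu = rng.gauss(mu_n, math.sqrt(sigma_sq / kappa_n))
        # new observation | mu, sigma^2 ~ N(mu, sigma^2)
        draws.append(rng.gauss(mu, math.sqrt(sigma_sq)))
    return draws


# Placeholder inputs: mean 0%, scale 2%, 12 degrees of freedom, kappa_n assumed 12.
forecast = draw_predictive(0.0, 12, 12, 0.02 ** 2)
print(round(mean(forecast), 4), round(stdev(forecast), 4))
```

The simulated forecast is wider than the 2% scale because uncertainty about both μ and σ is carried through, which is the information a point estimate discards.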

### CONJUGATE NORMAL INVERSE CHI-SQUARED LEARNING MODEL

#### Very widely applicable

- If process is close to IID normal or log-normal
- Any univariate data IID case where Central Limit Theorem kicks in.

#### Dispersion of probability for unknown variance leads posterior for the mean to have Student’s t fat tails.

#### Applications

- Decisions where scale of dispersion is important: ranking of active managers. Extension: Black-Litterman portfolio optimization.
- Where learning can be speeded up by the addition of priors to evidence. Extension: Bayesian regression.
- Signals when estimate dispersion increases with increasing data: Evaluation of outliers questioning active strategies.
- Sequential decision-making: How many observations are needed to support a model?

### HIERARCHICAL ESTIMATION

#### Assemble a hierarchy of estimates to better combine group and individual information.

- Basic idea underlying Stein-James estimates and Ledoit-Wolf approach to better covariance estimation.

#### Our example: Estimation of future return differences among similar mutual funds.

### FORM EXCHANGEABLE GROUP

#### Exchangeability means not that the mutual funds are the same, but rather that our knowledge of their pertinent factors is the same.

#### Morningstar screen:

- Style: large cap, mixed (neither strong value nor strong growth), not international, beta between 0.8 and 1.2.
- Stock pickers: number of holdings between 100 and 250.
- Data availability: Morningstar ratings and 8 years of Yahoo Finance monthly return history.
- Independence: First fund listed in fund family satisfying screen.

### DATA AND MODEL

#### Data:

- Monthly excess returns net of equal-weighted group average of 14 funds, 96 months ending Nov 2006.

#### Model:

- Fund j sample mean m_{j} of excess log returns ~ N(Θ_{j}, σ_{j}^{2}/96) [Central Limit Theorem]
- Θ_{j} ~ N(µ, τ^{2})
- “Empirical Bayes”: σ_{j}^{2} are estimated directly from the data, then treated as knowns.

### ESTIMATION PROCESS

#### For each value of τ in a wide grid:

- Calculate P(τ | m_{1}, m_{2}, …, σ_{1}^{2}, σ_{2}^{2}, …)

#### Draw 10,000 samples of τ from this distribution and for each:

- Draw grand mean µ ~ N(f(τ, m_{1}, m_{2}, …, σ_{1}^{2}, σ_{2}^{2}, …), g(τ, σ_{1}^{2}, σ_{2}^{2}, …))
- For each fund:
  - Calculate Bayesian Θ_{mean} and then Θ_{variance}
  - Draw possible Θ_{j} from N(Θ_{mean j}, Θ_{variance j})

#### The result is a probability distribution for each Θ_{j} which is “shrunk” toward the grand mean by amounts related to individual variances relative to estimated variance of Θ’s.
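The estimation process can be sketched as below. It follows the standard normal hierarchical model with known sampling variances, as treated in Bayesian Data Analysis, and assumes a flat prior on τ over the grid. The four fund means and their standard errors are hypothetical stand-ins for the 14-fund data, and `se2` denotes the sampling variance of each fund's sample mean (σ_{j}^{2}/96 in the notation above).

```python
import math
import random


def grand_mean_params(tau, m, se2):
    """Conditional mean and variance of the grand mean mu given tau."""
    prec = [1.0 / (s + tau ** 2) for s in se2]
    v_mu = 1.0 / sum(prec)
    mu_hat = v_mu * sum(p * x for p, x in zip(prec, m))
    return mu_hat, v_mu, prec


def log_post_tau(tau, m, se2):
    """log P(tau | m, se2) up to a constant, assuming a flat prior on tau."""
    mu_hat, v_mu, prec = grand_mean_params(tau, m, se2)
    lp = 0.5 * math.log(v_mu)
    for p, x in zip(prec, m):
        lp += 0.5 * math.log(p) - 0.5 * p * (x - mu_hat) ** 2
    return lp


def sample_thetas(m, se2, tau_grid, n_draws=10000, seed=0):
    """Draw tau from its grid posterior, then mu | tau, then each theta_j."""
    rng = random.Random(seed)
    logs = [log_post_tau(t, m, se2) for t in tau_grid]
    top = max(logs)
    weights = [math.exp(lp - top) for lp in logs]   # unnormalized grid posterior
    draws = []
    for tau in rng.choices(tau_grid, weights=weights, k=n_draws):
        mu_hat, v_mu, _ = grand_mean_params(tau, m, se2)
        mu = rng.gauss(mu_hat, math.sqrt(v_mu))
        row = []
        for x, s in zip(m, se2):
            var = 1.0 / (1.0 / s + 1.0 / tau ** 2)    # Theta variance for fund j
            mean_j = var * (x / s + mu / tau ** 2)    # Theta mean, shrunk toward mu
            row.append(rng.gauss(mean_j, math.sqrt(var)))
        draws.append(row)
    return draws


# Hypothetical monthly mean excess log returns for four funds, each with a 2%
# monthly standard deviation over 96 months.
m = [0.004, 0.001, -0.002, -0.003]
se2 = [0.02 ** 2 / 96] * 4
tau_grid = [0.0001 * k for k in range(1, 101)]      # grid starts above zero
draws = sample_thetas(m, se2, tau_grid)
shrunk = [sum(row[j] for row in draws) / len(draws) for j in range(len(m))]
print([round(x, 4) for x in shrunk])
```

In this made-up example the posterior means keep the ranking of the raw fund means but sit closer to the group average.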

### SHRINKAGE OF NAÏVE ESTIMATES

These funds were all middle of the road large cap stock pickers.

Is this picture more realistic than unadjusted individual fund records?

What would happen if we included more funds?

What would be a next step?

### POTENTIAL HIERARCHICAL APPLICATIONS

#### Univariate: more sophisticated evaluation of manager skill, and one’s own skill.

#### Multivariate Extensions:

- Estimating sector and individual stock characteristics as sources of alpha.
- Ledoit and Wolf covariance shrinkage estimation.

#### What else?

### HIERARCHICAL MODEL PERSPECTIVE

#### Big improvements in estimation error for grouped data.

#### The 96 observation example gave us an excuse to regard the individual variances as known.

#### With unknown individual variances, grid-based simulation of so many parameters becomes impractical.

#### Then we need intelligent sampling – Markov Chain Monte Carlo.

### RECOMMENDED TEXTS

#### Bayesian Data Analysis, 2nd Edition

- Andrew Gelman, John B. Carlin, Hal S. Stern and Donald B. Rubin

#### Risk and Asset Allocation

- Attilio Meucci

#### Probability Theory: the Logic of Science

- Edwin Jaynes

I think frequentist statistics has advantages and disadvantages, the same as Bayesian statistics. Frequentist statistics is the most widely used because it makes difficult problems and models tractable using scalar statistics, and it makes direct inferences that, although they rely strongly on asymptotic distributions, give a result on which everybody agrees. The difficulty of properly modeling the prior distribution results in additional discussion over this step of the inference rather than over its results. Also, the frequentist definition of probability is more intuitive and gives a clear meaning to a statement such as p=0.68. A “degree of belief” is difficult to interpret and to compare with experiment.

Many difficulties in frequentist statistics arise from the fact that its methods were developed in the early 20th century. The lack of numerical power made it possible to use only simple statistics such as mean, variance, kurtosis, etc., and to compare the observed value with a table. Bayesian statistics could not be computed properly in complicated cases, and only with the rise of tractable numerical approximations did Bayesian methods become competitive. I think that with new research into more general and computationally intensive frequentist inference methods, many of the problems of this approach can be resolved, at least in part.

To be fair, I like some points of Bayesian statistics, especially the fact that probabilities are related not to inherent randomness but to ignorance of the causes of phenomena. That makes more sense to me than real randomness in nature. It also makes inference straightforward with Bayes’ theorem.

I feel that both systems have good points and weak points. Maybe in the future a new inference method will be discovered that possesses the characteristics of both systems and surpasses them.

Patricia

Thank you for your thoughtful comment. For me, Bayesian logic is just a system for logical inference about partial truths. It is useful for decision-making for the individual or for groups who can agree on priors and evidence. It is indifferent as to whether one thinks of its subject as subjective or objective. On the other hand, the scientific method, with repeatable experiments, is all about transferring “objective” knowledge across people. Frequentist probability is useful as an adjunct because it assists in that process even where the knowledge is imprecise. But because it is founded on the avoidance of disagreement, the reasoning can sometimes get rather tortured.

At any rate, that is my view. But others with different priors will have different views.

Jarrod
