Bayesian Investing Presentation
2011 June 6
I wish I had learned Bayesian probability earlier in my time as an investor. It has helped me understand why commonly used quantitative methods tend to underestimate risk. It also explains much better why we tend to overvalue performance information, and how to correct for that to get at real skill. The following is the substance of a PowerPoint presentation I gave in 2006 to QWAFAFEW, a local Boston quantitative investing interest group. Despite, or perhaps because of, its tongue-in-cheek name, this group has been a source of valuable exchanges of ideas in the Boston investment community.
Bayesian & Qualitative Approaches to Quantitative Investing
December 12, 2006
BAYESIAN INFERENCE FOR QUANTITATIVE INVESTORS
In the super-competitive investment arena:
- Changing environments limit relevant data.
- Scientific consensus is not necessarily rewarding.
We can still benefit from:
- Scientific reasoning
- Reduced impact of emotion & cognitive error
- Optimal learning from data
- Private Bayesian priors.
FAMILIAR INVESTMENT APPLICATIONS
Bayesian alpha adjustment
- Black-Litterman: shrinks excess return point estimates toward CAPM-based prior.
Bayesian portfolio risk estimates:
- Ledoit & Wolf: shrinks covariance point estimates toward empirically-based priors.
- Michaud: incorporates uncertainty around covariance point estimates.
Dawn of probabilistic reasoning:
- Bernoulli, Bayes, Laplace (1700’s to 1800’s), games of chance: probability as odds
Classical statistics and probability:
- Fisher, Neyman, Pearson etc. (early 1900’s), probability as frequency
- Extensions to multi-step processes, example Feller (1900’s)
- Kolmogorov, rigorous axioms.
Bayesian rebellion against frequentists:
- Polya, Cox, Jeffreys, Jaynes, “Savage, Raiffa & Schlaifer,” decisions where data is limited (mid 1900’s)
The new Bayesians:
- “Empirical Bayes”, hierarchical estimation (example: James-Stein shrinkage), distributed predictions (mid-late 1900’s)
- Iterative techniques: Markov Chain Monte Carlo simulation (1990’s to present).
PROBABILITY: THE LOGIC OF SCIENCE
Probability axioms about A (assertions) and D (data):
- A iff P(A)=1, (not A) iff P(A)=0
- P(A1 or A2) = P(A1) + P(A2) – P(A1 and A2)
- P(A1 and A2) = P(A1) * P(A2 | A1)
From which Bayes Rule logically follows:
- P(A | D) = P(A) * P(D | A) / P(D)
- Posterior probability = prior * likelihood / normalization
- Normalization constant P(D) is the sum of P(Ai) * P(D | Ai) over all mutually exclusive Ai.
We always have a prior P(A), even if it is uninformative.
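Bayes Rule for a finite set of mutually exclusive assertions can be sketched in a few lines of Python. The function name `bayes_posterior` and the skilled-versus-unskilled manager numbers are hypothetical, chosen only to illustrate the arithmetic; they are not from the presentation.

```python
from fractions import Fraction

def bayes_posterior(priors, likelihoods):
    """Return P(A_i | D) for mutually exclusive, exhaustive assertions A_i.

    priors[i] is P(A_i); likelihoods[i] is P(D | A_i).
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)  # P(D) = sum over i of P(A_i) * P(D | A_i)
    return [j / evidence for j in joint]

# Hypothetical inputs: 1 manager in 10 is skilled. A skilled manager beats
# the benchmark in a given year with probability 3/5, an unskilled one 1/2.
priors = [Fraction(1, 10), Fraction(9, 10)]
likelihoods = [Fraction(3, 5), Fraction(1, 2)]

# D = "the manager beat the benchmark this year"
posterior = bayes_posterior(priors, likelihoods)
print(posterior)  # probabilities of skilled and unskilled, given D
```

Under these assumed numbers, one good year moves the probability of skill only from 1/10 to 2/17, which is the sense in which raw performance information is easy to overvalue.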
SIMPLE DISCRETE EXAMPLE
- Two fair dice are rolled, and the sum of their faces is 8. What is the probability that a 2 and a 6 are present?
Posterior = Prior * Likelihood / Normalization
P(A|D) = P(A) * P(D|A) / P(D)
- P(A) = 1/36 + 1/36
- P(D|A) = 1
- P(D) = 2/36+2/36+1/36
P(A|D) = 2/5
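The dice result can be checked by brute-force enumeration of the 36 equally likely rolls. This is only a verification sketch; the helper names `prob`, `has_2_and_6` and `sums_to_8` are made up for the illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely ordered outcomes of two fair dice.
rolls = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, given as a predicate on a (die1, die2) roll."""
    return Fraction(sum(1 for r in rolls if event(r)), len(rolls))

def has_2_and_6(r):
    return set(r) == {2, 6}

def sums_to_8(r):
    return sum(r) == 8

p_a = prob(has_2_and_6)   # prior: rolls (2,6) and (6,2)
p_d = prob(sums_to_8)     # normalization: (2,6), (6,2), (3,5), (5,3), (4,4)
p_d_given_a = prob(lambda r: has_2_and_6(r) and sums_to_8(r)) / p_a  # likelihood

p_a_given_d = p_a * p_d_given_a / p_d
print(p_a_given_d)  # 2/5
```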
MAKING A QUALITATIVE DECISION MORE QUANTITATIVE
Should I invest in Canadian farmland?
- Higher future use of foodstock for energy, China food
- Global warming impact?
Can probability of a big payoff be posed?
- Probability of being right when the market is wrong given record of my past similar cosmic predictions.
- Confidence that in this case I am right and market is wrong.
Would scenario analysis be helpful?
SINGLE-PARAMETER CONTINUOUS DISTRIBUTION
Binomial-generated probability Θ of cumulative “wins” W versus “losses” L:
- Use the beta distribution with α=W+1, β=L+1.
- Convenient conjugate property: posterior has same form as prior. The prior can be interpreted as additional win-loss data.
- Beta(α1, β1) * Beta(α2, β2) ∝ Beta(α1+α2−1, β1+β2−1): wins and losses simply add.
- Beta(1,1) is an uninformative prior.
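A minimal sketch of the conjugate update, assuming the convention above (α = W+1, β = L+1, so an uninformative prior is Beta(1,1)). The win-loss counts are hypothetical. It also shows that updating in two batches gives the same posterior as updating once with all the data.

```python
def beta_update(alpha, beta, wins, losses):
    """Posterior Beta parameters after observing new wins and losses."""
    return alpha + wins, beta + losses

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

def beta_sd(alpha, beta):
    """Standard deviation of a Beta(alpha, beta) distribution."""
    n = alpha + beta
    return (alpha * beta / (n * n * (n + 1))) ** 0.5

prior = (1, 1)  # uninformative prior

# Hypothetical record: 7 wins and 5 losses in year one, 6 and 6 in year two.
after_first = beta_update(*prior, wins=7, losses=5)
after_second = beta_update(*after_first, wins=6, losses=6)
all_at_once = beta_update(*prior, wins=13, losses=11)

print(after_second, all_at_once)
print(beta_mean(*after_second), beta_sd(*after_second))
```

Because yesterday's posterior is today's prior, an informative prior such as Beta(8, 6) can be read as seven wins and five losses of earlier evidence.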
What is the probability of value strategies outperforming growth strategies next month?
- Data: monthly returns for the S&P 500 value and growth sub-indices as maintained by BARRA from Jan 75 through Dec 03.
- Assumption: An uninformed prior at the beginning, no knowledge of structure such as autocorrelation.
WHAT IF MONTHS WERE EXCHANGEABLE? ARE VALUE MONTHS MORE FREQUENT?
WHEN PRIORS MEET LIKELIHOOD
The mean of the posterior density is a weighted average of the prior mean and the estimate implied by the likelihood.
The weights are proportional to the relative precision of the two estimates.
Note that in this single-parameter model, more data always leads to less dispersion of probability.
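For the beta-binomial case, the weighted average can be written out directly. In this sketch the role of precision is played by counts: the prior's pseudo-observations (α0 + β0) versus the number of real observations. The function name and the inputs are hypothetical.

```python
def posterior_mean_as_blend(alpha0, beta0, wins, losses):
    """Posterior mean of a win probability, computed two equivalent ways."""
    prior_n = alpha0 + beta0      # pseudo-observations carried by the prior
    data_n = wins + losses        # real observations
    prior_mean = alpha0 / prior_n
    data_mean = wins / data_n     # observed win frequency
    w_prior = prior_n / (prior_n + data_n)
    blended = w_prior * prior_mean + (1 - w_prior) * data_mean
    direct = (alpha0 + wins) / (prior_n + data_n)  # mean of the Beta posterior
    return blended, direct, w_prior

# Hypothetical inputs: a Beta(6, 4) prior (mean 0.6) meets 30 wins, 30 losses.
blended, direct, w_prior = posterior_mean_as_blend(6, 4, wins=30, losses=30)
print(blended, direct, w_prior)
```

With 60 observations against 10 pseudo-observations, the prior gets one seventh of the weight, so the posterior mean sits much closer to the observed 0.5 than to the prior's 0.6.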
CONJUGATE BETA LEARNING MODEL
Simple, flexible, intuitive.
- Minimal knowledge of IID process
- No scale parameter, location uncertainty is always reduced by data.
- Semi-qualitative decisions. Will the Fed raise rates? Will the correlation between stock and bond returns be positive?
- “Non-parametric” identification of potential return forecasting signals. News coding.
- Working backward to discover implicit priors (possible because of unique solution). Are you crazy?
(Location, Dispersion, Shape)
Factor the conditional probabilities.
- Example: P(μ,σ | D) = P(μ | σ,D) * P(σ | D)
- Where P(σ | D) = ∫ P(μ,σ | D) dμ
Some simple cases have been worked out in closed form.
- Example: Normally-distributed process with unknown mean and variance.
Others require Monte Carlo simulation.
By how much should a large-cap value manager beat the S&P500?
- We want to know both location and scale.
- Assume IID log-normal process Jan 75 – Dec 03.
- Excess kurtosis, predictable variation ignored.
- For convenience, we will use a conjugate prior:
- σ² distributed as scaled inverse chi-squared
- Mean distributed as N(μ, σ²/n).
- Priors for mean (0%), standard deviation (2%) and inverse chi-squared degrees of freedom (12) for σ².
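The closed-form update for this conjugate prior can be sketched as below, using the standard normal / scaled-inverse-chi-squared formulas as given in Gelman et al. The names `mu0`, `kappa0`, `nu0` and `sigma0_sq` follow that textbook's notation rather than the slides. The value `kappa0 = 12` (the prior's weight on the mean, in pseudo-observations) and the six monthly returns are assumptions made for illustration; only the prior mean of 0%, standard deviation of 2% and 12 degrees of freedom come from the text.

```python
from statistics import fmean, variance

def nix2_update(mu0, kappa0, nu0, sigma0_sq, data):
    """Update a normal / scaled-inverse-chi-squared prior with IID normal data.

    Prior: sigma^2 ~ Inv-chi^2(nu0, sigma0_sq), mu | sigma^2 ~ N(mu0, sigma^2/kappa0).
    Returns the posterior parameters in the same form.
    """
    n = len(data)
    ybar = fmean(data)
    s_sq = variance(data)  # sample variance, n - 1 denominator
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * mu0 + n * ybar) / kappa_n
    sigma_n_sq = (nu0 * sigma0_sq
                  + (n - 1) * s_sq
                  + kappa0 * n / kappa_n * (ybar - mu0) ** 2) / nu_n
    return mu_n, kappa_n, nu_n, sigma_n_sq

# Hypothetical monthly excess returns, in percent.
returns = [1.2, -0.8, 0.5, 2.1, -1.5, 0.3]

# Prior mean 0%, prior standard deviation 2% (variance 4), 12 degrees of freedom.
mu_n, kappa_n, nu_n, sigma_n_sq = nix2_update(
    mu0=0.0, kappa0=12, nu0=12, sigma0_sq=4.0, data=returns)
print(mu_n, kappa_n, nu_n, sigma_n_sq)
```

The posterior mean is pulled from the sample average of 0.3% toward the prior mean of 0%, because six months of data carry less weight than the assumed twelve pseudo-observations.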
UPDATING THE DISPERSION
UPDATING THE MEAN
TO PREDICT NEW DATA
- Dpredicted = mean of μ distribution
Full Bayesian estimate:
- Distribution of Dpred ~ ∫∫P(Dpred| μ,σ,D)P(μ,σ | D) dμdσ
- Here, repeat sequential draws of σ, μ|σ and N(μ,σ) until a forecast distribution is formed.
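The sequential draws described above can be sketched with the standard library alone. The posterior parameters passed in are hypothetical placeholders in the normal / scaled-inverse-chi-squared form (location `mu_n`, pseudo-count `kappa_n`, degrees of freedom `nu_n`, scale `sigma_n_sq`). A scaled inverse chi-squared draw is obtained as ν·s² divided by a chi-squared draw, which in turn is a gamma draw with shape ν/2 and scale 2.

```python
import random
from statistics import fmean, stdev

def draw_predictive(mu_n, kappa_n, nu_n, sigma_n_sq, n_draws, seed=0):
    """Simulate the posterior predictive distribution of one new observation."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        chi2 = rng.gammavariate(nu_n / 2.0, 2.0)   # chi-squared with nu_n dof
        sigma_sq = nu_n * sigma_n_sq / chi2        # draw sigma^2 from P(sigma^2 | D)
        mu = rng.gauss(mu_n, (sigma_sq / kappa_n) ** 0.5)  # draw mu | sigma^2, D
        draws.append(rng.gauss(mu, sigma_sq ** 0.5))       # draw new data | mu, sigma
    return draws

# Hypothetical posterior parameters, returns in percent per month.
draws = draw_predictive(mu_n=0.1, kappa_n=18, nu_n=18, sigma_n_sq=3.5,
                        n_draws=20_000)

ordered = sorted(draws)
print("mean:", fmean(draws))
print("std dev:", stdev(draws))
print("5th percentile:", ordered[len(ordered) // 20])
```

The simulated forecast is wider than a normal distribution with the point estimates plugged in, because uncertainty about both μ and σ is carried through to the prediction.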
In naïve Markowitz portfolio optimization of many assets, point covariance estimate projections are inadequate.
CONJUGATE NORMAL INVERSE CHI-SQUARED LEARNING MODEL
Very widely applicable
- If process is close to IID normal or log-normal
- Any univariate data IID case where Central Limit Theorem kicks in.
Dispersion of probability for unknown variance leads posterior for the mean to have Student’s t fat tails.
- Decisions where scale of dispersion is important: Ranking of active managers. Extension: Black-Litterman portfolio optimization
- Where learning can be sped up by the addition of priors to evidence. Extension: Bayesian regression
- Signals when estimate dispersion increases with increasing data: Evaluation of outliers questioning active strategies.
- Sequential decision-making: How many observations are needed to support a model?
Assemble a hierarchy of estimates to better combine group and individual information.
- Basic idea underlying James-Stein estimates and the Ledoit-Wolf approach to better covariance estimation.
Our example: Estimation of future return differences among similar mutual funds.
FORM EXCHANGEABLE GROUP
Exchangeability means not that the mutual funds are the same, but rather that our knowledge of their pertinent factors is the same.
- Style: large cap, mixed (neither strong value nor strong growth), not international, beta between 0.8 and 1.2.
- Stock pickers: number of holdings between 100 and 250.
- Data availability: Morningstar ratings and 8 years of Yahoo Finance monthly return history.
- Independence: First fund listed in fund family satisfying screen.
DATA AND MODEL
- Monthly excess returns net of equal-weighted group average of 14 funds, 96 months ending Nov 2006.
- Fund j sample mean mj of excess log returns ~ N(Θj, σj²/96) [Central Limit Theorem]
- Θj ~ N(µ, τ²)
- “Empirical Bayes.” The σj² are estimated directly from the data, then treated as knowns.
For each value of τ in a wide grid:
- Calculate P(τ | m1, m2, …, σ1², σ2², …)
Draw 10,000 samples of τ from this distribution and for each:
- Draw grand mean µ ~ N(f(τ, m1, m2, …, σ1², σ2², …), g(τ, σ1², σ2², …))
- For each fund
- Calculate Bayesian Θmean and then Θvariance
- Draw possible Θj from N(Θmean j, Θvariance j)
The result is a probability distribution for each Θj which is “shrunk” toward the grand mean by amounts related to individual variances relative to estimated variance of Θ’s.
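The simulation steps above can be sketched as follows. This is not the code behind the presentation: it assumes the standard normal hierarchical model formulas from Gelman et al. with a flat prior on τ, and the five fund means `m` and sampling variances `s2` (each standing for σj²/96) are invented for illustration. The functions f and g of the text correspond here to `mu_hat` and `v_mu`.

```python
import math
import random

def tau_log_posterior(tau, m, s2):
    """Unnormalized log P(tau | m, s2) with a flat prior on tau, plus the
    conditional mean and variance of the grand mean mu given tau."""
    prec = [1.0 / (v + tau * tau) for v in s2]
    v_mu = 1.0 / sum(prec)
    mu_hat = v_mu * sum(p * y for p, y in zip(prec, m))
    log_p = 0.5 * math.log(v_mu)
    for p, y in zip(prec, m):
        log_p += 0.5 * math.log(p) - 0.5 * p * (y - mu_hat) ** 2
    return log_p, mu_hat, v_mu

def sample_thetas(m, s2, tau_grid, n_draws=10_000, seed=0):
    """Draw tau from its grid posterior, then mu | tau, then each theta_j."""
    rng = random.Random(seed)
    log_ps = [tau_log_posterior(t, m, s2)[0] for t in tau_grid]
    top = max(log_ps)
    weights = [math.exp(lp - top) for lp in log_ps]  # rescaled to avoid underflow
    draws = []
    for tau in rng.choices(tau_grid, weights=weights, k=n_draws):
        _, mu_hat, v_mu = tau_log_posterior(tau, m, s2)
        mu = rng.gauss(mu_hat, math.sqrt(v_mu))
        row = []
        for y, v in zip(m, s2):
            theta_var = 1.0 / (1.0 / v + 1.0 / (tau * tau))
            theta_mean = theta_var * (y / v + mu / (tau * tau))
            row.append(rng.gauss(theta_mean, math.sqrt(theta_var)))
        draws.append(row)
    return draws

# Hypothetical data: mean monthly excess return (percent) for five funds and
# the sampling variance of each mean, treated as known.
m = [0.30, 0.10, -0.05, -0.20, 0.15]
s2 = [0.025, 0.020, 0.030, 0.022, 0.018]
tau_grid = [0.005 * k for k in range(1, 201)]  # strictly positive grid

draws = sample_thetas(m, s2, tau_grid)
post_means = [sum(row[j] for row in draws) / len(draws) for j in range(len(m))]
for raw, shrunk in zip(m, post_means):
    print(f"raw {raw:+.2f}  shrunk {shrunk:+.3f}")
```

In this toy run, the funds with the best and worst raw records move furthest toward the group mean, which is the pattern the shrinkage picture is meant to convey.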
SHRINKAGE OF NAÏVE ESTIMATES
These funds were all middle of the road large cap stock pickers.
Is this picture more realistic than unadjusted individual fund records?
What would happen if we included more funds?
What would be a next step?
POTENTIAL HIERARCHICAL APPLICATIONS
Univariate: more sophisticated evaluation of manager skill, and one’s own skill.
- Estimating sector and individual stock characteristics as sources of alpha.
- Ledoit and Wolf covariance shrinkage estimation.
HIERARCHICAL MODEL PERSPECTIVE
Big improvements in estimation error for grouped data.
The 96-observation example gave us an excuse to regard the individual variances as known.
With unknown individual variances, grid-based simulation of so many parameters becomes impractical.
Then we need intelligent sampling – Markov Chain Monte Carlo.
Bayesian Data Analysis, 2nd Edition
- Andrew Gelman, John B. Carlin, Hal S. Stern and Donald B. Rubin
Risk and Asset Allocation
- Attilio Meucci
Probability Theory: the Logic of Science
- Edwin Jaynes