Bayesian Investing Presentation
2011 June 6
I wish I had learned Bayesian probability earlier in my experience as an investor. It has helped me understand why the quantitative methods ordinarily used tend to underestimate risk, and it has also better explained why we tend to overvalue performance information, and how to correct for that to really get at skill. The following is the substance of a PowerPoint presentation I gave in 2006 to a local Boston quantitative investing interest group — QWAFAFEW. Despite, or perhaps because of, its tongue-in-cheek name, this group has been the source of a valuable exchange of ideas in the Boston investment community.
Bayesian & Qualitative Approaches to Quantitative Investing
Jarrod Wilcox
QWAFAFEW
December 12, 2006
BAYESIAN INFERENCE FOR QUANTITATIVE INVESTORS
In the super-competitive investment arena:
- Changing environments limit relevant data.
- Scientific consensus is not necessarily rewarding.
We can still benefit from:
- Scientific reasoning
- Reduced impact of emotion & cognitive error
- Optimal learning from data
- Private Bayesian priors.
FAMILIAR INVESTMENT APPLICATIONS
Bayesian alpha adjustment:
- Black-Litterman: shrinks excess return point estimates toward a CAPM-based prior.
Bayesian portfolio risk estimates:
- Ledoit & Wolf: shrinks covariance point estimates toward empirically-based priors.
- Michaud: incorporates uncertainty around covariance point estimates.
TIMELINE
Dawn of probabilistic reasoning:
- Bernoulli, Bayes, Laplace (1700s to 1800s): games of chance, probability as odds.
Classical statistics and probability:
- Fisher, Neyman, Pearson, etc. (early 1900s): probability as frequency.
- Extensions to multi-step processes, e.g. Feller (1900s).
- Kolmogorov: rigorous axioms.
Bayesian rebellion against frequentists:
- Polya, Cox, Jeffreys, Jaynes, Savage, Raiffa & Schlaifer: decisions where data is limited (mid 1900s).
The new Bayesians:
- “Empirical Bayes,” hierarchical estimation (example: Stein-James shrinkage), distributed predictions (mid to late 1900s).
- Iterative techniques: Markov Chain Monte Carlo simulation (1990s to present).
PROBABILITY: THE LOGIC OF SCIENCE
Probability axioms about A (assertions) and D (data):
- A iff P(A)=1, (not A) iff P(A)=0
- P(A1 or A2) = P(A1) + P(A2) – P(A1 and A2)
- P(A1 and A2) = P(A1) * P(A2 | A1)
From which Bayes Rule logically follows:
- P(A | D) = P(A) * P(D | A) / P(D)
- Posterior probability = prior * likelihood / normalization
- Normalization constant P(D) is the sum of P(D | Ai) * P(Ai) over all mutually exclusive and exhaustive Ai.
We always have a prior P(A), even if it is uninformative.
SIMPLE DISCRETE EXAMPLE
Situation:
- Two fair dice are rolled, and the sum of their faces is 8. What is the probability that a 2 and a 6 are present?
Posterior = Prior * Likelihood / Normalization
P(A|D) = P(A) * P(D|A) / P(D)
- P(A) = 1/36 + 1/36 = 2/36
- P(D|A) = 1
- P(D) = 2/36 + 2/36 + 1/36 = 5/36
P(A|D) = (2/36 * 1) / (5/36) = 2/5
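As a sanity check, the same fractions fall out of brute-force enumeration. A minimal sketch in Python, enumerating all 36 equally likely ordered rolls:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely ordered rolls of two fair dice.
rolls = list(product(range(1, 7), repeat=2))

# A: "a 2 and a 6 are present"; D: "the faces sum to 8".
is_A = [set(r) == {2, 6} for r in rolls]
is_D = [sum(r) == 8 for r in rolls]

p_A = Fraction(sum(is_A), len(rolls))   # prior: 2/36
p_D = Fraction(sum(is_D), len(rolls))   # normalization: 5/36
p_D_given_A = Fraction(sum(a and d for a, d in zip(is_A, is_D)), sum(is_A))  # likelihood: 1

print(p_A * p_D_given_A / p_D)          # Fraction(2, 5)
```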
MAKING A QUALITATIVE DECISION MORE QUANTITATIVE
Should I invest in Canadian farmland?
- Higher future use of foodstuffs for energy; growing food consumption in China.
- Global warming impact?
Can probability of a big payoff be posed?
- Probability of being right when the market is wrong given record of my past similar cosmic predictions.
- Confidence that in this case I am right and market is wrong.
Would scenario analysis be helpful?
SINGLE-PARAMETER CONTINUOUS DISTRIBUTION
Binomial-generated probability Θ of cumulative “wins” W versus “losses” L:
- Use the beta distribution with α=W+1, β=L+1.

- Convenient conjugate property: the posterior has the same form as the prior, so the prior can be interpreted as additional win-loss data.
- Updating a Beta(α, β) prior with W new wins and L new losses gives a Beta(α+W, β+L) posterior.
- Beta(1,1) is an uninformative (uniform) prior.
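A minimal sketch of this conjugate update in Python. It assumes scipy is available, and the win/loss counts are hypothetical placeholders rather than the actual series used in the example that follows:

```python
from scipy import stats

# Uninformative prior: Beta(1, 1), i.e. uniform on [0, 1].
alpha, beta = 1.0, 1.0

# Hypothetical evidence: months in which "value" beat "growth" vs. not (placeholder counts).
wins, losses = 190, 158

# Conjugate update: just add the counts to the prior parameters.
posterior = stats.beta(alpha + wins, beta + losses)

print(posterior.mean())          # posterior estimate of the win probability
print(posterior.interval(0.90))  # 90% credible interval for theta
```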
EXAMPLE
What is the probability of value strategies outperforming growth strategies next month?
- Data: monthly returns for the S&P 500 value and growth sub-indices as maintained by BARRA from Jan 75 through Dec 03.
- Assumption: an uninformative prior at the start, and no knowledge of structure such as autocorrelation.
WHAT IF MONTHS WERE EXCHANGEABLE? ARE VALUE MONTHS MORE FREQUENT?

WHEN PRIORS MEET LIKELIHOOD

The mean of the posterior density is a weighted average of the means of the prior and likelihood densities.
The weights are proportional to the relative precision of the two estimates.
Note that in this single-parameter model, more data always leads to less dispersion of probability.
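In the beta case that weighting can be made explicit: the posterior mean is an average of the prior mean and the sample frequency, weighted by their respective (pseudo-)observation counts. A small numeric check with made-up counts:

```python
# Prior Beta(a0, b0) and data with W wins, L losses (placeholder numbers).
a0, b0 = 4.0, 4.0   # prior worth 8 pseudo-observations, prior mean 0.5
W, L = 30, 10       # observed data, sample frequency 0.75

prior_mean = a0 / (a0 + b0)
data_mean = W / (W + L)

# Exact posterior mean from the conjugate update ...
posterior_mean = (a0 + W) / (a0 + b0 + W + L)

# ... equals the average weighted by the respective observation counts.
w_prior = (a0 + b0) / (a0 + b0 + W + L)
weighted = w_prior * prior_mean + (1 - w_prior) * data_mean

print(posterior_mean, weighted)  # both 0.7083...
```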
CONJUGATE BETA LEARNING MODEL
Simple, flexible, intuitive.
Assumes
- Minimal knowledge of IID process
- No scale parameter, location uncertainty is always reduced by data.
Potential applications
- Semi-qualitative decisions. Will the Fed raise rates? Will the correlation between stock and bond returns be positive?
- “Non-parametric” identification of potential return forecasting signals. News coding.
- Working backward to discover implicit priors (possible because of unique solution). Are you crazy?
MULTI-PARAMETER DENSITY
(Location, Dispersion, Shape)
Factor the conditional probabilities.
- Example: P(μ,σ | D) = P(μ | σ,D) * P(σ | D)
- Where P(σ | D) = ∫ P(μ,σ | D) dμ
Some simple cases have been worked out in closed form.
- Example: Normally-distributed process with unknown mean and variance.
Others require Monte Carlo simulation.
PERFORMANCE PROJECTION
By how much should a large-cap value manager beat the S&P500?
- We want to know both location and scale.
- Assume IID log normal process Jan 75 – Dec 03.
- Excess kurtosis, predictable variation ignored.
- For convenience, we will use a conjugate prior:
- σ² distributed as scaled inverse chi-squared.
- Mean distributed as N(μ0, σ²/n0), where μ0 and n0 are the prior mean and prior effective sample size.
- Prior settings: mean 0%, standard deviation 2%, and 12 degrees of freedom for the scaled inverse chi-squared on σ².
UPDATING THE DISPERSION

UPDATING THE MEAN

TO PREDICT NEW DATA
Point estimate:
- Dpredicted = mean of μ distribution
Full Bayesian estimate:
- Distribution of Dpred ~ ∫∫P(Dpred| μ,σ,D)P(μ,σ | D) dμdσ
- Here, repeat sequential draws of σ, μ|σ and N(μ,σ) until a forecast distribution is formed.
In naïve Markowitz portfolio optimization of many assets, point covariance estimate projections are inadequate.
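Returning to the sequential-draw recipe above, here is a minimal sketch in Python. The prior settings (0% mean, 2% standard deviation, 12 degrees of freedom) come from the earlier slide; the prior effective sample size for the mean and the monthly return series itself are made-up placeholders rather than the actual Jan 75 to Dec 03 data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data: monthly excess log returns (NOT the actual S&P/BARRA series).
y = rng.normal(0.002, 0.04, size=348)
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)

# Prior from the earlier slide: mean 0%, sd 2%, 12 degrees of freedom.
mu0, kappa0 = 0.0, 12          # prior location; prior effective sample size (assumed = 12)
nu0, sigma0_sq = 12, 0.02**2   # prior df and scale for the scaled inverse chi-squared

# Conjugate normal / scaled-inverse-chi-squared posterior parameters.
kappa_n = kappa0 + n
mu_n = (kappa0 * mu0 + n * ybar) / kappa_n
nu_n = nu0 + n
sigma_n_sq = (nu0 * sigma0_sq + (n - 1) * s2
              + kappa0 * n / kappa_n * (ybar - mu0) ** 2) / nu_n

# Sequential draws: sigma^2, then mu | sigma^2, then a predicted observation.
draws = 10_000
sigma_sq = nu_n * sigma_n_sq / rng.chisquare(nu_n, size=draws)  # scaled inverse chi-squared
mu = rng.normal(mu_n, np.sqrt(sigma_sq / kappa_n))
d_pred = rng.normal(mu, np.sqrt(sigma_sq))                      # posterior predictive draws

print(d_pred.mean(), np.percentile(d_pred, [5, 95]))
```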
CONJUGATE NORMAL INVERSE CHI-SQUARED LEARNING MODEL
Very widely applicable
- If process is close to IID normal or log-normal
- Any univariate data IID case where Central Limit Theorem kicks in.
Dispersion of probability for the unknown variance leads the posterior for the mean to have Student's t fat tails.
Applications
- Decisions where the scale of dispersion is important: ranking of active managers. Extension: Black-Litterman portfolio optimization.
- Where learning can be sped up by adding priors to evidence. Extension: Bayesian regression.
- Signals when estimate dispersion increases with increasing data: Evaluation of outliers questioning active strategies.
- Sequential decision-making: How many observations are needed to support a model?
HIERARCHICAL ESTIMATION
Assemble a hierarchy of estimates to better combine group and individual information.
- Basic idea underlying Stein-James estimates and Ledoit-Wolf approach to better covariance estimation.
Our example: Estimation of future return differences among similar mutual funds.
FORM EXCHANGEABLE GROUP
Exchangeability means not that the mutual funds are the same, but rather that our knowledge of their pertinent factors is the same.
Morningstar screen:
- Style: large cap, mixed (neither strong value nor strong growth), not international, beta between 0.8 and 1.2.
- Stock pickers: number of holdings between 100 and 250.
- Data availability: Morningstar ratings and 8 years of Yahoo Finance monthly return history.
- Independence: First fund listed in fund family satisfying screen.
DATA AND MODEL
Data:
- Monthly excess returns, net of the equal-weighted group average, for 14 funds; 96 months ending Nov 2006.
Model:
- Fund j's sample mean excess log return mj ~ N(Θj, σj²/96) [Central Limit Theorem]
- Θj ~ N(µ, τ²)
- “Empirical Bayes”: the σj² are estimated directly from the data, then treated as known.
ESTIMATION PROCESS
For each value of τ in a wide grid:
- Calculate P(τ | m1, m2, …, σ1², σ2², …).
Draw 10,000 samples of τ from this distribution, and for each:
- Draw the grand mean µ ~ N(f(τ, m1, m2, …, σ1², σ2², …), g(τ, σ1², σ2², …)).
- For each fund:
  - Calculate the Bayesian posterior mean and variance of Θj.
  - Draw a possible Θj from N(posterior mean j, posterior variance j).
The result is a probability distribution for each Θj which is “shrunk” toward the grand mean by amounts related to individual variances relative to estimated variance of Θ’s.
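A minimal sketch of this grid-then-simulate recipe in Python, using the standard hierarchical normal model with known sampling variances (as in Gelman et al.). The fund means and variances below are randomly generated placeholders, not the actual Morningstar/Yahoo Finance data, and a flat prior on τ is assumed:

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder inputs: 14 funds' mean monthly excess log returns m_j and their
# sampling variances sigma_j^2 / 96, treated as known.
m = rng.normal(0.0, 0.002, size=14)
v = (rng.uniform(0.01, 0.03, size=14) ** 2) / 96

def mu_hat_and_var(tau):
    """Conditional posterior mean and variance of the grand mean mu, given tau."""
    w = 1.0 / (v + tau**2)
    return np.sum(w * m) / np.sum(w), 1.0 / np.sum(w)

# Marginal posterior of tau on a wide grid (flat prior on tau assumed).
tau_grid = np.linspace(1e-6, 0.01, 500)
log_p = np.empty_like(tau_grid)
for i, tau in enumerate(tau_grid):
    mu_hat, V_mu = mu_hat_and_var(tau)
    log_p[i] = (0.5 * np.log(V_mu)
                - 0.5 * np.sum(np.log(v + tau**2))
                - 0.5 * np.sum((m - mu_hat) ** 2 / (v + tau**2)))
p = np.exp(log_p - log_p.max())
p /= p.sum()

# Simulate: tau, then mu | tau, then each fund's theta_j.
draws = 10_000
tau_s = rng.choice(tau_grid, size=draws, p=p)
theta = np.empty((draws, len(m)))
for k, tau in enumerate(tau_s):
    mu_hat, V_mu = mu_hat_and_var(tau)
    mu = rng.normal(mu_hat, np.sqrt(V_mu))
    post_var = 1.0 / (1.0 / v + 1.0 / tau**2)
    post_mean = post_var * (m / v + mu / tau**2)
    theta[k] = rng.normal(post_mean, np.sqrt(post_var))

print(np.column_stack([m, theta.mean(axis=0)]))  # naive vs. shrunken estimates
```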
SHRINKAGE OF NAÏVE ESTIMATES

These funds were all middle of the road large cap stock pickers.
Is this picture more realistic than unadjusted individual fund records?
What would happen if we included more funds?
What would be a next step?
POTENTIAL HIERARCHICAL APPLICATIONS
Univariate: more sophisticated evaluation of manager skill, and one’s own skill.
Multivariate Extensions:
- Estimating sector and individual stock characteristics as sources of alpha.
- Ledoit and Wolf covariance shrinkage estimation.
What else?
HIERARCHICAL MODEL PERSPECTIVE
Big improvements in estimation error for grouped data.
The 96-observation example gave us an excuse to regard the individual variances as known.
With unknown individual variances, grid-based simulation of so many parameters becomes impractical.
Then we need intelligent sampling – Markov Chain Monte Carlo.
RECOMMENDED TEXTS
Bayesian Data Analysis, 2nd Edition
- Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin
Risk and Asset Allocation
- Attilio Meucci
Probability Theory: The Logic of Science
- Edwin T. Jaynes
I think frequentist statistics have advantages and disadvantages, just as Bayesian statistics do. Frequentist methods are the most widely used because they make difficult problems and models tractable using scalar statistics, and they produce direct inferences that, although relying heavily on asymptotic distributions, give results everyone can agree on. The difficulty of properly modeling the prior distribution means discussion often shifts to that step of the inference rather than to its results. Also, the frequentist definition of probability is more intuitive and gives a clear meaning to a statement like p = 0.68; a “degree of belief” is difficult to interpret and to compare with experiment.
Many difficulties in frequentist statistics also arise from the fact that its methods were developed in the early twentieth century. The lack of computing power made it possible to use only simple statistics such as the mean, variance, and kurtosis, and to compare the observed value with a table. Bayesian statistics could not be properly computed in the complicated cases, and only with the arrival of tractable numerical approximations did Bayesian methods become competitive. I think that with new research into more general and computationally intensive frequentist inference methods, many of the problems of that approach can be resolved, at least in part.
To be fair, I like some points of Bayesian statistics, especially that probabilities are related not to inherent randomness but to our ignorance of the causes of phenomena. That makes more sense to me than real randomness in nature. It also makes inference straightforward via Bayes' theorem. I feel both systems have strong points and weak points. Maybe in the future a new inference method will be discovered that possesses the best characteristics of both and surpasses them.
Patricia
Thank you for your thoughtful comment. For me, Bayesian logic is just a system for logical inference about partial truths. It is useful for decision-making for the individual or for groups who can agree on priors and evidence. It is indifferent as to whether one thinks of its subject as subjective or objective. On the other hand, the scientific method, with repeatable experiments, is all about transferring “objective” knowledge across people. Frequentist probability is useful as an adjunct because it assists in that process even where the knowledge is imprecise. But because it is founded on the avoidance of disagreement, the reasoning can sometimes get rather tortured.
At any rate, that is my view. But others with different priors will have different views.
Jarrod