
Bayesian Investing Presentation

2011 June 6
I wish I had learned Bayesian probability earlier in my experience as an investor.  It has helped me understand why the quantitative methods ordinarily used tend to underestimate risk, clarified why we tend to overvalue performance information, and shown how to correct for that to really get at skill.  The following is the substance of a PowerPoint presentation I gave in 2006 to a local Boston quantitative investing interest group, QWAFAFEW.  Despite, or perhaps because of, its tongue-in-cheek name, this group has been a source of valuable exchanges of ideas in the Boston investment community.

Bayesian & Qualitative Approaches to Quantitative Investing

Jarrod Wilcox
QWAFAFEW
December 12, 2006

BAYESIAN INFERENCE FOR QUANTITATIVE INVESTORS

In the super-competitive investment arena:

  • Changing environments limit relevant data.
  • Scientific consensus is not necessarily rewarding.

We can still benefit from:

  • Scientific reasoning
  • Reduced impact of emotion & cognitive error
  • Optimal learning from data
  • Private Bayesian priors.

FAMILIAR INVESTMENT APPLICATIONS

Bayesian alpha adjustment

  • Black-Litterman: shrinks excess return point
    estimates toward CAPM-based prior.

Bayesian portfolio risk estimates:

  • Ledoit & Wolf: shrinks covariance point estimates toward empirically-based priors.
  • Michaud: incorporates uncertainty around covariance point estimates.

TIMELINE

Dawn of probabilistic reasoning:

  • Bernoulli, Bayes, Laplace (1700’s to 1800’s): games of chance; probability as odds.

Classical statistics and probability:

  • Fisher, Neyman, Pearson, etc. (early 1900’s): probability as frequency.
  • Extensions to multi-step processes, for example Feller (1900’s).
  • Kolmogorov, rigorous axioms.

Bayesian rebellion against frequentists:

  • Polya, Cox, Jeffreys, Jaynes, Savage, Raiffa & Schlaifer: decisions where data is limited (mid 1900’s)

The new Bayesians:

  • “Empirical Bayes”, hierarchical estimation (example: Stein-James shrinkage), distributed predictions (mid-late 1900’s)
  • Iterative techniques: Markov Chain Monte Carlo simulation (1990’s to present).

PROBABILITY: THE LOGIC OF SCIENCE

Probability axioms about A (assertions) and D (data):

  • A iff P(A)=1, (not A) iff P(A)=0
  • P(A1 or A2) = P(A1) + P(A2) – P(A1 and A2)
  • P(A1 and A2) = P(A1) * P(A2 | A1)

From which Bayes Rule logically follows:

  • P(A | D) = P(A) * P(D | A) / P(D)
  • Posterior probability = prior * likelihood / normalization
  • Normalization constant P(D) is the sum of P(D | Ai) * P(Ai) over all
    mutually exclusive and exhaustive Ai.

We always have a prior P(A), even if it is uninformative.

SIMPLE DISCRETE EXAMPLE

Situation:

  • Two fair dice are rolled, and the sum of their faces is 8. What is
    the probability that a 2 and a 6 are present?

Posterior = Prior * Likelihood / Normalization

P(A|D) = P(A) * P(D|A) / P(D)

  • P(A) = 1/36 + 1/36 = 2/36
  • P(D|A) = 1
  • P(D) = 2/36 + 2/36 + 1/36 = 5/36

P(A|D) = 2/5
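
As a quick check on that arithmetic, here is a small enumeration in Python (illustrative code, not part of the original slides):

    from fractions import Fraction
    from itertools import product

    # All 36 equally likely ordered rolls of two fair dice.
    rolls = list(product(range(1, 7), repeat=2))

    # D: the observed data -- the two faces sum to 8.
    d = [r for r in rolls if sum(r) == 8]

    # A: the assertion -- a 2 and a 6 are present.
    a_and_d = [r for r in d if set(r) == {2, 6}]

    # P(A | D) by direct enumeration; agrees with the Bayes rule calculation above.
    print(Fraction(len(a_and_d), len(d)))   # 2/5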

MAKING A QUALITATIVE DECISION MORE QUANTITATIVE

Should I invest in Canadian farmland?
  • Higher future use of foodstuffs for energy; growing food consumption in China
  • Global warming impact?

Can probability of a big payoff be posed?

  • Probability of being right when the market is wrong, given the record of my past similar cosmic predictions.
  • Confidence that in this case I am right and market is wrong.

Would scenario analysis be helpful?

SINGLE-PARAMETER CONTINUOUS DISTRIBUTION

Binomial process with unknown probability Θ of a “win,” observed as cumulative “wins” W and “losses” L:
    • Use the beta distribution with α=W+1, β=L+1.
[Figure: the beta distribution]
  • Convenient conjugate property: the posterior has the same form as the prior. The prior can be interpreted as additional win-loss data.
  • Combining a Beta(α1, β1) prior with W additional wins and L additional losses gives a Beta(α1+W, β1+L) posterior.
  • Beta(1,1) is an uninformative prior.
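
A minimal sketch of the conjugate update in Python using SciPy's beta distribution; the 30-20 win-loss record is a made-up illustration, not a number from the slides:

    from scipy.stats import beta

    # Start from the uninformative Beta(1, 1) prior.
    a0, b0 = 1.0, 1.0

    # Hypothetical record: 30 "wins" and 20 "losses" observed so far.
    wins, losses = 30, 20

    # Conjugate update: the prior behaves like additional win-loss data.
    posterior = beta(a0 + wins, b0 + losses)

    print(posterior.mean())           # posterior mean of the win probability Θ
    print(posterior.interval(0.90))   # central 90% credible interval for Θ
    print(1.0 - posterior.cdf(0.5))   # P(Θ > 0.5): wins more likely than losses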

EXAMPLE

What is the probability of value strategies outperforming growth strategies next month?
  • Data: monthly returns for the S&P 500 value and growth sub-indices as maintained by BARRA from Jan 75 through Dec 03.
  • Assumption: an uninformative prior at the beginning, and no knowledge of structure such as autocorrelation.

WHAT IF MONTHS WERE EXCHANGEABLE? ARE VALUE MONTHS MORE FREQUENT?

[Figure: beta posterior for the frequency of value months, updated as the data accumulate]

WHEN PRIORS MEET LIKELIHOOD

[Figure: prior, likelihood, and posterior densities]
The mean of the posterior density is a weighted average of the means of the prior and the likelihood.
The weights are proportional to the relative precisions of the two estimates.
Note that in this single-parameter model, more data always leads to less dispersion of probability.
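
For the beta model the same idea holds exactly in terms of effective observation counts: a Beta(a0, b0) prior carries a0 + b0 pseudo-observations, so with W wins in n new trials the posterior mean is a count-weighted average of the prior mean and the sample frequency. A quick numerical check, with made-up counts:

    # Beta(a0, b0) prior and hypothetical data: W wins out of n trials.
    a0, b0 = 4.0, 8.0
    W, n = 30, 50

    posterior_mean = (a0 + W) / (a0 + b0 + n)

    # Weighted average of the prior mean and the sample frequency,
    # with weights equal to the prior pseudo-count (a0 + b0) and n.
    prior_mean, sample_freq = a0 / (a0 + b0), W / n
    weighted = ((a0 + b0) * prior_mean + n * sample_freq) / (a0 + b0 + n)

    print(posterior_mean, weighted)   # both equal 34/62, about 0.548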

CONJUGATE BETA LEARNING MODEL

Simple, flexible, intuitive.

Assumes

  • Minimal knowledge of an IID process.
  • No scale parameter; location uncertainty is always reduced by data.

Potential applications

  • Semi-qualitative decisions. Will the Fed raise rates? Will the correlation between stock and bond returns be positive?
  • “Non-parametric” identification of potential return forecasting signals. News coding.
  • Working backward to discover implicit priors (possible because of unique solution). Are you crazy?

MULTI-PARAMETER DENSITY

(Location, Dispersion, Shape)

Factor the conditional probabilities.

  • Example: P(μ,σ | D) = P(μ | σ,D) * P(σ | D)
    • Where P(σ | D) = ∫ P(μ,σ | D) dμ

Some simple cases have been worked out in closed form.

  • Example: Normally-distributed process with unknown mean and variance.

Others require Monte Carlo simulation.

PERFORMANCE PROJECTION

By how much should a large-cap value manager beat the S&P500?

  • We want to know both location and scale.
  • Assume IID log normal process Jan 75 – Dec 03.
    • Excess kurtosis, predictable variation ignored.
  • For convenience, we will use a conjugate prior:
    • σ² distributed as scaled inverse chi-squared
    • Mean distributed as N(μ, σ²/n).
  • Priors for mean (0%), standard deviation (2%) and
    inverse chi-squared degrees of freedom (12) for σ².
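
Below is a sketch of the conjugate update itself, following the standard normal-inverse-chi-squared formulas (as in Gelman et al.). The prior values mirror the slide; the sample mean and standard deviation are placeholders rather than the actual BARRA series, and the prior pseudo-count kappa0 behind the prior mean is an assumption the slide does not state:

    import numpy as np

    # --- Prior (monthly log excess returns) ---
    mu0 = 0.000        # prior mean: 0%
    sigma0 = 0.02      # prior standard deviation guess: 2%
    nu0 = 12           # prior degrees of freedom for the scaled inverse chi-squared
    kappa0 = 12        # ASSUMED pseudo-observations behind the prior mean

    # --- Hypothetical sample statistics (placeholders, not the BARRA data) ---
    n = 348            # Jan 75 through Dec 03 is 348 months
    ybar = 0.001       # hypothetical sample mean excess log return
    s = 0.015          # hypothetical sample standard deviation

    # --- Conjugate update ---
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * mu0 + n * ybar) / kappa_n
    nu_sigma2_n = (nu0 * sigma0**2 + (n - 1) * s**2
                   + kappa0 * n / kappa_n * (ybar - mu0)**2)
    sigma2_n = nu_sigma2_n / nu_n

    # Posterior: sigma^2 ~ Scaled-Inv-Chi2(nu_n, sigma2_n),
    #            mu | sigma^2 ~ N(mu_n, sigma^2 / kappa_n)
    print(mu_n, np.sqrt(sigma2_n))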

UPDATING THE DISPERSION

[Figure: updating the posterior for tracking error]

UPDATING THE MEAN

[Figure: updating the posterior for the mean]

TO PREDICT NEW DATA

Point estimate:

  • Dpredicted = mean of μ distribution

Full Bayesian estimate:

  • Distribution of Dpred ~ ∫∫ P(Dpred | μ,σ,D) P(μ,σ | D) dμ dσ
  • Here, repeat sequential draws of σ, then of μ | σ, then of a new observation from N(μ,σ²), until a forecast distribution is formed.
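
A minimal simulation of those sequential draws, starting from posterior parameters of the kind produced by the update above (the numerical values are placeholders):

    import numpy as np

    rng = np.random.default_rng(0)

    # Posterior parameters (placeholder values in the spirit of the update above).
    mu_n, kappa_n, nu_n, sigma2_n = 0.001, 360, 360, 0.015**2

    draws = 10_000
    # 1. Draw sigma^2 from its scaled inverse chi-squared posterior.
    sigma2 = nu_n * sigma2_n / rng.chisquare(nu_n, size=draws)
    # 2. Draw mu given sigma^2.
    mu = rng.normal(mu_n, np.sqrt(sigma2 / kappa_n))
    # 3. Draw a new monthly observation given mu and sigma^2.
    d_pred = rng.normal(mu, np.sqrt(sigma2))

    # The collected d_pred values form the Bayesian forecast distribution.
    print(d_pred.mean(), d_pred.std())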

In naïve Markowitz portfolio optimization of many assets, projections based on point estimates of the covariance matrix are inadequate.

CONJUGATE NORMAL INVERSE CHI-SQUARED LEARNING MODEL

Very widely applicable

  • If the process is close to IID normal or log-normal.
  • Any univariate IID data case where the Central Limit Theorem kicks in.

Dispersion of probability for the unknown variance leads the posterior for the mean to have Student’s t fat tails.
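
One way to see those fat tails: the exact marginal posterior for the mean is Student's t, and its tail probabilities can be compared with a normal of the same scale (the parameter values below are placeholders):

    from scipy import stats

    # Placeholder posterior parameters; the exact marginal for the mean mu
    # is Student's t with nu_n degrees of freedom, location mu_n,
    # and scale sqrt(sigma2_n / kappa_n).
    mu_n, kappa_n, nu_n, sigma2_n = 0.0, 12, 12, 0.02**2
    scale = (sigma2_n / kappa_n) ** 0.5

    t_marginal = stats.t(df=nu_n, loc=mu_n, scale=scale)
    normal = stats.norm(loc=mu_n, scale=scale)

    # Chance of a surprise at least 3 scale units above the center.
    x = mu_n + 3.0 * scale
    print(t_marginal.sf(x), normal.sf(x))   # the t tail is several times fatter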

Applications

  • Decisions where the scale of dispersion is important: ranking of active managers; extension to Black-Litterman portfolio optimization.
  • Where learning can be sped up by adding priors to evidence: extension to Bayesian regression.
  • Signals when estimate dispersion increases with additional data: evaluation of outliers that call active strategies into question.
  • Sequential decision-making: how many observations are needed to support a model?

HIERARCHICAL ESTIMATION

Assemble a hierarchy of estimates to better combine group and individual information.

  • Basic idea underlying Stein-James estimates and Ledoit-Wolf approach to better covariance estimation.

Our example: Estimation of future return differences among similar mutual funds.

FORM EXCHANGEABLE GROUP

Exchangeability means not that the mutual funds are the same, but rather that our knowledge of their pertinent factors is the same.

Morningstar screen:

  • Style: large cap, mixed (neither strong value nor strong growth), not international, beta between 0.8 and 1.2.
  • Stock pickers: number of holdings between 100 and 250.
  • Data availability: Morningstar ratings and 8 years of Yahoo Finance monthly return history.
  • Independence: First fund listed in fund family satisfying screen.

DATA AND MODEL

Data:

  • Monthly excess returns net of the equal-weighted group average of 14 funds, 96 months ending Nov 2006.

Model:

  • Fund j sample mean excess log return mj ~ N(Θj, σj²/96) [Central Limit Theorem]
  • Θj ~ N(µ, τ²)
  • “Empirical Bayes”: the σj² are estimated directly from the data, then treated as known.

ESTIMATION PROCESS

For each value of τ in a wide grid:

  • Calculate P(τ | m1, m2, …, σ1², σ2², …)

Draw 10,000 samples of τ from this distribution and for each:

  • Draw the grand mean µ ~ N( f(τ, m1, m2, …, σ1², σ2², …), g(τ, σ1², σ2², …) )
  • For each fund
    • Calculate the Bayesian mean and variance of Θj
    • Draw a possible Θj from N(mean of Θj, variance of Θj)

The result is a probability distribution for each Θj which is “shrunk” toward the grand mean by amounts related to the individual variances relative to the estimated variance of the Θ’s.
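
Here is a sketch of that recipe for the normal hierarchical model with known individual variances, following the standard grid-plus-simulation approach described in Gelman et al. The fund means and standard errors are random placeholders, since the Morningstar and Yahoo Finance data are not reproduced here, and a flat prior on τ is assumed:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical inputs: sample mean excess log returns m_j for 14 funds
    # and their (assumed known) squared standard errors sigma_j^2 / 96.
    J = 14
    m = rng.normal(0.0, 0.002, size=J)          # placeholder fund means
    se2 = np.full(J, 0.002**2)                  # placeholder squared standard errors

    # Grid over tau, the between-fund standard deviation of the true Theta_j.
    tau_grid = np.linspace(1e-6, 0.01, 200)

    def log_p_tau(tau):
        """Log of P(tau | data) up to a constant, with a flat prior on (mu, tau)."""
        v = se2 + tau**2
        mu_hat = np.sum(m / v) / np.sum(1.0 / v)
        v_mu = 1.0 / np.sum(1.0 / v)
        return (0.5 * np.log(v_mu)
                - 0.5 * np.sum(np.log(v))
                - 0.5 * np.sum((m - mu_hat)**2 / v))

    log_post = np.array([log_p_tau(t) for t in tau_grid])
    post = np.exp(log_post - log_post.max())
    post /= post.sum()

    # Draw 10,000 taus from the grid, then mu | tau, then Theta_j | mu, tau.
    n_draws = 10_000
    tau = rng.choice(tau_grid, size=n_draws, p=post)
    theta = np.empty((n_draws, J))
    for i, t in enumerate(tau):
        v = se2 + t**2
        mu_hat = np.sum(m / v) / np.sum(1.0 / v)
        v_mu = 1.0 / np.sum(1.0 / v)
        mu = rng.normal(mu_hat, np.sqrt(v_mu))
        # Conditional posterior for each fund: a precision-weighted combination.
        theta_mean = (m / se2 + mu / t**2) / (1.0 / se2 + 1.0 / t**2)
        theta_var = 1.0 / (1.0 / se2 + 1.0 / t**2)
        theta[i] = rng.normal(theta_mean, np.sqrt(theta_var))

    # Shrunken point estimates versus the naive sample means.
    print(np.column_stack([m, theta.mean(axis=0)]))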

SHRINKAGE OF NAÏVE ESTIMATES

[Figure: shrinkage of naïve fund estimates toward the grand mean]
These funds were all middle of the road large cap stock pickers.
Is this picture more realistic than unadjusted individual fund records?
What would happen if we included more funds?
What would be a next step?

POTENTIAL HIERARCHICAL APPLICATIONS

Univariate: more sophisticated evaluation of manager skill, and one’s own skill.

Multivariate Extensions:

  • Estimating sector and individual stock characteristics as sources of alpha.
  • Ledoit and Wolf covariance shrinkage estimation.

What else?

HIERARCHICAL MODEL PERSPECTIVE

Big improvements in estimation error for grouped data.

The 96-observation example gave us an excuse to regard the individual variances as known.

With unknown individual variances, grid-based simulation of so many parameters becomes impractical.

Then we need intelligent sampling – Markov Chain Monte Carlo.
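
As a flavor of what that involves, here is a minimal random-walk Metropolis sketch for a toy two-parameter posterior, the mean and log standard deviation of normal data; real hierarchical applications apply the same idea, or Gibbs-style samplers, across many more parameters:

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(0.001, 0.015, size=96)    # toy monthly return data

    def log_post(mu, log_sigma):
        """Log posterior with flat priors on mu and log_sigma (toy example)."""
        sigma = np.exp(log_sigma)
        return -len(data) * log_sigma - 0.5 * np.sum((data - mu)**2) / sigma**2

    # Random-walk Metropolis: propose a jump, accept with probability
    # min(1, posterior ratio); otherwise stay put.
    current = np.array([0.0, np.log(0.02)])
    current_lp = log_post(*current)
    samples = []
    for _ in range(20_000):
        proposal = current + rng.normal(0.0, [0.002, 0.05])
        proposal_lp = log_post(*proposal)
        if np.log(rng.uniform()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        samples.append(current)

    samples = np.array(samples[5_000:])          # drop burn-in draws
    print(samples[:, 0].mean(), np.exp(samples[:, 1]).mean())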

RECOMMENDED TEXTS

Bayesian Data Analysis, 2nd Edition

  • Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin

Risk and Asset Allocation

  • Attilio Meucci

Probability Theory: The Logic of Science

  • Edwin Jaynes