This R session will illustrate the practical implementation and use of factor models.

The R package covFactorModel, available in GitHub, is recommended.

(Useful R links: Cookbook R, Quick-R, R documentation, CRAN, METACRAN.)

Macroeconomic factor model with single market factor

We will start with a simple example consisting of a single known factor (i.e., the market index). The model is \[ \mathbf{x}_t = \boldsymbol{\alpha} + \boldsymbol{\beta}f_t + \boldsymbol{\epsilon}_t, \quad t=1,\ldots,T \] where the explicit factor \(f_t\) is the S&P 500 index. We will do a simple least squares (LS) regression to estimate the intercept \(\boldsymbol{\alpha}\) and the loading beta \(\boldsymbol{\beta}\): \[ \underset{\boldsymbol{\alpha},\boldsymbol{\beta}}{\text{minimize}}\quad \sum_{t=1}^T\|\mathbf{x}_t - \boldsymbol{\alpha} - \boldsymbol{\beta}f_t\|^2 \]

Most the code lines go into preparing the data rather than actually performing the factor modeling. Let’s start getting the data ready:

library(xts)
library(quantmod)

# set begin-end date and stock namelist
begin_date <- "2016-01-01"
end_date <- "2017-12-31"
stock_namelist <- c("AAPL", "AMD", "ADI",  "ABBV", "AEZS", "A",  "APD", "AA","CF")
sector_namelist <- c(rep("Information Technology", 3), rep("Health Care", 3), rep("Materials", 3))

# download data from YahooFinance
data_set <- xts()
for (stock_index in 1:length(stock_namelist))
  data_set <- cbind(data_set, Ad(getSymbols(stock_namelist[stock_index], 
                                            from = begin_date, to = end_date, auto.assign = FALSE)))
colnames(data_set) <- stock_namelist
indexClass(data_set) <- "Date"
str(data_set)
#> An 'xts' object on 2016-01-04/2017-12-29 containing:
#>   Data: num [1:503, 1:9] 98.7 96.3 94.4 90.4 90.9 ...
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : NULL
#>   ..$ : chr [1:9] "AAPL" "AMD" "ADI" "ABBV" ...
#>   Indexed by objects of class: [Date] TZ: UTC
#>   xts Attributes:  
#>  NULL
head(data_set)
#>                AAPL  AMD      ADI     ABBV AEZS        A       APD       AA       CF
#> 2016-01-04 98.74225 2.77 49.99239 49.46063 4.40 39.35598 107.89010 23.00764 35.13227
#> 2016-01-05 96.26781 2.75 49.62508 49.25457 4.21 39.22057 105.96097 21.96506 34.03059
#> 2016-01-06 94.38389 2.51 47.51298 49.26315 3.64 39.39467 103.38042 20.40121 31.08988
#> 2016-01-07 90.40047 2.28 46.30082 49.11721 3.29 37.72138  99.91463 19.59558 29.61520
#> 2016-01-08 90.87848 2.14 45.89677 47.77789 3.29 37.32482  99.39687 19.12169 29.33761
#> 2016-01-11 92.35001 2.34 46.98954 46.25827 3.13 36.69613  99.78938 18.95583 28.14919
tail(data_set)
#>                AAPL   AMD      ADI     ABBV AEZS        A      APD    AA       CF
#> 2017-12-21 170.3791 10.89 85.53941 90.49536 2.70 66.46960 156.0460 48.99 39.31251
#> 2017-12-22 170.3791 10.54 85.73240 90.77264 2.45 66.30225 156.2947 49.99 39.75090
#> 2017-12-26 166.0566 10.46 85.52011 90.34747 2.33 66.20380 155.8930 50.38 40.83736
#> 2017-12-27 166.0858 10.53 85.97363 90.66173 2.36 66.25303 156.5337 51.84 41.04702
#> 2017-12-28 166.5531 10.55 86.24380 90.38444 2.41 66.40069 157.4423 54.14 40.60863
#> 2017-12-29 164.7521 10.28 85.90609 89.38623 2.36 66.07411 157.8368 53.87 40.54192

SP500_index <- Ad(getSymbols("^GSPC", from = begin_date, to = end_date, auto.assign = FALSE))
colnames(SP500_index) <- "index"
head(SP500_index)
#>              index
#> 2016-01-04 2012.66
#> 2016-01-05 2016.71
#> 2016-01-06 1990.26
#> 2016-01-07 1943.09
#> 2016-01-08 1922.03
#> 2016-01-11 1923.67
plot(SP500_index)

Now we are ready to do the factor model fitting. The solution to the previous LS fitting is \[ \begin{aligned} \hat{\boldsymbol{\beta}} & = {\sf cov}(\mathbf{x}_t,f_t)/{\sf var}(f_t)\\ \hat{\boldsymbol{\alpha}} & = \bar{x} - \hat{\boldsymbol{\beta}}\bar{f}\\ \hat{\boldsymbol{\epsilon}}_i & = \mathbf{x}_i - \alpha_i\mathbf{1} - \beta_i\mathbf{f}, \quad i=1,\ldots,N\\ \hat{\sigma}_i^2 & = \frac{1}{T-2}\hat{\boldsymbol{\epsilon}}_i^T\hat{\boldsymbol{\epsilon}}_i, \quad \hat{\boldsymbol{\Psi}} = \sf{diag}(\hat{\sigma}_1^2, \ldots, \hat{\sigma}_N^2)\\ \hat{\boldsymbol{\Sigma}} & = {\sf var}(f_t)\hat{\boldsymbol{\beta}}\hat{\boldsymbol{\beta}}^T + \hat{\boldsymbol{\Psi}} \end{aligned} \]

which can be readily implemented in R as follows:

Alternatively, we can do the fitting using a more compact matrix notation (recall time is along the vertical axis in \(\mathbf{X}\)) \[ \mathbf{X}^T = \boldsymbol{\alpha}\mathbf{1}^T + \boldsymbol{\beta}\mathbf{f}^T + \mathbf{E}^T \] and then write the LS fitting as \[ \underset{\boldsymbol{\alpha},\boldsymbol{\beta}}{\text{minimize}}\quad \|\mathbf{X}^T - \boldsymbol{\alpha}\mathbf{1}^T - \boldsymbol{\beta}\mathbf{f}^T\|^2_F. \] More conveniently, we can define \(\boldsymbol{\Gamma} = [ \boldsymbol{\alpha}, \boldsymbol{\beta} ]\) and the extended factors \(\tilde{\mathbf{F}} = [\mathbf{1}, \mathbf{f}]\). The LS formulation can then be written as \[ \underset{\boldsymbol{\Gamma}}{\text{minimize}}\quad \|\mathbf{X}^T - \boldsymbol{\Gamma}\tilde{\mathbf{F}}^T\|^2_F \] with solution \[ \boldsymbol{\Gamma} = \mathbf{X}^T \tilde{\mathbf{F}} (\tilde{\mathbf{F}}^T \tilde{\mathbf{F}})^{-1} \]

As expected, we get the same result regardless of the notation used. Alternatively, we can simply use the R package covFactorModel to do the work for us:

Visualizing covariance matrices

It is interesting to visualize the estimated covariance matrix of the log-returns \(\boldsymbol{\Sigma} = var(f_t)\boldsymbol{\beta}\boldsymbol{\beta}^T + {\sf Diag}(\boldsymbol{\Psi})\), as well as that of the residuals \(\boldsymbol{\Psi}\).

Let’s start with the covariance matrix of the log-returns:

We can observe that all the stocks are highly correlated, which is the effect of the market factor. In order to inspect finer details of the interdependency of the stocks, we should first remove the market or, equivalently, plot \(\boldsymbol{\Psi}\):

Interestingly, we can observe that the automatic clustering performed on \(\boldsymbol{\Psi}\) correctly identifies the sectors of the stocks.

Assessing investment funds

In this example, we will assess the performance of several investment funds based on factor models. We will use the S&P 500 as the explicit market factor and assume a zero risk-free return \(r_f=0\). In particular, we will consider the following six Exchange-Traded Funds (ETFs):

  • SPY - SPDR S&P 500 ETF (index tracking)
  • XIVH - Velocity VIX Short Vol Hedged (high alpha)
  • SPHB - PowerShares S&P 500 High Beta Portfolio (high beta)
  • SPLV - PowerShares S&P 500 Low Volatility Portfolio (low beta)
  • USMV - iShares Edge MSCI Min Vol USA ETF
  • JKD - iShares Morningstar Large-Cap ETF

We start by loading the data:

Now we can compute the alpha and beta of all the ETFs:

Some observations are now in order:

  • SPY is an ETF that follows the S&P 500 and, as expected, it has an alpha almost zero and a beta almost one: \(\alpha=7.142211\times10^{-5}\) and \(\beta=1.0071423\).
  • XIVH is an ETF with a high alpha and indeed the alpha computed is the highest among the ETFs (1-2 orders of magnitude higher): \(\alpha=1.810392\times10^{-3}\).
  • SPHB is an ETF supposedly with a high beta and the computed beta is amongh the hightest but not the highest: \(\beta=1.5613531\). Interestingly, the computed alpha is negative! So this ETF seems to be dangerous and caution should be taken.
  • SPLV is an ETF that aims at having a low volatility and indeed the computed beta is on the low side: \(\beta=0.6777072\).
  • USMV is also an ETF that aims at having a low volatility and indeed the computed beta is the lowest: \(\beta=0.6511671\).
  • JKD seems not to be extreme and shows a good tradeof.

To get a feeling we can use some visualization:

We can also compare the different ETFs in a more systematic way using, for example, the Sharpe ratio. Recalling the factor model for one asset and one factor \[ x_t = \alpha + \beta f_t + \epsilon_t, \quad t=1,\ldots,T, \] we obtain \[ \begin{aligned} E[x_t] & = \alpha + \beta E[f_t]\\ {\sf var}[x_t] & = \beta^2 {\sf var}[f_t] + \sigma^2 \end{aligned} \] and the Sharpe ratio follows: \[ SR = \frac{E[x_t]}{\sqrt{{\sf var}[x_t]}} = \frac{\alpha + \beta E[f_t]}{\sqrt{\beta^2 {\sf var}[f_t] + \sigma^2}} = \frac{\alpha/\beta + E[f_t]}{\sqrt{{\sf var}[f_t] + \sigma^2/\beta^2}} \approx \frac{\alpha/\beta + E[f_t]}{\sqrt{{\sf var}[f_t]}} \] where we have assumed \(\sigma^2 \ll {\sf var}[f_t]\). Thus, a way to rank the different assets based on the Sharpe ratio is to rank them based on the ratio \(\alpha / \beta\):

The following observations are in order:

  • In terms of \(\alpha/\beta\), XIVH is the best (with the largest alpha) and SPHB the worst (with a negative alpha!).
  • In terms of exact Sharpe ratio (more exactly, Information ratio since we are ignoring the risk-free rate), JDK is the best, followed by SPY, which is surprising since it is just the market. This confirms the wisdom that the majority of the investment funds do not outperform the market.
  • SPHB is clearly the worst according to any measure: negative alpha, negative alpha-beta ratio, and exact Sharpe ratio.
  • JDK manages to have the best performance because its alpha is good (although not the best) while at the same time has a moderate beta of 0.88, so its exposure to the market if not extreme.
  • XIVH and SPHB have extreme exposures to the market with large betas.
  • USMV has the smallest exposure to the market with an acceptable alpha, and it’s Sharpe ratio is close to the top second and third.

Fundamental factor models: Fama-French

This example will illustrate the popular Fama-French 3-factor model using nine stocks from the S&P 500.

Let’s start by loading the data:

library(xts)
library(quantmod)

# set begin-end date and stock namelist
begin_date <- "2013-01-01"
end_date <- "2017-08-31"
stock_namelist <- c("AAPL", "AMD", "ADI",  "ABBV", "AEZS", "A",  "APD", "AA","CF")

# download data from YahooFinance
data_set <- xts()
for (stock_index in 1:length(stock_namelist))
  data_set <- cbind(data_set, Ad(getSymbols(stock_namelist[stock_index], 
                                            from = begin_date, to = end_date, auto.assign = FALSE)))
colnames(data_set) <- stock_namelist
indexClass(data_set) <- "Date"

# download Fama-French factors from website
#url <- "http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_Factors_daily_CSV.zip"
#temp <- tempfile()
#download.file(url, temp, method = "libcurl", mode = "wb")
#unzip(temp, "F-F_Research_Data_Factors_daily.CSV")
#unlink(temp)
mydata <- read.csv("F-F_Research_Data_Factors_daily.CSV", skip = 4)
mydata <- mydata[-nrow(mydata), ]  # remove last row
fama_lib <- xts(x = mydata[, c(2,3,4)], order.by = as.Date(paste(mydata[, 1]), "%Y%m%d"))
str(fama_lib)
#> An 'xts' object on 1926-07-01/2017-11-30 containing:
#>   Data: num [1:24120, 1:3] 0.1 0.45 0.17 0.09 0.21 -0.71 0.62 0.04 0.48 0.04 ...
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : NULL
#>   ..$ : chr [1:3] "Mkt.RF" "SMB" "HML"
#>   Indexed by objects of class: [Date] TZ: UTC
#>   xts Attributes:  
#>  NULL
head(fama_lib)
#>            Mkt.RF   SMB   HML
#> 1926-07-01   0.10 -0.24 -0.28
#> 1926-07-02   0.45 -0.32 -0.08
#> 1926-07-06   0.17  0.27 -0.35
#> 1926-07-07   0.09 -0.59  0.03
#> 1926-07-08   0.21 -0.36  0.15
#> 1926-07-09  -0.71  0.44  0.56
tail(fama_lib)
#>            Mkt.RF   SMB   HML
#> 2017-11-22  -0.05  0.10 -0.04
#> 2017-11-24   0.21  0.02 -0.44
#> 2017-11-27  -0.06 -0.36  0.03
#> 2017-11-28   1.06  0.38  0.84
#> 2017-11-29   0.02  0.04  1.45
#> 2017-11-30   0.82 -0.56 -0.50

# compute the log-returns of the stocks and the Fama-French factors
X <- diff(log(data_set), na.pad = FALSE)
N <- ncol(X)  # number of stocks
T <- nrow(X)  # number of days
F <- fama_lib[index(X)]/100

Now we have the three explicit factors in matrix \(\mathbf{F}\) and want to fit the model \[ \mathbf{X}^T = \boldsymbol{\alpha}\mathbf{1}^T + \mathbf{B}\mathbf{F}^T + \mathbf{E}^T \] where now the loadings are a matrix of betas: \(\mathbf{B}=[\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_N]^T=[\mathbf{b}_1, \mathbf{b}_2, \mathbf{b}_3]\). As before, we can do a LS fitting as \[ \underset{\boldsymbol{\alpha},\boldsymbol{\beta}}{\text{minimize}}\quad \|\mathbf{X}^T - \boldsymbol{\alpha}\mathbf{1}^T - \mathbf{B}\mathbf{F}^T\|^2_F. \] More conveniently, we define \(\boldsymbol{\Gamma} = [ \boldsymbol{\alpha}, \mathbf{B} ]\) and the extended factors \(\tilde{\mathbf{F}} = [\mathbf{1}, \mathbf{F}]\). The LS formulation can then be written as \[ \underset{\boldsymbol{\alpha},\boldsymbol{\beta}}{\text{minimize}}\quad \|\mathbf{X}^T - \boldsymbol{\Gamma}\tilde{\mathbf{F}}^T\|^2_F \] with solution \[ \boldsymbol{\Gamma} = \mathbf{X}^T \tilde{\mathbf{F}} (\tilde{\mathbf{F}}^T \tilde{\mathbf{F}})^{-1} \]

Alternatively, we can simply use the R package covFactorModel to do the work for us:

Statistical factor models

Let’s now consider statistical factor models or implicit factor models where both the factors and the loadings are not available. Recall the principal factor method for the model \(\mathbf{X}^T = \boldsymbol{\alpha}\mathbf{1}^T + \mathbf{B}\mathbf{F}^T + \mathbf{E}^T\) with \(K\) factors:

  1. PCA:
    • sample mean: \(\hat{\boldsymbol{\alpha}} = \bar{\mathbf{x}} = \frac{1}{T}\mathbf{X}^T\mathbf{1}_T\)
    • demeaned matrix: \(\bar{\mathbf{X}} = \mathbf{X} - \mathbf{1}_T\bar{\mathbf{x}}^T\)
    • sample covariance matrix: \(\hat{\boldsymbol{\Sigma}} = \frac{1}{T-1}\bar{\mathbf{X}}^T\bar{\mathbf{X}}\)
    • eigen-decomposition: \(\hat{\boldsymbol{\Sigma}} = \hat{\boldsymbol{\Gamma}} \hat{\boldsymbol{\Lambda}} \hat{\boldsymbol{\Gamma}}^T\)
  2. Estimates:
    • \(\hat{\mathbf{B}} = \hat{\boldsymbol{\Gamma}_1} \hat{\boldsymbol{\Lambda}}_1^{1/2}\)
    • \(\hat{\boldsymbol{\Psi}} = {\sf Diag}\left(\hat{\boldsymbol{\Sigma}} - \hat{\mathbf{B}} \hat{\mathbf{B}}^T\right)\)
    • \(\hat{\boldsymbol{\Sigma}} = \hat{\mathbf{B}} \hat{\mathbf{B}}^T + \hat{\boldsymbol{\Psi}}\)
  3. Update the eigen-decomposition: \(\hat{\boldsymbol{\Sigma}} - \hat{\boldsymbol{\Psi}} = \hat{\boldsymbol{\Gamma}} \hat{\boldsymbol{\Lambda}} \hat{\boldsymbol{\Gamma}}^T\)
  4. Repeat Steps 2-3 until convergence.

Again, we can simply use the R package covFactorModel to do the work for us:

Final comparison of covariance matrix estimations via different factor models

We will finally compare the following different factor models:

  • sample covariance matrix
  • macroeconomic 1-factor model
  • fundamental 3-factor Fama-French model
  • statistical factor model

We will estimate the models during a training phase and then we will compare how well the estimated covariance matrices do compared to the sample covariance matrix of the test phase. The estimation error will be evaluated in terms of the Frobenius norm \(\|\hat{\boldsymbol{\Sigma}} - \boldsymbol{\Sigma}_{\sf true}\|_F\) as well as the PRIAL (PeRcentage Improvement in Average Loss): \[ {\sf PRIAL} = 100\times \frac{\|\boldsymbol{\Sigma}_{\sf scm} - \boldsymbol{\Sigma}_{\sf true}\|_F^2 - \|\hat{\boldsymbol{\Sigma}} - \boldsymbol{\Sigma}_{\sf true}\|_F^2}{\|\boldsymbol{\Sigma}_{\sf scm} - \boldsymbol{\Sigma}_{\sf true}\|_F^2} \] which goes to 0 when the estimation \(\hat{\boldsymbol{\Sigma}}\) tends to the sample covariance matrix \(\boldsymbol{\Sigma}_{\sf scm}\) and goes to 100 when the estimation \(\hat{\boldsymbol{\Sigma}}\) tends to the true covariance matrix \(\boldsymbol{\Sigma}_{\sf true}\).

Let’s load the training and test sets:

Now let’s estimate the different factor models with the training data:

# sample covariance matrix
Sigma_SCM <- cov(X_trn)

# 1-factor model
F_ <- cbind(ones = 1, f_SP500_trn)
Gamma <- t(solve(t(F_) %*% F_, t(F_) %*% X_trn))
colnames(Gamma) <- c("alpha", "beta")
alpha <- Gamma[, 1]
beta <- Gamma[, 2]
E <- xts(t(t(X_trn) - Gamma %*% t(F_)), index(X_trn))
Psi <- (1/(T_trn-2)) * t(E) %*% E
Sigma_SP500 <- as.numeric(var(f_SP500_trn)) * beta %o% beta + diag(diag(Psi))

# Fama-French 3-factor model
F_ <- cbind(ones = 1, F_FamaFrench_trn)
Gamma <- t(solve(t(F_) %*% F_, t(F_) %*% X_trn))
colnames(Gamma) <- c("alpha", "beta1", "beta2", "beta3")
alpha <- Gamma[, 1]
B <- Gamma[, 2:4]
E <- xts(t(t(X_trn) - Gamma %*% t(F_)), index(X_trn))
Psi <- (1/(T_trn-4)) * t(E) %*% E
Sigma_FamaFrench <- B %*% cov(F_FamaFrench_trn) %*% t(B) + diag(diag(Psi))

# Statistical 1-factor model
K <- 1
alpha <- colMeans(X_trn)
X_trn_ <- X_trn - matrix(alpha, T_trn, N, byrow = TRUE)
Sigma_prev <- matrix(0, N, N)
Sigma <- (1/(T_trn-1)) * t(X_trn_) %*% X_trn_
eigSigma <- eigen(Sigma)
while (norm(Sigma - Sigma_prev, "F")/norm(Sigma, "F") > 1e-3) {
  B <- eigSigma$vectors[, 1:K, drop = FALSE] %*% diag(sqrt(eigSigma$values[1:K]), K, K)
  Psi <- diag(diag(Sigma - B %*% t(B)))
  Sigma_prev <- Sigma
  Sigma <- B %*% t(B) + Psi
  eigSigma <- eigen(Sigma - Psi)
}
Sigma_PCA1 <- Sigma

# Statistical 3-factor model
K <- 3
alpha <- colMeans(X_trn)
X_trn_ <- X_trn - matrix(alpha, T_trn, N, byrow = TRUE)
Sigma_prev <- matrix(0, N, N)
Sigma <- (1/(T_trn-1)) * t(X_trn_) %*% X_trn_
eigSigma <- eigen(Sigma)
while (norm(Sigma - Sigma_prev, "F")/norm(Sigma, "F") > 1e-3) {
  B <- eigSigma$vectors[, 1:K] %*% diag(sqrt(eigSigma$values[1:K]), K, K)
  Psi <- diag(diag(Sigma - B %*% t(B)))
  Sigma_prev <- Sigma
  Sigma <- B %*% t(B) + Psi
  eigSigma <- eigen(Sigma - Psi)
}
Sigma_PCA3 <- Sigma

# Statistical 5-factor model
K <- 5
alpha <- colMeans(X_trn)
X_trn_ <- X_trn - matrix(alpha, T_trn, N, byrow = TRUE)
Sigma_prev <- matrix(0, N, N)
Sigma <- (1/(T_trn-1)) * t(X_trn_) %*% X_trn_
eigSigma <- eigen(Sigma)
while (norm(Sigma - Sigma_prev, "F")/norm(Sigma, "F") > 1e-3) {
  B <- eigSigma$vectors[, 1:K] %*% diag(sqrt(eigSigma$values[1:K]), K, K)
  Psi <- diag(diag(Sigma - B %*% t(B)))
  Sigma_prev <- Sigma
  Sigma <- B %*% t(B) + Psi
  eigSigma <- eigen(Sigma - Psi)
}
Sigma_PCA5 <- Sigma

Finally, let’s compare the different estimations in the test data:

đŸ‘‰ The final conclusion is that using factor models for covariance matrix estimation may help. In addition, it is not clear that using explicit factors helps as compared to the statistical factor modeling.