This R session will illustrate pairs trading or statistical arbitrage.


Cointegration tests

There is a very convenient package, egcm, that computes the Engle-Granger cointegration test and many other unit-root tests. In particular, given two series \(x(t)\) and \(y(t)\), it searches for parameters \(\alpha\), \(\beta\), and \(\rho\) such that

\[ y(t) = \alpha + \beta x(t) + r(t)\\ r(t) = \rho r(t-1) + \epsilon(t) \] where \(r(t)\) is the residual and \(\epsilon(t)\) the innovation. If \(|\rho|<1\), then \(x(t)\) and \(y(t)\) are cointegrated (i.e., \(r(t)\) doesn’t contain a unit root). Of course, the difficulty is in assessing exactly how much smaller than 1 the value of \(|\rho|\) has to be; this is the task of the unit-root test. The package urca contains more general models for cointegration. Perhaps the most famous example of a unit-root test is the Augmented Dickey-Fuller (ADF) test, whose null hypothesis is that a unit root is present and whose alternative hypothesis is that the series is stationary (so a small \(p\)-value is evidence of stationarity).1
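
As a quick illustration of the ADF test, consider a simulated random walk versus a stationary AR(1) series (a sketch using the tseries package, which is not otherwise used in this session):

library(tseries)  #install.packages("tseries")
set.seed(42)
x_rw <- cumsum(rnorm(200))                           # random walk: contains a unit root
x_ar1 <- as.numeric(arima.sim(list(ar = 0.5), 200))  # stationary AR(1)
adf.test(x_rw)$p.value    # expect a large p-value: unit root not rejected
adf.test(x_ar1)$p.value   # expect a small p-value: evidence of stationarity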

Another interesting quantity to measure mean-reversion is the half-life, defined as the time it takes for the spread to revert half way back to its mean after having diverged from it.
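
Since deviations of an AR(1) residual decay on average as \(\rho^{k}\) after \(k\) periods, the half-life follows from solving \(\rho^{k}=1/2\): \[ k_{1/2} = \frac{\ln(1/2)}{\ln\rho}. \]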

We now use the package egcm to check the cointegration of several illustrative examples:

library(quantmod)
library(egcm)  #install.packages("egcm")
egcm.set.default.pvalue(0.01)

SPY vs IVV

SPY and IVV are both ETFs that track the S&P 500, so obviously they are cointegrated:

SPY_prices <- Ad(getSymbols("SPY", from = "2013-01-01", to = "2013-12-31", auto.assign = FALSE))
IVV_prices <- Ad(getSymbols("IVV", from = "2013-01-01", to = "2013-12-31", auto.assign = FALSE))
plot(cbind(SPY_prices, IVV_prices), legend.loc = "topleft", main = "ETF prices")

res <- egcm(SPY_prices, IVV_prices)
summary(res)
#> IVV.Adjusted[i] =   1.0077 SPY.Adjusted[i] -   0.5419 + R[i], R[i] =   0.0139 R[i-1] + eps[i], eps ~ N(0,  0.0480^2)
#>                    (0.0003)                   (0.0529)                (0.0624)
#> 
#> R[2013-12-30] = -0.0281 (t = -0.577)
#> 
#> Unit Root Tests of Residuals
#>                                                     Statistic    p-value
#>   Augmented Dickey Fuller (ADF)                        -6.042    0.00010
#>   Phillips-Perron (PP)                               -261.890    0.00010
#>   Pantula, Gonzales-Farias and Fuller (PGFF)            0.007    0.00010
#>   Elliott, Rothenberg and Stock DF-GLS (ERSD)          -1.259    0.45122
#>   Johansen's Trace Test (JOT)                        -103.605    0.00010
#>   Schmidt and Phillips Rho (SPR)                     -172.446    0.00010
#> 
#> Variances
#>   SD(diff(SPY.Adjusted)) =   1.004723
#>   SD(diff(IVV.Adjusted)) =   1.011531
#>   SD(diff(residuals))  =   0.068125
#>   SD(residuals)        =   0.048626
#>   SD(innovations)      =   0.047976
#> 
#> Half life       =   0.162012
#> R[last]         =  -0.028071 (t=-0.58)
plot(res)
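
As a sanity check, the half-life reported above can be recovered from the fitted AR(1) coefficient (stored as res$rho in the fitted egcm object):

log(0.5)/log(res$rho)  # ≈ 0.162, matching the half-life in the summary above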

GLD vs IAU

GLD and IAU both track the price of gold and they are cointegrated:

GLD_prices <- Ad(getSymbols("GLD", from = "2013-01-01", to = "2013-12-31", auto.assign = FALSE))
IAU_prices <- Ad(getSymbols("IAU", from = "2013-01-01", to = "2013-12-31", auto.assign = FALSE))
plot(cbind(GLD_prices, IAU_prices), legend.loc = "topright", main = "ETF prices")

plot(cbind(GLD_prices/as.numeric(GLD_prices[1]), IAU_prices/as.numeric(IAU_prices[1])), 
     legend.loc = "topright", main = "ETF normalized prices")

plot(log(cbind(GLD_prices, IAU_prices)), legend.loc = "topleft", main = "ETF log-prices")

res <- egcm(GLD_prices, IAU_prices, log = TRUE)
print(res)
#> IAU.Adjusted[i] =   0.9967 GLD.Adjusted[i] -   2.2813 + R[i], R[i] =   0.2504 R[i-1] + eps[i], eps ~ N(0,  0.0005^2)
#>                    (0.0003)                   (0.0015)                (0.0614)
#> 
#> R[2013-12-30] = -0.0004 (t = -0.722)
plot(res)

Coca-Cola vs Pepsi

Coca-Cola and Pepsi are often mentioned as an example of a pair of securities for which pairs trading may be fruitful. However, at least in 2013, they were not cointegrated.

KO_prices <- Ad(getSymbols("KO", from = "2013-01-01", to = "2013-12-31", auto.assign = FALSE))
PEP_prices <- Ad(getSymbols("PEP", from = "2013-01-01", to = "2013-12-31", auto.assign = FALSE))
# check cointegration of prices
plot(cbind(KO_prices/as.numeric(KO_prices[1]), PEP_prices/as.numeric(PEP_prices[1])), 
     legend.loc = "topleft", main = "Stock normalized prices")

print(egcm(KO_prices, PEP_prices))
#> PEP.Adjusted[i] =   2.1843 KO.Adjusted[i] -   3.6323 + R[i], R[i] =   0.9922 R[i-1] + eps[i], eps ~ N(0,  0.5935^2)
#>                    (0.1237)                  (4.1252)                (0.0141)
#> 
#> R[2013-12-30] = -0.5379 (t = -0.204)
#> 
#> WARNING: KO.Adjusted and PEP.Adjusted do not appear to be cointegrated.
# check cointegration of log-prices
plot(log(cbind(KO_prices, PEP_prices)), legend.loc = "topleft", main = "Stock log-prices")

res <- egcm(KO_prices, PEP_prices, log = TRUE)
print(res)
#> PEP.Adjusted[i] =   1.0941 KO.Adjusted[i] +   0.3996 + R[i], R[i] =   0.9901 R[i-1] + eps[i], eps ~ N(0,  0.0088^2)
#>                    (0.0599)                  (0.2098)                (0.0143)
#> 
#> R[2013-12-30] = -0.0088 (t = -0.228)
#> 
#> WARNING: KO.Adjusted and PEP.Adjusted do not appear to be cointegrated.
plot(res)

Ford vs GM

Ford and GM seemed to be even more tightly linked. Yet, the degree of linkage was not high enough to pass the cointegration test.

F_prices <- Ad(getSymbols("F", from = "2013-01-01", to = "2013-12-31", auto.assign = FALSE))
GM_prices <- Ad(getSymbols("GM", from = "2013-01-01", to = "2013-12-31", auto.assign = FALSE))
plot(log(cbind(F_prices, GM_prices)), legend.loc = "topleft", main = "Stock log-prices")

res <- egcm(F_prices, GM_prices, log = TRUE)
print(res)
#> GM.Adjusted[i] =   0.9117 F.Adjusted[i] +   1.0328 + R[i], R[i] =   1.0000 R[i-1] + eps[i], eps ~ N(0,  0.0118^2)
#>                   (0.0260)                 (0.0650)                (0.0155)
#> 
#> R[2013-12-30] = 0.1900 (t = 3.830)
#> 
#> WARNING: F.Adjusted and GM.Adjusted do not appear to be cointegrated.
plot(res)

LS regression for pairs trading

Let \(y_{1t}\) and \(y_{2t}\) denote the log-prices of two stocks. If they are cointegrated, then the spread \(z_{t}=y_{1t}-\gamma y_{2t}\) is stationary and can be written as2 \[z_{t}=y_{1t}-\gamma y_{2t}=\mu+\epsilon_{t}\] where \(\mu\) represents the equilibrium value and \(\epsilon_{t}\) is a zero-mean residual. Equivalently, it can be written as \[y_{1t}=\mu+\gamma y_{2t}+\epsilon_{t}\] which now has the typical form for linear regression.

The Least Squares (LS) regression over \(T\) observations is \[ \underset{\mu,\gamma}{\textsf{minimize}} \quad \sum_{t=1}^{T}\left(y_{1t}-\left(\mu+\gamma y_{2t}\right)\right)^{2}. \] More compactly, by stacking the \(T\) observations in the vectors \(\mathbf{y}_{1}\) and \(\mathbf{y}_{2}\), we can write: \[ \underset{\mu,\gamma}{\textsf{minimize}} \quad \left\Vert \mathbf{y}_{1}-\left(\mu\mathbf{1}+\gamma\mathbf{y}_{2}\right)\right\Vert ^{2}. \]
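
Since this is a simple linear regression, the solution is available in closed form from the normal equations. As a minimal sketch (equivalent to the coef(lm()) call used below):

# closed-form LS solution of the regression y1 ~ y2
LS_mu_gamma <- function(y1, y2) {
  y1 <- as.numeric(y1); y2 <- as.numeric(y2)
  gamma <- cov(y2, y1)/var(y2)     # slope
  mu <- mean(y1) - gamma*mean(y2)  # intercept
  c(mu = mu, gamma = gamma)
}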

We start by loading some stock prices for our tests:

library(xts)
library(quantmod)

# set begin-end dates and stock namelist
begin_date <- "2000-07-01"
end_date <- "2013-12-31"
stock_namelist <- c("0388.HK", "0688.HK")

# download data from YahooFinance
prices <- xts()
for (stock_index in 1:length(stock_namelist))
  prices <- cbind(prices, Ad(getSymbols(stock_namelist[stock_index], 
                                        from = begin_date, to = end_date, auto.assign = FALSE)))
colnames(prices) <- stock_namelist
indexClass(prices) <- "Date"
T <- nrow(prices)  # number of days

# interpolate NAs
anyNA(prices)
#> [1] TRUE
prices <- na.approx(prices)
# some visual exploration
str(prices)
#> An 'xts' object on 2000-07-03/2013-12-31 containing:
#>   Data: num [1:3399, 1:2] 6.49 6.6 6.54 6.21 6.46 ...
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : NULL
#>   ..$ : chr [1:2] "0388.HK" "0688.HK"
#>   Indexed by objects of class: [Date] TZ: UTC
#>   xts Attributes:  
#>  NULL
head(prices)
#>             0388.HK  0688.HK
#> 2000-07-03 6.485238 0.446220
#> 2000-07-04 6.596574 0.452011
#> 2000-07-05 6.540904 0.475193
#> 2000-07-06 6.206903 0.463605
#> 2000-07-07 6.457407 0.463605
#> 2000-07-10 6.735743 0.480990
tail(prices)
#>             0388.HK  0688.HK
#> 2013-12-20 113.5739 18.70984
#> 2013-12-23 113.8391 18.45640
#> 2013-12-24 115.6951 18.49871
#> 2013-12-27 115.4300 18.41417
#> 2013-12-30 114.8997 18.24527
#> 2013-12-31 114.2810 18.41417
plot(prices, legend.loc = "topleft", main = "Stock prices")

Observe that from a visual inspection of the prices it is not at all clear that the two stocks could be cointegrated. However, once we plot the log-prices, the picture becomes much clearer:

logprices <- log(prices)
plot(logprices, legend.loc = "topleft", main = "Stock log-prices")

First of all, let’s try to estimate \(\gamma\) and \(\mu\) from a training period and then use it in the test period:

T_trn <- round(0.7*T)  # define the training set
T_tst <- T - T_trn
y1 <- logprices[, 1]
y2 <- logprices[, 2]

# do LS regression
ls_coeffs <- coef(lm(y1[1:T_trn] ~ y2[1:T_trn]))
ls_coeffs
#> (Intercept) y2[1:T_trn] 
#>    2.435953    0.864618
mu <- ls_coeffs[1]
gamma <- ls_coeffs[2]

We can plot the log-prices of one stock and its regression based on the other:

tmp <- cbind(y1, mu + gamma*y2)
colnames(tmp) <- c(colnames(y1), paste("mu + gamma x", colnames(y2)))
{ plot(tmp, legend.loc = "topleft", main = "Regression of y1 from y2")
  addEventLines(xts("training", index(y1[T_trn])), srt = 90, pos = 2, lwd = 2, col = "blue") }

More clearly, we can plot the spread to observe its mean-reversion property:

# spread
spread <- y1 - gamma*y2
{ plot(spread, main = "Spread")
  addEventLines(xts("", index(y1[T_trn])), lwd = 2, col = "blue") }

Indeed, the spread seems to be mean-reverting. However, one can observe that the mean-reversion is much stronger during the training period than during the test period. This is a bit dangerous: it is one thing for the spread to be mean-reverting during some period, and a very different thing for that mean-reversion to persist over subsequent periods. If it is not persistent enough, then it is very dangerous to trade the spread. Indeed, the author of the package egcm wrote the following paper exploring precisely the persistence of cointegration in the markets:

Clegg, Matthew, On the Persistence of Cointegration in Pairs Trading (January 28, 2014). Available at SSRN: https://ssrn.com/abstract=2491201 or http://dx.doi.org/10.2139/ssrn.2491201
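
As a rough sketch, one could probe this persistence directly by re-running the cointegration test over calendar-year windows (is.cointegrated() comes with the egcm package; prices is the pair loaded above):

# check cointegration of the pair year by year
for (yr in unique(format(index(prices), "%Y"))) {
  res_yr <- egcm(prices[yr, 1], prices[yr, 2], log = TRUE)
  cat(yr, "- cointegrated:", is.cointegrated(res_yr), "\n")
}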

We can now run a cointegration test:

library(egcm)
#coint_result <- egcm(y1, y2, p.value = 0.01)  
coint_result <- egcm(y1, y2, p.value = 0.05)
print(coint_result)
#> 0688.HK[i] =   1.1333 0388.HK[i] -   2.7359 + R[i], R[i] =   0.9937 R[i-1] + eps[i], eps ~ N(0,  0.0287^2)
#>               (0.0032)              (0.0187)                (0.0022)
#> 
#> R[2013-12-31] = 0.2787 (t = 1.234)
#> 
#> WARNING: The series seem cointegrated but the residuals are not AR(1).
plot(coint_result)

Trading the spread

The spread formed as \(z_{t}=y_{1t}-\gamma y_{2t}\) is effectively equivalent to using the portfolio \(\mathbf{w} = \left[\begin{array}{c} 1\\ -\gamma \end{array}\right]\). In order to make fair comparisons, we will fix the leverage of the portfolio so that \(\| \mathbf{w}\|_1 = 1\). The normalized portfolio is then \(\mathbf{w} = \left[\begin{array}{c} 1/(1+\gamma)\\ -\gamma/(1+\gamma) \end{array}\right]\).

Let’s form a more realistic spread based on the normalized portfolio:

# normalized portfolio
w_ref <- c(1, -gamma)/(1+gamma)
sum(abs(w_ref))
#> [1] 1
w_spread <- matrix(w_ref, T, 2, byrow = TRUE)
w_spread <- xts(w_spread, index(y1))
colnames(w_spread) <- c("w1", "w2")

# resulting normalized spread
spread <- rowSums(cbind(y1, y2) * w_spread)
spread <- xts(spread, index(y1))
colnames(spread) <- "spread"
{ plot(spread, main = "Spread (from normalized portfolio)")
  addEventLines(xts("", index(y1[T_trn])), lwd = 2, col = "blue") }

The Z-score is a normalized version of the spread, which is useful to generate the trading signal. In particular, it is defined as \[Z^{\textsf{score}}_t = \frac{z_t-\textsf{E}[z_t]}{\textsf{Std}[z_t]}.\] Now let’s generate the Z-score:

spread_mean <- mean(spread[1:T_trn], na.rm = TRUE)
spread_var <- as.numeric(var(spread[1:T_trn], na.rm = TRUE))
Z_score <- (spread-spread_mean)/sqrt(spread_var)
colnames(Z_score) <- "Z-score"
threshold_long <- threshold_short <- Z_score
threshold_short[] <- .7
threshold_long[] <- -.7
{ plot(Z_score, main = "Z-score")
  lines(threshold_short, lty = 2)
  lines(threshold_long, lty = 2) 
  addEventLines(xts("", index(y1[T_trn])), lwd = 2, col = "blue") }

We are ready to generate the trading signal (0: no position, 1: long position, -1: short position):

# we define a function for convenience and future use
generate_signal <- function(Z_score, threshold_long, threshold_short) {
  signal <- Z_score
  colnames(signal) <- "signal"
  signal[] <- NA
  
  #initial position
  signal[1] <- 0
  if (Z_score[1] <= threshold_long[1]) {
    signal[1] <- 1
  } else if (Z_score[1] >= threshold_short[1])
    signal[1] <- -1

  # loop
  for (t in 2:nrow(Z_score)) {
    if (signal[t-1] == 0) {  #if we were in no position
      if (Z_score[t] <= threshold_long[t]) {
        signal[t] <- 1
      } else if(Z_score[t] >= threshold_short[t]) {
        signal[t] <- -1
      } else signal[t] <- 0
    } else if (signal[t-1] == 1) {  #if we were in a long position
      if (Z_score[t] >= 0) signal[t] <- 0
      else signal[t] <- signal[t-1]
    } else {  #if we were in a short position
      if (Z_score[t] <= 0) signal[t] <- 0
      else signal[t] <- signal[t-1]
    }
  }
  return(signal)
}

# now just invoke the function
signal <- generate_signal(Z_score, threshold_long, threshold_short)
{ plot(cbind(Z_score, signal), main = "Z-score and trading signal", legend.loc = "topleft")
  addEventLines(xts("", index(y1[T_trn])), lwd = 2, col = "blue") }

Finally, we can compute the return and P&L of the strategy. A simple way of doing this is directly from the spread and the trading signal:

# let's compute the PnL directly from the signal and spread
spread_return <- diff(spread)
traded_return <- spread_return * lag(signal)   # NOTE THE LAG!!
traded_return[is.na(traded_return)] <- 0
colnames(traded_return) <- "traded spread"
{ plot(traded_return, main = "Return of traded spread")
  addEventLines(xts("", index(y1[T_trn])), lwd = 2, col = "blue") }

We have two different ways to compute the wealth or cumulative P&L:

  • without reinvestment or not compounded: the same initial budget of, say, $1 is invested every time we enter a trade (even if the wealth at some point is $10, still only $1 is invested). The code is 1 + cumsum(traded_return)
  • with reinvestment or compounded: all the existing wealth at each time is fully reinvested. The code is cumprod(1 + traded_return) (see the toy example below)
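
To make the difference concrete, here is a toy example with hypothetical returns:

r_toy <- c(0.10, -0.05, 0.20)  # hypothetical returns of the traded spread
1 + cumsum(r_toy)              # no reinvestment: 1.100 1.050 1.250
cumprod(1 + r_toy)             # with reinvestment: 1.100 1.045 1.254
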
{ plot(1 + cumsum(traded_return), main = "Cum P&L of traded spread (no reinvestment)")
  addEventLines(xts("", index(y1[T_trn])), lwd = 2, col = "blue") }

{ plot(cumprod(1 + traded_return), main = "Cum P&L of traded spread (w/ reinvestment)")
  addEventLines(xts("", index(y1[T_trn])), lwd = 2, col = "blue") }

We can observe that the wealth increases steadily during the training period but flattens out during the test or out-of-sample period (which is the one that really counts!). So in practice, this wouldn’t have made a good return.

Recall that the same plots can be obtained with the package PerformanceAnalytics (the argument geometric determines whether it is compounded or not):

library(PerformanceAnalytics)
{ chart.CumReturns(traded_return, main = "Cum P&L of traded spread (w/ reinvestment)", 
                   geometric = TRUE, wealth.index = TRUE)
  addEventLines(xts("", index(y1[T_trn])), lwd = 2, col = "blue") }

And it is very instructive to plot the drawdown:

{ chart.Drawdown(traded_return, main = "Drawdown", lwd = 1)
  addEventLines(xts("", index(y1[T_trn])), lwd = 2, col = "blue") }

Let’s zoom in during the period 2000-2006 to observe the connection between the spread and the cumulative P&L:

par(mfrow = c(2, 1))
plot(cbind(Z_score, signal)["2000::2006"], main = "Z-score and trading signal", 
     legend.loc = "bottomleft")
plot(1 + cumsum(traded_return)["2000::2006"], main = "Cum P&L of traded spread")

A more systematic and foolproof way to compute the return and P&L of the strategy is via the equivalent portfolio. In fact, the previous way of calculating the P&L directly from the spread can lead to wrong results if \(\gamma\) or \(\mu\) change over time. If that happens, the spread may look very nice, but it is not realistic because once we are in a trade we shouldn't change \(\gamma\) or \(\mu\). Let's now use the equivalent portfolio approach.

# combine the ref portfolio with trading signal
w_portf <- w_spread * matrix(lag(signal), T, 2)   # NOTE THE LAG!!

# now compute the PnL from the log-prices and the portfolio
X <- diff(cbind(y1, y2))  #compute log-returns from log-prices
portf_return <- xts(rowSums(X * w_portf), index(X))
portf_return[is.na(portf_return)] <- 0
colnames(portf_return) <- "portfolio"
par(mfrow = c(2, 1))
{ plot(cbind(Z_score, signal), main = "Z-score and trading signal", legend.loc = "bottomleft")
  lines(threshold_short, lty = 2)
  lines(threshold_long, lty = 2) 
  addEventLines(xts("", index(y1[T_trn])), lwd = 2, col = "blue") }
{ plot(cumprod(1 + portf_return), main = "Cum P&L of traded spread")
  addEventLines(xts("", index(y1[T_trn])), lwd = 2, col = "blue") }

A second look at LS pairs trading

Let’s load data from several ETFs:

# set begin-end dates and stock namelist
begin_date <- "2000-01-01"
end_date <- "2005-12-31"  #"2018-04-30"
stock_namelist <- c("EWA", "EWK", "EWO", "EWC", "EWQ", "EWG", "EWH", "EWI", "EWJ", "EWM", "EWW", 
                    "EWN", "EWS", "EWP", "EWD", "EWL", "EWY", "EZU", "EWU", "EWZ", "EWT", "SPY")

# download data from YahooFinance
prices <- xts()
for (stock_index in 1:length(stock_namelist))
  prices <- cbind(prices, Ad(getSymbols(stock_namelist[stock_index], 
                                        from = begin_date, to = end_date, auto.assign = FALSE)))
colnames(prices) <- stock_namelist
indexClass(prices) <- "Date"
Y <- log(prices)
T <- nrow(prices)  # number of days

In order to compare different pairs conveniently, we define a function that computes the spread, trades it, and plots the results:

analyze_spread_LS <- function(Y, pct_training = 0.7, threshold = 0.7, legend_loc = "topleft") {
  if(anyNA(Y)) 
    Y <- na.approx(Y)
  T <- nrow(Y)
  T_trn <- round(pct_training*T)
  
  # LS regression
  gamma <- coef(lm(Y[1:T_trn, 1] ~ Y[1:T_trn, 2]))[2]

  # spread portfolio
  w_spread <- matrix(c(1, -gamma)/(1+gamma), T, 2, byrow = TRUE)
  w_spread <- xts(w_spread, index(Y))
  colnames(w_spread) <- c("w1", "w2")

  # spread
  spread <- rowSums(Y * w_spread)
  spread <- xts(spread, index(Y))
  colnames(spread) <- paste0(colnames(Y)[1], "-", colnames(Y)[2])
  
  # Z-score
  spread_mean <- mean(spread[1:T_trn])
  spread_var <- as.numeric(var(spread[1:T_trn]))
  Z_score <- (spread-spread_mean)/sqrt(spread_var)
  threshold_long <- threshold_short <- Z_score
  threshold_short[] <- threshold
  threshold_long[] <- -threshold
  
  # trading signal
  signal <- generate_signal(Z_score, threshold_long, threshold_short)

  # combine the ref portfolio with trading signal
  w_portf <- w_spread * matrix(lag(signal), T, 2)   # NOTE THE LAG!!
  
  # now compute the PnL from the log-prices and the portfolio
  X <- diff(Y)  #compute log-returns from log-prices
  portf_return <- xts(rowSums(X * w_portf), index(X))
  portf_return[is.na(portf_return)] <- 0
  
  # plots
  tmp <- cbind(Z_score, signal)
  colnames(tmp) <- c("Z-score", "signal")
  par(mfrow = c(2, 1))
  { plot(tmp, legend.loc = legend_loc,
         main = paste("Z-score and trading signal for", colnames(Y)[1], "vs", colnames(Y)[2]))
    lines(threshold_short, lty = 2)
    lines(threshold_long, lty = 2)
    print(addEventLines(xts("", index(Y[T_trn])), lwd = 2, col = "blue")) }
  { plot(cumprod(1 + portf_return), main = "Cum P&L")
    print(addEventLines(xts("", index(Y[T_trn])), lwd = 2, col = "blue")) }
  
  ## more plots
  #print(plot(cumprod(1 + portf_return[T_trn:T]), main = "Cum P&L"))
  #print(chart.Drawdown(portf_return[T_trn:T], main = "Drawdown", lwd = 1))
  par(mfrow = c(1, 1))
}

Now we are ready to explore different pairs during different periods:

# a good cointegrated case
analyze_spread_LS(Y["2000-08::2002-01", c("EWH", "EWZ")], pct_training = 0.55, legend_loc = "bottomleft")

# a case where cointegration is lost
analyze_spread_LS(Y["2000-07::2001", c("EWY", "EWT")], pct_training = 0.55)

Time-varying parameters

In practice, the parameters \(\gamma\) and \(\mu\) may be time-varying and the spread becomes \[z_{t} = y_{1t}-\gamma_t y_{2t} = \mu_t+\epsilon_{t}\] where \(\mu_t\) is the slowly time-varying equilibrium value and \(\gamma_t\) is the slowly time-varying cointegration factor.

In this section, we will generate synthetic data so that we can conveniently explore the tracking behavior of the different alternatives and later on we will use real market data.

The synthetic log-price data follows \[\begin{aligned} y_{1t} &= \mu_t + \gamma_t x_{t} + \epsilon_{1t}\\ y_{2t} &= x_{t} + \epsilon_{2t} \end{aligned}\] where \(\epsilon_{1t}\), \(\epsilon_{2t}\) are AR(1) residual terms and \(x_t\) is a stochastic common trend defined as a random walk: \[x_{t}=x_{t-1}+w_{t}\] with \(w_{t}\) an i.i.d. residual term.

library(xts)
library(MASS)
library(MTS)
set.seed(473)
T_syn <- 1000
# create mu and gamma
#mu_true <- xts(rep(0.5, T_syn), order.by = as.Date("2010-01-01") + 0:(T_syn-1))
mu_true <- xts(c(rep(0.6, T_syn/4), rep(0.4, T_syn/2), rep(0.6, T_syn/4)), 
               order.by = as.Date("2010-01-01") + 0:(T_syn-1))
colnames(mu_true) <- "mu-true"
#gamma_true <- xts(rep(0.4, T_syn), index(mu_true))
gamma_true <- xts(c(rep(0.2, T_syn/2), rep(0.8, T_syn/2)), index(mu_true))
colnames(gamma_true) <- "gamma-true"
# smooth parameters
mu_true[] <- filter(mu_true, rep(1, 50)/50, sides = 1)
mu_true[] <- filter(mu_true, rep(1, 50)/50, sides = 1)
mu_true <- na.locf(mu_true, fromLast = TRUE)
gamma_true[] <- filter(gamma_true, rep(1, 50)/50, sides = 1)
gamma_true[] <- filter(gamma_true, rep(1, 50)/50, sides = 1)
gamma_true <- na.locf(gamma_true, fromLast = TRUE)
plot(cbind(mu_true, gamma_true), col = c("blue", "darkgreen"),
     legend.loc = "topleft", main = "True values")

# create common trend as random walk 
#  (with ann volatility 50% or daily volatility = 50%/sqrt(252) = 0.03149, scaled down by 0.2)
trend_sd <- 0.2*0.03149
trend <- cumsum(MASS::mvrnorm(T_syn, 0, trend_sd^2))  # random walk with Gaussian innovations

# create VAR(1) residual
#  (with ann volatility 40% or daily volatility = 40%/sqrt(252) = 0.0252)
res_sd <- 0.1*0.0252
residual <- MTS::VARMAsim(T_syn, arlags = 1, phi = 0.9*diag(2), sigma = res_sd^2*diag(2))$series  

# create the synthetic log-prices
Y_syn <- cbind(mu_true, 0) + cbind(gamma_true, 1)*trend + residual
colnames(Y_syn) <- c("Y1 = mu + gamma*X + W1", "Y2 = X + W2") 
plot(Y_syn, legend.loc = "left", main = "Synthetic log-price data")

We can now proceed to the estimation of \(\mu_t\) and \(\gamma_t\). We will start with the previous vanilla LS regression, then we will consider a rolling LS to capture the time variation, and finally we will explore Kalman filtering for optimal tracking.

Static LS

Let’s start by using the previous vanilla LS regression to estimate a constant \(\mu\) and \(\gamma\):

# LS regression
ls_coeffs <- coef(lm(Y_syn[1:(0.3*T_syn), 1] ~ Y_syn[1:(0.3*T_syn), 2]))
mu_LS <- mu_true
mu_LS[] <- ls_coeffs[1]
colnames(mu_LS) <- "mu-LS"
gamma_LS <- gamma_true
gamma_LS[] <- ls_coeffs[2]
colnames(gamma_LS) <- "gamma-LS"
# plots
par(mfrow = c(2, 1))
plot(cbind(mu_true, mu_LS), legend.loc = "bottomleft", main = "LS estimation of mu")
plot(cbind(gamma_true, gamma_LS), legend.loc = "left", main = "LS estimation of gamma")

As we can see, the estimation is quite bad and we need some time-varying estimate.

Rolling LS

The rolling LS is simply an LS fit performed on a rolling window basis, defined by the window length T_lookback: \[ \underset{\mu_{t_0},\gamma_{t_0}}{\textsf{minimize}} \quad \sum_{t=t_0-T_\textsf{lookback}+1}^{t_0}\left(y_{1t}-\left(\mu_{t_0}+\gamma_{t_0} y_{2t}\right)\right)^{2}. \]

# rolling LS
T_lookback <- 400  # lookback window length
T_shift <- 10  # how often the fit is refreshed

# init empty variables
mu_rolling_LS <- xts(rep(NA, T_syn), index(mu_true))
colnames(mu_rolling_LS) <- "mu-rolling-LS"
gamma_rolling_LS <- xts(rep(NA, T_syn), index(mu_true))
colnames(gamma_rolling_LS) <- "gamma-rolling-LS"

# loop
t0_update <- seq(from = T_lookback, to = T_syn-T_shift, by = T_shift)
t0_update
#>  [1] 400 410 420 430 440 450 460 470 480 490 500 510 520 530 540 550 560 570 580 590 600 610 620 630 640 650 660 670 680
#> [30] 690 700 710 720 730 740 750 760 770 780 790 800 810 820 830 840 850 860 870 880 890 900 910 920 930 940 950 960 970
#> [59] 980 990
for (t0 in t0_update) {
  ls_coeffs <- coef(lm(Y_syn[(t0-T_lookback+1):t0, 1] ~ Y_syn[(t0-T_lookback+1):t0, 2], 
                       weights = 2^(1:T_lookback)))  # exponentially increasing weights favor recent samples
  mu_rolling_LS[t0+1] <- ls_coeffs[1]
  gamma_rolling_LS[t0+1] <- ls_coeffs[2]
}
# complete values and smooth
plot(mu_rolling_LS, type = "o", pch = 16)

mu_rolling_LS <- na.locf(mu_rolling_LS)
plot(mu_rolling_LS)

mu_rolling_LS <- na.locf(mu_rolling_LS, fromLast = TRUE)
plot(mu_rolling_LS)

mu_rolling_LS[] <- filter(mu_rolling_LS, rep(1, T_lookback/4)/(T_lookback/4), sides = 1)
mu_rolling_LS <- na.locf(mu_rolling_LS, fromLast = TRUE)
plot(mu_rolling_LS)

gamma_rolling_LS <- na.locf(gamma_rolling_LS)
gamma_rolling_LS <- na.locf(gamma_rolling_LS, fromLast = TRUE)
gamma_rolling_LS[] <- filter(gamma_rolling_LS, rep(1, T_lookback/4)/(T_lookback/4), sides = 1)
gamma_rolling_LS <- na.locf(gamma_rolling_LS, fromLast = TRUE)
# plots
par(mfrow = c(2, 1))
plot(cbind(mu_true, mu_LS, mu_rolling_LS), legend.loc = "left", main = "Tracking of mu")
plot(cbind(gamma_true, gamma_LS, gamma_rolling_LS), legend.loc = "topleft", main = "Tracking of gamma")

The rolling LS can adapt to the changes, but it requires some time to converge, depending on the lookback window length. Of course, if the lookback window length is too short, then the estimation will be too noisy. So it is difficult to find a good tradeoff between adaptability and estimation error.
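
To make this tradeoff concrete, here is a quick sketch (reusing Y_syn and gamma_true from above) that measures the tracking error of the rolling-LS \(\gamma\) for several lookback lengths:

# sketch: tracking error of the rolling-LS gamma vs lookback window length
for (T_lb in c(100, 200, 400)) {
  gamma_est <- rep(NA, T_syn)
  for (t0 in seq(T_lb, T_syn, by = 10))
    gamma_est[t0] <- coef(lm(Y_syn[(t0-T_lb+1):t0, 1] ~ Y_syn[(t0-T_lb+1):t0, 2]))[2]
  rmse <- sqrt(mean((gamma_est - gamma_true)^2, na.rm = TRUE))
  cat("lookback =", T_lb, "-> RMSE of gamma:", round(rmse, 4), "\n")
}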

Kalman

The Kalman formulation of the time-varying fit, using the notation of the package KFAS, is given by the following state transition equation and observation equation: \[\begin{aligned} \boldsymbol{\alpha}_{t+1} &= \mathbf{T}_t\boldsymbol{\alpha}_{t} + \mathbf{R}_t\boldsymbol{\eta}_{t}\\ y_{1t} &= \mathbf{Z}_{t}\boldsymbol{\alpha}_{t}+\epsilon_{t} \end{aligned}\] where

  • \(\boldsymbol{\alpha}_{t}\triangleq\left[\begin{array}{c}\mu_{t}\\ \gamma_{t}\end{array}\right]\) is the hidden state (with \(\boldsymbol{\alpha}_1 \sim \mathcal{N}\left(\mathbf{a}_1,\mathbf{P}_1\right)\))
  • \(\mathbf{T}_t\triangleq\left[\begin{array}{cc} 1 & 0\\ 0 & 1\end{array}\right]\) is the state transition matrix
  • \(\mathbf{R}_t\triangleq\left[\begin{array}{cc} 1 & 0\\ 0 & 1\end{array}\right]\)
  • \(\boldsymbol{\eta}_{t}\sim\mathcal{N}\left(\mathbf{0},\mathbf{Q}_t\right)\) is the i.i.d. state transition noise with \(\mathbf{Q}_t=\left[\begin{array}{cc} \sigma_{1}^{2} & 0\\ 0 & \sigma_{2}^{2} \end{array}\right]\)
  • \(\mathbf{Z}_{t}\triangleq\left[\begin{array}{cc} 1 & y_{2t}\end{array}\right]\) is the observation coefficient matrix
  • \(\epsilon_{t}\sim\mathcal{N}\left(0,h_t\right)\) is the i.i.d. observation noise (with \(h_t=h\)).

library(KFAS)  #install.packages("KFAS")

# init empty variables
gamma_Kalman_filtering <- mu_Kalman_filtering <- xts(rep(NA, T_syn), index(mu_true))
colnames(mu_Kalman_filtering) <- "mu-Kalman-filtering"
colnames(gamma_Kalman_filtering) <- "gamma-Kalman-filtering"
gamma_Kalman_smoothing <- mu_Kalman_smoothing <- xts(rep(NA, T_syn), index(mu_true))
colnames(mu_Kalman_smoothing) <- "mu-Kalman-smoothing"
colnames(gamma_Kalman_smoothing) <- "gamma-Kalman-smoothing"

# Kalman parameters
Tt <- diag(2)
Rt <- diag(2)
Qt <- diag(c(1e-3, 1e-2))  # state transition variance very small
Zt <- array(as.vector(t(cbind(1, as.matrix(Y_syn[, 2])))), dim = c(1, 2, T_syn))  # time-varying
Ht <- matrix(1.3e-7)  # observation variance

# the prior in the code: P1cov = kappa*P1Inf + P1, kappa = 1e7
a1 <- matrix(c(0.6, 0.2), 2, 1)
P1 <- 1e-12*diag(2)  # variance of initial point
P1inf <- 0*diag(2)

# create Kalman model
model <- SSModel(as.matrix(Y_syn[, 1]) ~ 
                   0 + SSMcustom(Z=Zt, T=Tt, R=Rt, Q=Qt, a1=a1, P1=P1, P1inf=P1inf), H=Ht)
print(model)
#> Call:
#> SSModel(formula = as.matrix(Y_syn[, 1]) ~ 0 + SSMcustom(Z = Zt, 
#>     T = Tt, R = Rt, Q = Qt, a1 = a1, P1 = P1, P1inf = P1inf), 
#>     H = Ht)
#> 
#> State space model object of class SSModel
#> 
#> Dimensions:
#> [1] Number of time points: 1000
#> [1] Number of time series: 1
#> [1] Number of disturbances: 2
#> [1] Number of states: 2
#> Names of the states:
#> [1]  custom1  custom2
#> Distributions of the time series:
#> [1]  gaussian
#> 
#> Object is a valid object of class SSModel.

# run Kalman filtering and smoothing and get their values
out <- KFS(model)
mu_Kalman_filtering[] <- out$a[-1, 1]        # a is the Kalman filtering estimate (a(T+1) = alphahat(T))
gamma_Kalman_filtering[] <- out$a[-1, 2]
mu_Kalman_smoothing[] <- out$alphahat[, 1]   # alphahat is the Kalman smoothing estimate
gamma_Kalman_smoothing[] <- out$alphahat[, 2]
# let's smooth a bit the Kalman filtering
mu_Kalman_filtering[] <- filter(mu_Kalman_filtering, rep(1, 5)/5, sides = 1)
mu_Kalman_filtering <- na.locf(mu_Kalman_filtering, fromLast = TRUE)
gamma_Kalman_filtering[] <- filter(gamma_Kalman_filtering, rep(1, 5)/5, sides = 1)
gamma_Kalman_filtering <- na.locf(gamma_Kalman_filtering, fromLast = TRUE)
# plots
par(mfrow = c(2, 1))
plot(cbind(mu_true, mu_LS, mu_rolling_LS, mu_Kalman_filtering, mu_Kalman_smoothing), 
     legend.loc = "left", main = "Tracking of mu")
plot(cbind(gamma_true, gamma_LS, gamma_rolling_LS, gamma_Kalman_filtering, gamma_Kalman_smoothing), 
     legend.loc = "topleft", main = "Tracking of gamma", ylim = c(-0.2, 0.9))
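
In the above, the noise variances Qt and Ht were set by hand. Alternatively, they could be estimated by maximum likelihood via KFAS::fitSSM; the sketch below assumes the default update function of fitSSM, which fills in the NA entries of the time-invariant Q and H (the initial values are just rough guesses):

# sketch: estimate the NA entries of Q and H by maximum likelihood
model_ML <- SSModel(as.matrix(Y_syn[, 1]) ~ 
                      0 + SSMcustom(Z=Zt, T=Tt, R=Rt, Q=matrix(c(NA, 0, 0, NA), 2, 2), 
                                    a1=a1, P1=P1, P1inf=P1inf), H=matrix(NA))
fit_ML <- fitSSM(model_ML, inits = log(c(1e-3, 1e-2, 1e-7)), method = "BFGS")
fit_ML$model["Q"]  # ML estimate of the state transition covariance
fit_ML$model["H"]  # ML estimate of the observation variance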

We can look at the centered spreads (i.e., with mean removed) \(z_t = y_{1t} - \gamma_t y_{2t} - \mu_t\) to properly compare the different methods:

# compute the leverage-normalized, centered spread z = (y1 - gamma*y2 - mu)/(1+gamma)
compute_spread <- function(Y, gamma, mu, name = NULL) {
  w_spread <- cbind(1, -gamma)/cbind(1+gamma, 1+gamma)  # normalized portfolio [1, -gamma]/(1+gamma)
  spread <- rowSums(Y * w_spread) - mu/(1+gamma)        # remove the (normalized) equilibrium value
  colnames(spread) <- name
  return(spread)
}

spread_true <- compute_spread(Y_syn, gamma_true, mu_true, "spread-true")
spread_LS <- compute_spread(Y_syn, gamma_LS, mu_LS, "spread-LS")
spread_rolling_LS <- compute_spread(Y_syn, gamma_rolling_LS, mu_rolling_LS, "spread-rolling-LS")
spread_Kalman <- compute_spread(Y_syn, gamma_Kalman_filtering, mu_Kalman_filtering, "spread-Kalman")
# plots
plot(cbind(spread_true, spread_LS, spread_rolling_LS, spread_Kalman), 
     legend.loc = "topright", main = "Spreads")

Clearly, the static LS and even the rolling LS perform really poorly (of course, with real data the parameters may not change as abruptly as in our synthetic data). Let's now plot just the true spread and the one obtained with Kalman filtering:

# plots
plot(cbind(spread_true, spread_Kalman), 
     legend.loc = "topright", main = "Spreads", col = c("black", "blue"))

Final comparison of static LS, rolling LS, and Kalman

We are finally ready to compare the different methods to compute the spreads with real market data. We will use the same market data loaded above for many ETFs.

We first define the functions to estimate \(\mu\) and \(\gamma\) according to the different methods:

estimate_mu_gamma_LS <- function(Y, pct_training = 0.3) {
  T <- nrow(Y)
  T_trn <- round(pct_training*T)
  # LS regression
  ls_coeffs <- coef(lm(Y[1:T_trn, 1] ~ Y[1:T_trn, 2]))
  mu <- xts(rep(ls_coeffs[1], T), index(Y))
  colnames(mu) <- "mu-LS"
  gamma <- xts(rep(ls_coeffs[2], T), index(Y))
  colnames(gamma) <- "gamma-LS"
  return(list(mu = mu, gamma = gamma))
}
estimate_mu_gamma_rolling_LS <- function(Y, pct_training = 0.3) {
  T <- nrow(Y)
  T_start <- round(pct_training*T)
  T_lookback <- 500  # lookback window length
  T_shift <- 10  # how often the fit is refreshed
  # init empty variables
  gamma_rolling_LS <- mu_rolling_LS <- xts(rep(NA, T), index(Y))
  colnames(mu_rolling_LS) <- "mu-rolling-LS"
  colnames(gamma_rolling_LS) <- "gamma-rolling-LS"
  # loop
  t0_update <- seq(from = min(T_start, T_lookback), to = T-T_shift, by = T_shift)
  for (t0 in t0_update) {
    T_lookback_ <- ifelse(t0-T_lookback+1 >= 1, T_lookback, T_start)
    ls_coeffs <- coef(lm(Y[(t0-T_lookback_+1):t0, 1] ~ Y[(t0-T_lookback_+1):t0, 2],
                         weights = last(1:T_lookback, T_lookback_)))
    mu_rolling_LS[t0+1] <- ls_coeffs[1]
    gamma_rolling_LS[t0+1] <- ls_coeffs[2]
  }
  # complete values
  mu_rolling_LS <- na.locf(mu_rolling_LS)
  mu_rolling_LS <- na.locf(mu_rolling_LS, fromLast = TRUE)
  gamma_rolling_LS <- na.locf(gamma_rolling_LS)
  gamma_rolling_LS <- na.locf(gamma_rolling_LS, fromLast = TRUE)
  # smoothing
  L <- 15
  mu_rolling_LS[] <- filter(mu_rolling_LS, rep(1, L)/L, sides = 1)
  mu_rolling_LS <- na.locf(mu_rolling_LS, fromLast = TRUE)
  gamma_rolling_LS[] <- filter(gamma_rolling_LS, rep(1, L)/L, sides = 1)
  gamma_rolling_LS <- na.locf(gamma_rolling_LS, fromLast = TRUE)
  return(list(mu = mu_rolling_LS, gamma = gamma_rolling_LS))
}
estimate_mu_gamma_Kalman <- function(Y) {
  T <- nrow(Y)
  # init empty variables
  gamma_Kalman_filtering <- mu_Kalman_filtering <- xts(rep(NA, T), index(Y))
  colnames(mu_Kalman_filtering) <- "mu-Kalman"
  colnames(gamma_Kalman_filtering) <- "gamma-Kalman"
  # Kalman parameters
  Tt <- diag(2)
  Rt <- diag(2)
  Qt <- 1e-5*diag(2)  # state transition variance very small
  Zt <- array(as.vector(t(cbind(1, as.matrix(Y[, 2])))), dim = c(1, 2, T))  # time-varying
  Ht <- matrix(1e-3)  # observation variance
  # the prior in the code: P1cov = kappa*P1Inf + P1, kappa = 1e7
  init <- estimate_mu_gamma_LS(Y)
  a1 <- matrix(c(init$mu[1], init$gamma[1]), 2, 1)
  P1 <- 1e-5*diag(2)  # variance of initial point
  P1inf <- 0*diag(2)
  # create Kalman model
  model <- SSModel(as.matrix(Y[, 1]) ~ 0 + SSMcustom(Z=Zt, T=Tt, R=Rt, Q=Qt, a1=a1, P1=P1, P1inf=P1inf), H=Ht)
  # run Kalman filtering
  out <- KFS(model)
  mu_Kalman_filtering[] <- out$a[-1, 1]  # a is the Kalman filtering estimate (alphahat would be the smoothing estimate)
  gamma_Kalman_filtering[] <- out$a[-1, 2]
  # smoothing
  L <- 30
  mu_Kalman_filtering[] <- filter(mu_Kalman_filtering, rep(1, L)/L, sides = 1)
  mu_Kalman_filtering <- na.locf(mu_Kalman_filtering, fromLast = TRUE)
  gamma_Kalman_filtering[] <- filter(gamma_Kalman_filtering, rep(1, L)/L, sides = 1)
  gamma_Kalman_filtering <- na.locf(gamma_Kalman_filtering, fromLast = TRUE)
  return(list(mu = mu_Kalman_filtering, gamma = gamma_Kalman_filtering))
}

Let’s choose one pair and period:

Y_ <- Y["2000-08::2003-12", c("EWH", "EWZ")]
if(anyNA(Y_)) 
    Y_ <- na.approx(Y_)
plot(Y_, legend.loc = "bottomleft", main = "Log-prices")

Observe that in the period May 2002 - May 2003 the cointegration seems to have changed.

Now we can estimate the parameters by just invoking the functions above:

LS <- estimate_mu_gamma_LS(Y_)
rolling_LS <- estimate_mu_gamma_rolling_LS(Y_)
Kalman <- estimate_mu_gamma_Kalman(Y_)

We can plot the tracking of the parameters:

# plots
par(mfrow = c(2, 1))
{ plot(cbind(LS$mu, rolling_LS$mu, Kalman$mu), 
       legend.loc = "left", main = "Tracking of mu")
  addEventLines(xts("", index(Y_[round(0.3*nrow(Y_))])), lwd = 2, col = "blue") }
{ plot(cbind(LS$gamma, rolling_LS$gamma, Kalman$gamma), 
       legend.loc = "left", main = "Tracking of gamma")
  addEventLines(xts("", index(Y_[round(0.3*nrow(Y_))])), lwd = 2, col = "blue") }

And we can also plot the spreads:

spread_LS <- compute_spread(Y_, LS$gamma, LS$mu, "LS")
spread_rolling_LS <- compute_spread(Y_, rolling_LS$gamma, rolling_LS$mu, "rolling-LS")
spread_Kalman <- compute_spread(Y_, Kalman$gamma, Kalman$mu, "Kalman")

# plots
plot(cbind(spread_LS, spread_rolling_LS, spread_Kalman), legend.loc = "topleft", main = "Spreads")

We can see that the spread based on static LS does not adapt at the end, when cointegration is lost, whereas the spread based on Kalman looks beautiful.

We are finally ready to trade these spreads. Let’s define a function for that:

library(TTR)

generate_Z_score_EMA <- function(spread, n = 120) {
  ## exponentially weighted moving mean and variance
  # first, the mean
  spread.mean <- EMA(spread, n)
  spread.mean <- na.locf(spread.mean, fromLast = TRUE)
  spread.demeaned <- spread - spread.mean
  # second, the variance
  spread.var <- EMA(spread.demeaned^2, n)
  spread.var <- na.locf(spread.var, fromLast = TRUE)
  # finally compute Z-score
  Z.score <- spread.demeaned/sqrt(spread.var)
  return(Z.score)
}
pairs_trading <- function(Y, gamma, mu, name = NULL, threshold = 0.7, plot = FALSE) {
  # spread and spread portfolio
  w_spread <- cbind(1, -gamma)/cbind(1+gamma, 1+gamma)
  spread <- rowSums(Y * w_spread) - mu/(1+gamma)

  # Z-score
  Z_score <- generate_Z_score_EMA(spread)
  threshold_long <- threshold_short <- Z_score
  threshold_short[] <- threshold
  threshold_long[] <- -threshold
    
  # trading signal
  signal <- generate_signal(Z_score, threshold_long, threshold_short)
  
  # combine the ref portfolio with trading signal
  w_portf <- w_spread * lag(cbind(signal, signal))   # NOTE THE LAG!!
  
  # # fix the portfolio (gamma and mu) during a trade
  # lag_signal <- as.numeric(lag(signal))
  # for (t in 2:nrow(w_portf)) {
  #   if (lag_signal[t] != 0 && lag_signal[t] == lag_signal[t-1])
  #     w_portf[t, ] <- w_portf[t-1, ]
  # }

  # now compute the PnL from the log-prices and the portfolio
  X <- diff(Y)  #compute log-returns from log-prices
  portf_return <- xts(rowSums(X * w_portf), index(X))
  portf_return[is.na(portf_return)] <- 0
  colnames(portf_return) <- name
  
  # plots
  if (plot) {
    tmp <- cbind(Z_score, signal)
    colnames(tmp) <- c("Z-score", "signal")
    par(mfrow = c(2, 1))
    { plot(tmp, legend.loc = "topleft",
           main = paste("Z-score and trading on spread based on", name))
      lines(threshold_short, lty = 2)
      print(lines(threshold_long, lty = 2)) }
    print(plot(cumprod(1 + portf_return), main = paste("Cum P&L for spread based on", name)))
  }

  return(portf_return)
}
return_LS <- pairs_trading(Y_["::2003-03"], LS$gamma["::2003-03"], LS$mu["::2003-03"], 
                           "LS", plot = TRUE)

return_rolling_LS <- pairs_trading(Y_["::2003-03"], rolling_LS$gamma["::2003-03"], rolling_LS$mu["::2003-03"], 
                                   "rolling-LS", plot = TRUE)

return_Kalman <- pairs_trading(Y_["::2003-03"], Kalman$gamma["::2003-03"], Kalman$mu["::2003-03"], 
                               "Kalman", plot = TRUE)

Let’s compare the P&L of the methods:

# plot
plot(cumprod(1 + cbind(return_LS, return_rolling_LS, return_Kalman)["::2003-03"]), 
     main = "Cum P&L", legend.loc = "topleft")
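
As a complement to the visual comparison, the performance could be tabulated (a sketch with PerformanceAnalytics, loaded earlier):

# sketch: annualized performance summary of the three methods
ret_all <- cbind(return_LS, return_rolling_LS, return_Kalman)["::2003-03"]
table.AnnualizedReturns(ret_all)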

We can conclude with the following points:

  • financial data is clearly nonstationary, so time-varying approaches are necessary
  • cointegration can be detected, but its persistence is not guaranteed (hence the need for exit strategies)
  • Kalman provides a nice theoretical and practical framework that clearly outperforms the traditional LS regression.

  1. Null hypothesis testing is a reductio ad absurdum argument adapted to statistics. In essence, a claim is assumed valid if its counter-claim is improbable. A result is said to be statistically significant if it allows us to reject the null hypothesis. In particular, the \(p\)-value is the probability of observing data at least as extreme as the current data under the null hypothesis (i.e., under the assumption of a unit root). If the \(p\)-value is small, we can reject the null hypothesis, implying that the correct hypothesis lies in the logical complement of the null hypothesis (i.e., stationarity holds). Summary: we want a small \(p\)-value when testing for cointegration.↩︎

  2. Note that one could also swap \(y_{1t}\) and \(y_{2t}\) in the regression formulation.↩︎