This R session will illustrate pairs trading or statistical arbitrage.
(Useful R links: Cookbook R, Quick-R, R documentation, CRAN, METACRAN.)
There is a very convenient package, egcm, that computes the Engle-Granger cointegration test and many other unit-root tests. In particular, given two series \(x(t)\) and \(y(t)\), it searches for parameters \(\alpha\), \(\beta\), and \(\rho\) such that
\[\begin{aligned} y(t) &= \alpha + \beta x(t) + r(t)\\ r(t) &= \rho r(t-1) + \epsilon(t) \end{aligned}\] where \(r(t)\) is the residual and \(\epsilon(t)\) the innovation. If \(|\rho|<1\), then \(x(t)\) and \(y(t)\) are cointegrated (i.e., \(r(t)\) doesn’t contain a unit root). Of course, the difficulty lies in assessing exactly how much smaller than 1 the value of \(|\rho|\) has to be; this is the task of the unit-root test. The package urca contains more general models for cointegration. Perhaps the most famous example of a unit-root test is the Augmented Dickey-Fuller (ADF) test, whose null hypothesis is that a unit root is present and whose alternative hypothesis is that the series is stationary (so a small \(p\)-value is evidence of stationarity).1
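As a quick sanity check of this interpretation, one can run the ADF test on a pure random walk and on white noise (a minimal sketch; it assumes the package tseries, which is not otherwise used in this session):
library(tseries)
set.seed(42)
adf.test(cumsum(rnorm(500)))  # random walk: large p-value, cannot reject the unit root
adf.test(rnorm(500))          # white noise: small p-value, reject the unit root (stationary)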
Another interesting quantity to measure mean-reversion is the half-life, defined as the time it takes the spread to revert half the distance back to its mean after having diverged from it.
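For an AR(1) residual \(r(t) = \rho r(t-1) + \epsilon(t)\), a deviation from the mean decays geometrically as \(\rho^k\), so the half-life is \(\log(0.5)/\log(|\rho|)\). A minimal sketch of this computation (the fitted \(\rho\) is the value reported by egcm below):
# half-life of an AR(1) residual: a deviation halves after log(0.5)/log(|rho|) periods
half_life <- function(rho) log(0.5) / log(abs(rho))
half_life(0.0139)  # ~0.16 days, matching the "Half life" reported by egcm for SPY/IVV below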
We now use the package egcm to check the cointegration of several illustrative examples:
SPY and IVV are both ETFs that track the S&P 500, so obviously they are cointegrated:
library(quantmod)  # for getSymbols() and Ad()
library(egcm)
SPY_prices <- Ad(getSymbols("SPY", from = "2013-01-01", to = "2013-12-31", auto.assign = FALSE))
IVV_prices <- Ad(getSymbols("IVV", from = "2013-01-01", to = "2013-12-31", auto.assign = FALSE))
plot(cbind(SPY_prices, IVV_prices), legend.loc = "topleft", main = "ETF prices")
res <- egcm(SPY_prices, IVV_prices)
summary(res)
#> IVV.Adjusted[i] = 1.0077 SPY.Adjusted[i] - 0.5419 + R[i], R[i] = 0.0139 R[i-1] + eps[i], eps ~ N(0, 0.0480^2)
#> (0.0003) (0.0529) (0.0624)
#>
#> R[2013-12-30] = -0.0281 (t = -0.577)
#>
#> Unit Root Tests of Residuals
#> Statistic p-value
#> Augmented Dickey Fuller (ADF) -6.042 0.00010
#> Phillips-Perron (PP) -261.890 0.00010
#> Pantula, Gonzales-Farias and Fuller (PGFF) 0.007 0.00010
#> Elliott, Rothenberg and Stock DF-GLS (ERSD) -1.259 0.45122
#> Johansen's Trace Test (JOT) -103.605 0.00010
#> Schmidt and Phillips Rho (SPR) -172.446 0.00010
#>
#> Variances
#> SD(diff(SPY.Adjusted)) = 1.004723
#> SD(diff(IVV.Adjusted)) = 1.011531
#> SD(diff(residuals)) = 0.068125
#> SD(residuals) = 0.048626
#> SD(innovations) = 0.047976
#>
#> Half life = 0.162012
#> R[last] = -0.028071 (t=-0.58)
plot(res)
GLD and IAU both track the price of gold and they are cointegrated:
GLD_prices <- Ad(getSymbols("GLD", from = "2013-01-01", to = "2013-12-31", auto.assign = FALSE))
IAU_prices <- Ad(getSymbols("IAU", from = "2013-01-01", to = "2013-12-31", auto.assign = FALSE))
plot(cbind(GLD_prices, IAU_prices), legend.loc = "topright", main = "ETF prices")
plot(cbind(GLD_prices/as.numeric(GLD_prices[1]), IAU_prices/as.numeric(IAU_prices[1])),
legend.loc = "topright", main = "ETF normalized prices")
res <- egcm(GLD_prices, IAU_prices, log = TRUE)
print(res)
#> IAU.Adjusted[i] = 0.9967 GLD.Adjusted[i] - 2.2813 + R[i], R[i] = 0.2504 R[i-1] + eps[i], eps ~ N(0, 0.0005^2)
#> (0.0003) (0.0015) (0.0614)
#>
#> R[2013-12-30] = -0.0004 (t = -0.722)
plot(res)
Coca-Cola and Pepsi are often mentioned as an example of a pair of securities for which pairs trading may be fruitful. However, at least in 2013, they were not cointegrated.
KO_prices <- Ad(getSymbols("KO", from = "2013-01-01", to = "2013-12-31", auto.assign = FALSE))
PEP_prices <- Ad(getSymbols("PEP", from = "2013-01-01", to = "2013-12-31", auto.assign = FALSE))
# check cointegration of prices
plot(cbind(KO_prices/as.numeric(KO_prices[1]), PEP_prices/as.numeric(PEP_prices[1])),
     legend.loc = "topleft", main = "Normalized prices")
print(egcm(KO_prices, PEP_prices))
#> PEP.Adjusted[i] = 2.1843 KO.Adjusted[i] - 3.6323 + R[i], R[i] = 0.9922 R[i-1] + eps[i], eps ~ N(0, 0.5935^2)
#> (0.1237) (4.1252) (0.0141)
#>
#> R[2013-12-30] = -0.5379 (t = -0.204)
#>
#> WARNING: KO.Adjusted and PEP.Adjusted do not appear to be cointegrated.
# check cointegration of log-prices
plot(log(cbind(KO_prices, PEP_prices)), legend.loc = "topleft", main = "Log-prices")
res <- egcm(KO_prices, PEP_prices, log = TRUE)
print(res)
#> PEP.Adjusted[i] = 1.0941 KO.Adjusted[i] + 0.3996 + R[i], R[i] = 0.9901 R[i-1] + eps[i], eps ~ N(0, 0.0088^2)
#> (0.0599) (0.2098) (0.0143)
#>
#> R[2013-12-30] = -0.0088 (t = -0.228)
#>
#> WARNING: KO.Adjusted and PEP.Adjusted do not appear to be cointegrated.
plot(res)
Ford and GM seemed to be even more tightly linked. Yet, the degree of linkage was not high enough to pass the cointegration test.
F_prices <- Ad(getSymbols("F", from = "2013-01-01", to = "2013-12-31", auto.assign = FALSE))
GM_prices <- Ad(getSymbols("GM", from = "2013-01-01", to = "2013-12-31", auto.assign = FALSE))
plot(log(cbind(F_prices, GM_prices)), legend.loc = "topleft", main = "Log-prices")
res <- egcm(F_prices, GM_prices, log = TRUE)
print(res)
#> GM.Adjusted[i] = 0.9117 F.Adjusted[i] + 1.0328 + R[i], R[i] = 1.0000 R[i-1] + eps[i], eps ~ N(0, 0.0118^2)
#> (0.0260) (0.0650) (0.0155)
#>
#> R[2013-12-30] = 0.1900 (t = 3.830)
#>
#> WARNING: F.Adjusted and GM.Adjusted do not appear to be cointegrated.
plot(res)
Let \(y_{1t}\) and \(y_{2t}\) denote the log-prices of two stocks. If they are cointegrated, then the spread \(z_{t}=y_{1t}-\gamma y_{2t}\) is stationary and can be written as2 \[z_{t}=y_{1t}-\gamma y_{2t}=\mu+\epsilon_{t}\] where \(\mu\) represents the equilibrium value and \(\epsilon_{t}\) is a zero-mean residual. Equivalently, it can be written as \[y_{1t}=\mu+\gamma y_{2t}+\epsilon_{t}\] which now has the typical form for linear regression.
The Least Squares (LS) regression over \(T\) observations is \[ \underset{\mu,\gamma}{\textsf{minimize}} \quad \sum_{t=1}^{T}\left(y_{1t}-\left(\mu+\gamma y_{2t}\right)\right)^{2}. \] More compactly, by stacking the \(T\) observations in the vectors \(\mathbf{y}_{1}\) and \(\mathbf{y}_{2}\), we can write: \[ \underset{\mu,\gamma}{\textsf{minimize}} \quad \left\Vert \mathbf{y}_{1}-\left(\mu\mathbf{1}+\gamma\mathbf{y}_{2}\right)\right\Vert ^{2}. \]
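For this simple regression the LS solution has a closed form, \(\hat{\gamma} = \textsf{cov}(y_{1},y_{2})/\textsf{var}(y_{2})\) and \(\hat{\mu} = \bar{y}_{1}-\hat{\gamma}\bar{y}_{2}\). A minimal sketch (the helper ls_fit is defined here just for illustration; it is equivalent to the coef(lm()) call used later):
# closed-form LS fit of y1 = mu + gamma*y2 + eps (equivalent to coef(lm(y1 ~ y2)))
ls_fit <- function(y1, y2) {
  y1 <- as.numeric(y1)
  y2 <- as.numeric(y2)
  gamma <- cov(y1, y2) / var(y2)
  mu <- mean(y1) - gamma * mean(y2)
  c(mu = mu, gamma = gamma)
}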
We start by loading some stock prices for our tests:
library(xts)
library(quantmod)
# set begin-end dates and stock namelist
begin_date <- "2000-07-01"
end_date <- "2013-12-31"
stock_namelist <- c("0388.HK", "0688.HK")
# download data from YahooFinance
prices <- xts()
for (stock_index in 1:length(stock_namelist))
prices <- cbind(prices, Ad(getSymbols(stock_namelist[stock_index],
from = begin_date, to = end_date, auto.assign = FALSE)))
colnames(prices) <- stock_namelist
indexClass(prices) <- "Date"
T <- nrow(prices) # number of days
# interpolate NAs
anyNA(prices)
#> [1] TRUE
prices <- na.approx(prices)
# some visual exploration
str(prices)
#> An 'xts' object on 2000-07-03/2013-12-31 containing:
#> Data: num [1:3399, 1:2] 6.49 6.6 6.54 6.21 6.46 ...
#> - attr(*, "dimnames")=List of 2
#> ..$ : NULL
#> ..$ : chr [1:2] "0388.HK" "0688.HK"
#> Indexed by objects of class: [Date] TZ: UTC
#> xts Attributes:
#> NULL
head(prices)
#> 0388.HK 0688.HK
#> 2000-07-03 6.485238 0.446220
#> 2000-07-04 6.596574 0.452011
#> 2000-07-05 6.540904 0.475193
#> 2000-07-06 6.206903 0.463605
#> 2000-07-07 6.457407 0.463605
#> 2000-07-10 6.735743 0.480990
tail(prices)
#> 0388.HK 0688.HK
#> 2013-12-20 113.5739 18.70984
#> 2013-12-23 113.8391 18.45640
#> 2013-12-24 115.6951 18.49871
#> 2013-12-27 115.4300 18.41417
#> 2013-12-30 114.8997 18.24527
#> 2013-12-31 114.2810 18.41417
plot(prices, legend.loc = "topleft", main = "Stock prices")
Observe that from a visual inspection of the prices it is not clear at all that the two stocks could be cointegrated. However, once we plot the log-prices, the picture becomes much clearer:
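We compute the log-prices (also used in the regression below) and plot them:
logprices <- log(prices)
plot(logprices, legend.loc = "topleft", main = "Stock log-prices")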
First of all, let’s try to estimate \(\gamma\) and \(\mu\) from a training period and then use them in the test period:
T_trn <- round(0.7*T) # define the training set
T_tst <- T - T_trn
y1 <- logprices[, 1]
y2 <- logprices[, 2]
# do LS regression
ls_coeffs <- coef(lm(y1[1:T_trn] ~ y2[1:T_trn]))
ls_coeffs
#> (Intercept) y2[1:T_trn]
#> 2.435953 0.864618
mu <- ls_coeffs[1]
gamma <- ls_coeffs[2]
We can plot the log-prices of one stock and its regression based on the other:
tmp <- cbind(y1, mu + gamma*y2)
colnames(tmp) <- c(colnames(y1), paste("mu + gamma x", colnames(y2)))
{ plot(tmp, legend.loc = "topleft", main = "Regression of y1 from y2")
addEventLines(xts("training", index(y1[T_trn])), srt = 90, pos = 2, lwd = 2, col = "blue") }
More clearly, we can plot the spread to observe its mean-reversion property:
# spread
spread <- y1 - gamma*y2
{ plot(spread, main = "Spread")
addEventLines(xts("", index(y1[T_trn])), lwd = 2, col = "blue") }
Indeed, the spread seems to be mean-reverting. However, one can observe that the mean reversion is much stronger during the training period than during the test period. This is a bit dangerous: it is one thing for the spread to be mean-reverting during some period, and quite another for the mean reversion to be persistent enough over subsequent periods. If it is not persistent enough, then trading the spread is very dangerous. Indeed, the author of the package egcm wrote the following paper exploring precisely the persistence of cointegration in the markets:
Clegg, Matthew, On the Persistence of Cointegration in Pairs Trading (January 28, 2014). Available at SSRN: https://ssrn.com/abstract=2491201 or http://dx.doi.org/10.2139/ssrn.2491201
We can now run a cointegration test:
library(egcm)
#coint_result <- egcm(y1, y2, p.value = 0.01)
coint_result <- egcm(y1, y2, p.value = 0.05)
print(coint_result)
#> 0688.HK[i] = 1.1333 0388.HK[i] - 2.7359 + R[i], R[i] = 0.9937 R[i-1] + eps[i], eps ~ N(0, 0.0287^2)
#> (0.0032) (0.0187) (0.0022)
#>
#> R[2013-12-31] = 0.2787 (t = 1.234)
#>
#> WARNING: The series seem cointegrated but the residuals are not AR(1).
plot(coint_result)
The spread formed as \(z_{t}=y_{1t}-\gamma y_{2t}\) is effectively equivalent to using the portfolio \(\mathbf{w} = \left[\begin{array}{c} 1\\ -\gamma \end{array}\right]\). In order to make fair comparisons, we will fix the leverage of the portfolio so that \(\| \mathbf{w}\|_1 = 1\). The normalized portfolio is then \(\mathbf{w} = \left[\begin{array}{c} 1/(1+\gamma)\\ -\gamma/(1+\gamma) \end{array}\right]\).
Let’s form a more realistic spread based on the normalized portfolio:
# normalized portfolio
w_ref <- c(1, -gamma)/(1+gamma)
sum(abs(w_ref))
#> [1] 1
w_spread <- matrix(w_ref, T, 2, byrow = TRUE)
w_spread <- xts(w_spread, index(y1))
colnames(w_spread) <- c("w1", "w2")
# resulting normalized spread
spread <- rowSums(cbind(y1, y2) * w_spread)
spread <- xts(spread, index(y1))
colnames(spread) <- "spread"
{ plot(spread, main = "Spread (from normalized portfolio)")
addEventLines(xts("", index(y1[T_trn])), lwd = 2, col = "blue") }
The Z-score is a normalized version of the spread, which is useful to generate the trading signal. In particular, it is defined as \[Z^{\textsf{score}}_t = \frac{z_t-\textsf{E}[z_t]}{\textsf{Std}[z_t]}.\] Now let’s generate the Z-score:
spread_mean <- mean(spread[1:T_trn], na.rm = TRUE)
spread_var <- as.numeric(var(spread[1:T_trn], na.rm = TRUE))
Z_score <- (spread-spread_mean)/sqrt(spread_var)
colnames(Z_score) <- "Z-score"
threshold_long <- threshold_short <- Z_score
threshold_short[] <- .7
threshold_long[] <- -.7
{ plot(Z_score, main = "Z-score")
lines(threshold_short, lty = 2)
lines(threshold_long, lty = 2)
addEventLines(xts("", index(y1[T_trn])), lwd = 2, col = "blue") }
We are ready to generate the trading signal (0: no position, 1: long position, -1: short position):
# we define a function for convenience and future use
generate_signal <- function(Z_score, threshold_long, threshold_short) {
signal <- Z_score
colnames(signal) <- "signal"
signal[] <- NA
#initial position
signal[1] <- 0
if (Z_score[1] <= threshold_long[1]) {
signal[1] <- 1
} else if (Z_score[1] >= threshold_short[1])
signal[1] <- -1
# loop
for (t in 2:nrow(Z_score)) {
if (signal[t-1] == 0) { #if we were in no position
if (Z_score[t] <= threshold_long[t]) {
signal[t] <- 1
} else if(Z_score[t] >= threshold_short[t]) {
signal[t] <- -1
} else signal[t] <- 0
} else if (signal[t-1] == 1) { #if we were in a long position
if (Z_score[t] >= 0) signal[t] <- 0
else signal[t] <- signal[t-1]
} else { #if we were in a short position
if (Z_score[t] <= 0) signal[t] <- 0
else signal[t] <- signal[t-1]
}
}
return(signal)
}
# now just invoke the function
signal <- generate_signal(Z_score, threshold_long, threshold_short)
{ plot(cbind(Z_score, signal), main = "Z-score and trading signal", legend.loc = "topleft")
addEventLines(xts("", index(y1[T_trn])), lwd = 2, col = "blue") }
Finally, we can compute the return and P&L of the strategy. A simple way of doing this is directly from the spread and the trading signal:
# let's compute the PnL directly from the signal and spread
spread_return <- diff(spread)
traded_return <- spread_return * lag(signal) # NOTE THE LAG!!
traded_return[is.na(traded_return)] <- 0
colnames(traded_return) <- "traded spread"
{ plot(traded_return, main = "Return of traded spread")
addEventLines(xts("", index(y1[T_trn])), lwd = 2, col = "blue") }
We have two different ways to compute the wealth or cumulative P&L:
1 + cumsum(traded_return)
cumprod(1 + traded_return)
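The first expression assumes the gains are not reinvested (the per-period P&L simply adds up), whereas the second compounds the returns. A toy example with a +10% day followed by a -10% day illustrates the difference:
r <- c(0.10, -0.10)
1 + cumsum(r)    # 1.10 1.00 (no reinvestment)
cumprod(1 + r)   # 1.10 0.99 (with reinvestment)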
{ plot(1 + cumsum(traded_return), main = "Cum P&L of traded spread (no reinvestment)")
addEventLines(xts("", index(y1[T_trn])), lwd = 2, col = "blue") }
{ plot(cumprod(1 + traded_return), main = "Cum P&L of traded spread (w/ reinvestment)")
addEventLines(xts("", index(y1[T_trn])), lwd = 2, col = "blue") }
We can observe that the wealth increases steadily during the training period but flattens out during the test or out-of-sample period (which is the one that really counts!). So in practice, this wouldn’t have made a good return.
Recall that the same plots can be obtained with the package PerformanceAnalytics (the argument geometric
determines whether it is compounded or not):
library(PerformanceAnalytics)
{ chart.CumReturns(traded_return, main = "Cum P&L of traded spread (w/ reinvestment)",
geometric = TRUE, wealth.index = TRUE)
addEventLines(xts("", index(y1[T_trn])), lwd = 2, col = "blue") }
And it is very instructive to plot the drawdown:
{ chart.Drawdown(traded_return, main = "Drawdown", lwd = 1)
addEventLines(xts("", index(y1[T_trn])), lwd = 2, col = "blue") }
Let’s zoom in during the period 2000-2006 to observe the connection between the spread and the cumulative P&L:
par(mfrow = c(2, 1))
plot(cbind(Z_score, signal)["2000::2006"], main = "Z-score and trading signal",
legend.loc = "bottomleft")
plot(1 + cumsum(traded_return)["2000::2006"], main = "Cum P&L of traded spread")
A more systematic and fool-proof way to compute the return and P&L of the strategy is via the equivalent portfolio. In fact, the previous way of calculating the P&L directly from the spread can lead to wrong results if \(\gamma\) or \(\mu\) change over time. If that happens, then the spread may look very nice, but it is not realistic, because once we are in a trade we shouldn’t change \(\gamma\) or \(\mu\). Let’s now use the equivalent portfolio approach.
# combine the ref portfolio with trading signal
w_portf <- w_spread * matrix(lag(signal), T, 2) # NOTE THE LAG!!
# now compute the PnL from the log-prices and the portfolio
X <- diff(cbind(y1, y2)) #compute log-returns from log-prices
portf_return <- xts(rowSums(X * w_portf), index(X))
portf_return[is.na(portf_return)] <- 0
colnames(portf_return) <- "portfolio"
par(mfrow = c(2, 1))
{ plot(cbind(Z_score, signal), main = "Z-score and trading signal", legend.loc = "bottomleft")
lines(threshold_short, lty = 2)
lines(threshold_long, lty = 2)
addEventLines(xts("", index(y1[T_trn])), lwd = 2, col = "blue") }
{ plot(cumprod(1 + portf_return), main = "Cum P&L of traded spread")
addEventLines(xts("", index(y1[T_trn])), lwd = 2, col = "blue") }
Let’s load data from several ETFs:
# set begin-end dates and stock namelist
begin_date <- "2000-01-01"
end_date <- "2005-12-31" #"2018-04-30"
stock_namelist <- c("EWA", "EWK", "EWO", "EWC", "EWQ", "EWG", "EWH", "EWI", "EWJ", "EWM", "EWW",
"EWN", "EWS", "EWP", "EWD", "EWL", "EWY", "EZU", "EWU", "EWZ", "EWT", "SPY")
# download data from YahooFinance
prices <- xts()
for (stock_index in 1:length(stock_namelist))
prices <- cbind(prices, Ad(getSymbols(stock_namelist[stock_index],
from = begin_date, to = end_date, auto.assign = FALSE)))
colnames(prices) <- stock_namelist
indexClass(prices) <- "Date"
Y <- log(prices)
T <- nrow(prices) # number of days
In order to compare conveniently different pairs, we define a function that computes the spread, trades it, and plots:
analyze_spread_LS <- function(Y, pct_training = 0.7, threshold = 0.7, legend_loc = "topleft") {
if(anyNA(Y))
Y <- na.approx(Y)
T <- nrow(Y)
T_trn <- round(pct_training*T)
# LS regression
gamma <- coef(lm(Y[1:T_trn, 1] ~ Y[1:T_trn, 2]))[2]
# spread portfolio
w_spread <- matrix(c(1, -gamma)/(1+gamma), T, 2, byrow = TRUE)
w_spread <- xts(w_spread, index(Y))
colnames(w_spread) <- c("w1", "w2")
# spread
spread <- rowSums(Y * w_spread)
spread <- xts(spread, index(Y))
colnames(spread) <- paste0(colnames(Y)[1], "-", colnames(Y)[2])
# Z-score
spread_mean <- mean(spread[1:T_trn])
spread_var <- as.numeric(var(spread[1:T_trn]))
Z_score <- (spread-spread_mean)/sqrt(spread_var)
threshold_long <- threshold_short <- Z_score
threshold_short[] <- threshold
threshold_long[] <- -threshold
# trading signal
signal <- generate_signal(Z_score, threshold_long, threshold_short)
# combine the ref portfolio with trading signal
w_portf <- w_spread * matrix(lag(signal), T, 2) # NOTE THE LAG!!
# now compute the PnL from the log-prices and the portfolio
X <- diff(Y) #compute log-returns from log-prices
portf_return <- xts(rowSums(X * w_portf), index(X))
portf_return[is.na(portf_return)] <- 0
# plots
tmp <- cbind(Z_score, signal)
colnames(tmp) <- c("Z-score", "signal")
par(mfrow = c(2, 1))
{ plot(tmp, legend.loc = legend_loc,
main = paste("Z-score and trading signal for", colnames(Y)[1], "vs", colnames(Y)[2]))
lines(threshold_short, lty = 2)
lines(threshold_long, lty = 2)
print(addEventLines(xts("", index(Y[T_trn])), lwd = 2, col = "blue")) }
{ plot(cumprod(1 + portf_return), main = "Cum P&L")
print(addEventLines(xts("", index(Y[T_trn])), lwd = 2, col = "blue")) }
## more plots
#print(plot(cumprod(1 + portf_return[T_trn:T]), main = "Cum P&L"))
#print(chart.Drawdown(portf_return[T_trn:T], main = "Drawdown", lwd = 1))
par(mfrow = c(1, 1))
}
Now we are ready to explore different pairs during different periods:
# a good cointegrated case
analyze_spread_LS(Y["2000-08::2002-01", c("EWH", "EWZ")], pct_training = 0.55, legend_loc = "bottomleft")
# a case where cointegration is lost
analyze_spread_LS(Y["2000-07::2001", c("EWY", "EWT")], pct_training = 0.55)
In practice, the parameters \(\gamma\) and \(\mu\) may be time-varying and the spread becomes \[z_{t} = y_{1t}-\gamma_t y_{2t} = \mu_t+\epsilon_{t}\] where \(\mu_t\) is the slowly time-varying equilibrium value and \(\gamma_t\) is the slowly time-varying cointegration factor.
In this section, we will generate synthetic data so that we can conveniently explore the tracking behavior of the different alternatives and later on we will use real market data.
The synthetic log-price data follows \[\begin{aligned} y_{1t} &= \mu_t + \gamma_t x_{t} + \epsilon_{1t}\\ y_{2t} &= x_{t} + \epsilon_{2t} \end{aligned}\] where \(\epsilon_{1t}\), \(\epsilon_{2t}\) are AR(1) residual terms and \(x_t\) is a stochastic common trend defined as a random walk: \[x_{t}=x_{t-1}+w_{t}\] with \(w_{t}\) an i.i.d. residual term.
library(xts)
library(MASS)
library(MTS)
set.seed(473)
T_syn <- 1000
# create mu and gamma
#mu_true <- xts(rep(0.5, T_syn), order.by = as.Date("2010-01-01") + 0:(T_syn-1))
mu_true <- xts(c(rep(0.6, T_syn/4), rep(0.4, T_syn/2), rep(0.6, T_syn/4)),
order.by = as.Date("2010-01-01") + 0:(T_syn-1))
colnames(mu_true) <- "mu-true"
#gamma_true <- xts(rep(0.4, T_syn), index(mu_true))
gamma_true <- xts(c(rep(0.2, T_syn/2), rep(0.8, T_syn/2)), index(mu_true))
colnames(gamma_true) <- "gamma-true"
# smooth parameters
mu_true[] <- filter(mu_true, rep(1, 50)/50, sides = 1)
mu_true[] <- filter(mu_true, rep(1, 50)/50, sides = 1)
mu_true <- na.locf(mu_true, fromLast = TRUE)
gamma_true[] <- filter(gamma_true, rep(1, 50)/50, sides = 1)
gamma_true[] <- filter(gamma_true, rep(1, 50)/50, sides = 1)
gamma_true <- na.locf(gamma_true, fromLast = TRUE)
plot(cbind(mu_true, gamma_true), col = c("blue", "darkgreen"),
legend.loc = "topleft", main = "True values")
# create common trend as random walk
# (with ann volatility 50% or daily volatility = 50%/sqrt(252) = 0.03149)
trend_sd <- 0.2*0.03149
trend <- cumsum(MASS::mvrnorm(T_syn, 0, trend_sd^2))  # random walk with Gaussian innovations
# create VAR(1) residual
# (with ann volatility 40% or daily volatility = 40%/sqrt(252) = 0.0252)
res_sd <- 0.1*0.0252
residual <- MTS::VARMAsim(T_syn, arlags = 1, phi = 0.9*diag(2), sigma = res_sd^2*diag(2))$series
# create the synthetic log-prices
Y_syn <- cbind(mu_true, 0) + cbind(gamma_true, 1)*trend + residual
colnames(Y_syn) <- c("Y1 = mu + gamma*X + W1", "Y2 = X + W2")
We can now proceed to the estimation of \(\mu_t\) and \(\gamma_t\). We will start with the previous vanilla LS regression, then we will consider a rolling LS to capture the time variation, and finally we will explore Kalman filtering for optimal tracking.
Let’s start by using the previous vanilla LS regression to estimate a constant \(\mu\) and \(\gamma\):
# LS regression
ls_coeffs <- coef(lm(Y_syn[1:(0.3*T_syn), 1] ~ Y_syn[1:(0.3*T_syn), 2]))
mu_LS <- mu_true
mu_LS[] <- ls_coeffs[1]
colnames(mu_LS) <- "mu-LS"
gamma_LS <- gamma_true
gamma_LS[] <- ls_coeffs[2]
colnames(gamma_LS) <- "gamma-LS"
# plots
par(mfrow = c(2, 1))
plot(cbind(mu_true, mu_LS), legend.loc = "bottomleft", main = "LS estimation of mu")
plot(cbind(gamma_true, gamma_LS), legend.loc = "left", main = "LS estimation of gamma")
As we can see, the estimation is quite bad and we need some time-varying estimate.
The rolling LS is simply an LS fit performed on a rolling window basis, defined by the window length T_lookback: \[
\underset{\mu_{t_0},\gamma_{t_0}}{\textsf{minimize}} \quad
\sum_{t=t_0-T_\textsf{lookback}+1}^{t_0}\left(y_{1t}-\left(\mu_{t_0}+\gamma_{t_0} y_{2t}\right)\right)^{2}.
\]
# rolling LS
T_lookback <- 400 # lookback window length
T_shift <- 10 # how often is refreshed
# init empty variables
mu_rolling_LS <- xts(rep(NA, T_syn), index(mu_true))
colnames(mu_rolling_LS) <- "mu-rolling-LS"
gamma_rolling_LS <- xts(rep(NA, T_syn), index(mu_true))
colnames(gamma_rolling_LS) <- "gamma-rolling-LS"
# loop
t0_update <- seq(from = T_lookback, to = T_syn-T_shift, by = T_shift)
t0_update
#> [1] 400 410 420 430 440 450 460 470 480 490 500 510 520 530 540 550 560 570 580 590 600 610 620 630 640 650 660 670 680
#> [30] 690 700 710 720 730 740 750 760 770 780 790 800 810 820 830 840 850 860 870 880 890 900 910 920 930 940 950 960 970
#> [59] 980 990
for (t0 in t0_update) {
ls_coeffs <- coef(lm(Y_syn[(t0-T_lookback+1):t0, 1] ~ Y_syn[(t0-T_lookback+1):t0, 2],
weights = 2^(1:T_lookback)))
mu_rolling_LS[t0+1] <- ls_coeffs[1]
gamma_rolling_LS[t0+1] <- ls_coeffs[2]
}
# complete values and smooth
plot(mu_rolling_LS, type = "o", pch = 16)
mu_rolling_LS[] <- filter(mu_rolling_LS, rep(1, T_lookback/4)/(T_lookback/4), sides = 1)
mu_rolling_LS <- na.locf(mu_rolling_LS, fromLast = TRUE)
plot(mu_rolling_LS)
gamma_rolling_LS <- na.locf(gamma_rolling_LS)
gamma_rolling_LS <- na.locf(gamma_rolling_LS, fromLast = TRUE)
gamma_rolling_LS[] <- filter(gamma_rolling_LS, rep(1, T_lookback/4)/(T_lookback/4), sides = 1)
gamma_rolling_LS <- na.locf(gamma_rolling_LS, fromLast = TRUE)
# plots
par(mfrow = c(2, 1))
plot(cbind(mu_true, mu_LS, mu_rolling_LS), legend.loc = "left", main = "Tracking of mu")
plot(cbind(gamma_true, gamma_LS, gamma_rolling_LS), legend.loc = "topleft", main = "Tracking of gamma")
The rolling LS can adapt to the changes, but it requires some time to converge depending on the lookback window length. Of course, if the lookback window length is too short, then the estimation will be too noisy. So it is difficult to find a good tradeoff between adaptability and estimation error.
The Kalman formulation of the time-varying fit, using the notation of the package KFAS, is given by the following state transition equation and observation equation: \[\begin{aligned} \boldsymbol{\alpha}_{t+1} &= \mathbf{T}_t\boldsymbol{\alpha}_{t} + \mathbf{R}_t\boldsymbol{\eta}_{t}\\ y_{1t} &= \mathbf{Z}_{t}\boldsymbol{\alpha}_{t}+\epsilon_{t} \end{aligned}\] where the hidden state is \(\boldsymbol{\alpha}_{t} = \left[\begin{array}{c} \mu_t\\ \gamma_t \end{array}\right]\), the observation matrix is \(\mathbf{Z}_{t}=\left[1\;\; y_{2t}\right]\), \(\mathbf{T}_t=\mathbf{R}_t=\mathbf{I}\), and \(\boldsymbol{\eta}_{t}\) and \(\epsilon_{t}\) are the state and observation noise terms, respectively.
library(KFAS) #install.packages("KFAS")
# init empty variables
gamma_Kalman_filtering <- mu_Kalman_filtering <- xts(rep(NA, T_syn), index(mu_true))
colnames(mu_Kalman_filtering) <- "mu-Kalman-filtering"
colnames(gamma_Kalman_filtering) <- "gamma-Kalman-filtering"
gamma_Kalman_smoothing <- mu_Kalman_smoothing <- xts(rep(NA, T_syn), index(mu_true))
colnames(mu_Kalman_smoothing) <- "mu-Kalman-smoothing"
colnames(gamma_Kalman_smoothing) <- "gamma-Kalman-smoothing"
# Kalman parameters
Tt <- diag(2)
Rt <- diag(2)
Qt <- diag(c(1e-3, 1e-2)) # state transition variance very small
Zt <- array(as.vector(t(cbind(1, as.matrix(Y_syn[, 2])))), dim = c(1, 2, T_syn)) # time-varying
Ht <- matrix(1.3e-7) # observation variance
# the prior in the code: P1cov = kappa*P1Inf + P1, kappa = 1e7
a1 <- matrix(c(0.6, 0.2), 2, 1)
P1 <- 1e-12*diag(2) # variance of initial point
P1inf <- 0*diag(2)
# create Kalman model
model <- SSModel(as.matrix(Y_syn[, 1]) ~
0 + SSMcustom(Z=Zt, T=Tt, R=Rt, Q=Qt, a1=a1, P1=P1, P1inf=P1inf), H=Ht)
print(model)
#> Call:
#> SSModel(formula = as.matrix(Y_syn[, 1]) ~ 0 + SSMcustom(Z = Zt,
#> T = Tt, R = Rt, Q = Qt, a1 = a1, P1 = P1, P1inf = P1inf),
#> H = Ht)
#>
#> State space model object of class SSModel
#>
#> Dimensions:
#> [1] Number of time points: 1000
#> [1] Number of time series: 1
#> [1] Number of disturbances: 2
#> [1] Number of states: 2
#> Names of the states:
#> [1] custom1 custom2
#> Distributions of the time series:
#> [1] gaussian
#>
#> Object is a valid object of class SSModel.
# run Kalman filtering and smoothing and get their values
out <- KFS(model)
mu_Kalman_filtering[] <- out$a[-1, 1] # a is Kalman filtering (alphahat is Kalman smoothing) (a(T+1)=alphahat(T))
gamma_Kalman_filtering[] <- out$a[-1, 2]
mu_Kalman_smoothing[] <- out$alphahat[, 1] # a is Kalman filtering (alphahat is Kalman smoothing) (a(T+1)=alphahat(T))
gamma_Kalman_smoothing[] <- out$alphahat[, 2]
# let's smooth a bit the Kalman filtering
mu_Kalman_filtering[] <- filter(mu_Kalman_filtering, rep(1, 5)/5, sides = 1)
mu_Kalman_filtering <- na.locf(mu_Kalman_filtering, fromLast = TRUE)
gamma_Kalman_filtering[] <- filter(gamma_Kalman_filtering, rep(1, 5)/5, sides = 1)
gamma_Kalman_filtering <- na.locf(gamma_Kalman_filtering, fromLast = TRUE)
# plots
par(mfrow = c(2, 1))
plot(cbind(mu_true, mu_LS, mu_rolling_LS, mu_Kalman_filtering, mu_Kalman_smoothing),
legend.loc = "left", main = "Tracking of mu")
plot(cbind(gamma_true, gamma_LS, gamma_rolling_LS, gamma_Kalman_filtering, gamma_Kalman_smoothing),
legend.loc = "topleft", main = "Tracking of gamma", ylim = c(-0.2, 0.9))
We can look at the centered spreads (i.e., with the mean removed and, as before, normalized by the leverage), \(z_t = \left(y_{1t} - \gamma_t y_{2t} - \mu_t\right)/(1+\gamma_t)\), to properly compare the different methods:
# compute spreads
compute_spread <- function(Y, gamma, mu, name = NULL) {
w_spread <- cbind(1, -gamma)/cbind(1+gamma, 1+gamma)
spread <- rowSums(Y * w_spread) - mu/(1+gamma)
colnames(spread) <- name
return(spread)
}
spread_true <- compute_spread(Y_syn, gamma_true, mu_true, "spread-true")
spread_LS <- compute_spread(Y_syn, gamma_LS, mu_LS, "spread-LS")
spread_rolling_LS <- compute_spread(Y_syn, gamma_rolling_LS, mu_rolling_LS, "spread-rolling-LS")
spread_Kalman <- compute_spread(Y_syn, gamma_Kalman_filtering, mu_Kalman_filtering, "spread-Kalman")
# plots
plot(cbind(spread_true, spread_LS, spread_rolling_LS, spread_Kalman),
legend.loc = "topright", main = "Spreads")
Clearly the static LS and even the rolling LS perform really poorly (of course, with real data the parameters may not change as abruptly as in our synthetic data). Let’s now plot just the true spread and the one obtained with the Kalman filter:
# plots
plot(cbind(spread_true, spread_Kalman),
legend.loc = "topright", main = "Spreads", col = c("black", "blue"))
We are finally ready to compare the different methods to compute the spreads with real market data. We will use the same market data loaded above for many ETFs.
We first define the functions to estimate \(\mu\) and \(\gamma\) according to the different methods:
estimate_mu_gamma_LS <- function(Y, pct_training = 0.3) {
T <- nrow(Y)
T_trn <- round(pct_training*T)
# LS regression
ls_coeffs <- coef(lm(Y[1:T_trn, 1] ~ Y[1:T_trn, 2]))
mu <- xts(rep(ls_coeffs[1], T), index(Y))
colnames(mu) <- "mu-LS"
gamma <- xts(rep(ls_coeffs[2], T), index(Y))
colnames(gamma) <- "gamma-LS"
return(list(mu = mu, gamma = gamma))
}
estimate_mu_gamma_rolling_LS <- function(Y, pct_training = 0.3) {
T <- nrow(Y)
T_start <- round(pct_training*T)
T_lookback <- 500 # lookback window length
T_shift <- 10 # how often is refreshed
# init empty variables
gamma_rolling_LS <- mu_rolling_LS <- xts(rep(NA, T), index(Y))
colnames(mu_rolling_LS) <- "mu-rolling-LS"
colnames(gamma_rolling_LS) <- "gamma-rolling-LS"
# loop
t0_update <- seq(from = min(T_start, T_lookback), to = T-T_shift, by = T_shift)
for (t0 in t0_update) {
T_lookback_ <- ifelse(t0-T_lookback+1 >= 1, T_lookback, T_start)
ls_coeffs <- coef(lm(Y[(t0-T_lookback_+1):t0, 1] ~ Y[(t0-T_lookback_+1):t0, 2],
weights = last(1:T_lookback, T_lookback_)))
mu_rolling_LS[t0+1] <- ls_coeffs[1]
gamma_rolling_LS[t0+1] <- ls_coeffs[2]
}
# complete values
mu_rolling_LS <- na.locf(mu_rolling_LS)
mu_rolling_LS <- na.locf(mu_rolling_LS, fromLast = TRUE)
gamma_rolling_LS <- na.locf(gamma_rolling_LS)
gamma_rolling_LS <- na.locf(gamma_rolling_LS, fromLast = TRUE)
# smoothing
L <- 15
mu_rolling_LS[] <- filter(mu_rolling_LS, rep(1, L)/L, sides = 1)
mu_rolling_LS <- na.locf(mu_rolling_LS, fromLast = TRUE)
gamma_rolling_LS[] <- filter(gamma_rolling_LS, rep(1, L)/L, sides = 1)
gamma_rolling_LS <- na.locf(gamma_rolling_LS, fromLast = TRUE)
return(list(mu = mu_rolling_LS, gamma = gamma_rolling_LS))
}
estimate_mu_gamma_Kalman <- function(Y) {
T <- nrow(Y)
# init empty variables
gamma_Kalman_filtering <- mu_Kalman_filtering <- xts(rep(NA, T), index(Y))
colnames(mu_Kalman_filtering) <- "mu-Kalman"
colnames(gamma_Kalman_filtering) <- "gamma-Kalman"
# Kalman parameters
Tt <- diag(2)
Rt <- diag(2)
Qt <- 1e-5*diag(2) # state transition variance very small
Zt <- array(as.vector(t(cbind(1, as.matrix(Y[, 2])))), dim = c(1, 2, T)) # time-varying
Ht <- matrix(1e-3) # observation variance
# the prior in the code: P1cov = kappa*P1Inf + P1, kappa = 1e7
init <- estimate_mu_gamma_LS(Y)
a1 <- matrix(c(init$mu[1], init$gamma[1]), 2, 1)
P1 <- 1e-5*diag(2) # variance of initial point
P1inf <- 0*diag(2)
# create Kalman model
model <- SSModel(as.matrix(Y[, 1]) ~ 0 + SSMcustom(Z=Zt, T=Tt, R=Rt, Q=Qt, a1=a1, P1=P1, P1inf=P1inf), H=Ht)
# run Kalman filtering
out <- KFS(model)
mu_Kalman_filtering[] <- out$a[-1, 1] # a is Kalman filtering (alphahat is Kalman smoothing) (a(T+1)=alphahat(T))
gamma_Kalman_filtering[] <- out$a[-1, 2]
# smoothing
L <- 30
mu_Kalman_filtering[] <- filter(mu_Kalman_filtering, rep(1, L)/L, sides = 1)
mu_Kalman_filtering <- na.locf(mu_Kalman_filtering, fromLast = TRUE)
gamma_Kalman_filtering[] <- filter(gamma_Kalman_filtering, rep(1, L)/L, sides = 1)
gamma_Kalman_filtering <- na.locf(gamma_Kalman_filtering, fromLast = TRUE)
return(list(mu = mu_Kalman_filtering, gamma = gamma_Kalman_filtering))
}
Let’s choose one pair and period:
Y_ <- Y["2000-08::2003-12", c("EWH", "EWZ")]
if(anyNA(Y_))
Y_ <- na.approx(Y_)
plot(Y_, legend.loc = "bottomleft", main = "Log-prices")
Observe that in the period May 2002 - May 2003 the cointegration seems to have changed.
Now we can estimate the parameters by just invoking the functions above:
LS <- estimate_mu_gamma_LS(Y_)
rolling_LS <- estimate_mu_gamma_rolling_LS(Y_)
Kalman <- estimate_mu_gamma_Kalman(Y_)
We can plot the tracking of the parameters:
# plots
par(mfrow = c(2, 1))
{ plot(cbind(LS$mu, rolling_LS$mu, Kalman$mu),
legend.loc = "left", main = "Tracking of mu")
addEventLines(xts("", index(Y_[round(0.3*nrow(Y_))])), lwd = 2, col = "blue") }
{ plot(cbind(LS$gamma, rolling_LS$gamma, Kalman$gamma),
legend.loc = "left", main = "Tracking of gamma")
addEventLines(xts("", index(Y_[round(0.3*nrow(Y_))])), lwd = 2, col = "blue") }
And we can also plot the spreads:
spread_LS <- compute_spread(Y_, LS$gamma, LS$mu, "LS")
spread_rolling_LS <- compute_spread(Y_, rolling_LS$gamma, rolling_LS$mu, "rolling-LS")
spread_Kalman <- compute_spread(Y_, Kalman$gamma, Kalman$mu, "Kalman")
# plots
plot(cbind(spread_LS, spread_rolling_LS, spread_Kalman), legend.loc = "topleft", main = "Spreads")
We can see that the spread based on LS is not adapting at the end and cointegration is lost, whereas the spread based on Kalman looks beautiful.
We are finally ready to trade these spreads. Let’s define a function for that:
library(TTR)
generate_Z_score_EMA <- function(spread, n = 120) {
## exponentially weighted moving mean and variance (via EMA)
# first, the mean
spread.mean <- EMA(spread, n)
spread.mean <- na.locf(spread.mean, fromLast = TRUE)
spread.demeaned <- spread - spread.mean
# second, the variance
spread.var <- EMA(spread.demeaned^2, n)
spread.var <- na.locf(spread.var, fromLast = TRUE)
# finally compute Z-score
Z.score <- spread.demeaned/sqrt(spread.var)
return(Z.score)
}
pairs_trading <- function(Y, gamma, mu, name = NULL, threshold = 0.7, plot = FALSE) {
# spread and spread portfolio
w_spread <- cbind(1, -gamma)/cbind(1+gamma, 1+gamma)
spread <- rowSums(Y * w_spread) - mu/(1+gamma)
# Z-score
Z_score <- generate_Z_score_EMA(spread)
threshold_long <- threshold_short <- Z_score
threshold_short[] <- threshold
threshold_long[] <- -threshold
# trading signal
signal <- generate_signal(Z_score, threshold_long, threshold_short)
# combine the ref portfolio with trading signal
w_portf <- w_spread * lag(cbind(signal, signal)) # NOTE THE LAG!!
# # fix the portfolio (gamma and mu) during a trade
# lag_signal <- as.numeric(lag(signal))
# for (t in 2:nrow(w_portf)) {
# if (lag_signal[t] != 0 && lag_signal[t] == lag_signal[t-1])
# w_portf[t, ] <- w_portf[t-1, ]
# }
# now compute the PnL from the log-prices and the portfolio
X <- diff(Y) #compute log-returns from log-prices
portf_return <- xts(rowSums(X * w_portf), index(X))
portf_return[is.na(portf_return)] <- 0
colnames(portf_return) <- name
# plots
if (plot) {
tmp <- cbind(Z_score, signal)
colnames(tmp) <- c("Z-score", "signal")
par(mfrow = c(2, 1))
{ plot(tmp, legend.loc = "topleft",
main = paste("Z-score and trading on spread based on", name))
lines(threshold_short, lty = 2)
print(lines(threshold_long, lty = 2)) }
print(plot(cumprod(1 + portf_return), main = paste("Cum P&L for spread based on", name)))
}
return(portf_return)
}
return_LS <- pairs_trading(Y_["::2003-03"], LS$gamma["::2003-03"], LS$mu["::2003-03"],
"LS", plot = TRUE)
return_rolling_LS <- pairs_trading(Y_["::2003-03"], rolling_LS$gamma["::2003-03"], rolling_LS$mu["::2003-03"],
"rolling-LS", plot = TRUE)
return_Kalman <- pairs_trading(Y_["::2003-03"], Kalman$gamma["::2003-03"], Kalman$mu["::2003-03"],
"Kalman", plot = TRUE)
Let’s compare the P&L of the methods:
# plot
plot(cumprod(1 + cbind(return_LS, return_rolling_LS, return_Kalman)["::2003-03"]),
main = "Cum P&L", legend.loc = "topleft")
We can conclude with the following points: the static LS estimate of \(\mu\) and \(\gamma\) cannot adapt when the cointegration relationship drifts, so the spread may lose its mean reversion out of sample; the rolling LS can adapt, but only slowly, and it faces a difficult tradeoff between adaptability and estimation noise; the Kalman filter tracks the time-varying \(\mu_t\) and \(\gamma_t\) much more closely and produces a much better-behaved spread; and, in any case, cointegration observed in-sample may not persist out of sample, so the out-of-sample performance is the one that really counts.
Null hypothesis testing is a reductio ad absurdum argument adapted to statistics. In essence, a claim is assumed valid if its counter-claim is improbable. A result is said to be statistically significant if it allows us to reject the null hypothesis. In particular, the \(p\)-value is the probability of observing the current data under the null hypothesis (i.e., under the assumption of unit root). If the \(p\)-value is small, we can reject the null hypothesis, implying that the correct hypothesis lies in the logical complement of the null hypothesis (i.e., stationarity holds). Summary: we want small \(p\)-value when testing for cointegration.↩︎
Note that one could also swap \(y_{1t}\) and \(y_{2t}\) in the regression formulation.↩︎