This R session will illustrate pairs trading or statistical arbitrage.
(Useful R links: Cookbook R, Quick-R, R documentation, CRAN, METACRAN.)
There is a very convenient package, egcm, that computes the Engle-Granger cointegration test and many other unit-root tests. In particular, given two series \(x(t)\) and \(y(t)\), it searches for parameters \(\alpha\), \(\beta\), and \(\rho\) such that
\[ y(t) = \alpha + \beta x(t) + r(t)\\ r(t) = \rho r(t-1) + \epsilon(t) \] where \(r(t)\) is the residual and \(\epsilon(t)\) the innovation. If \(|\rho|<1\), then \(x(t)\) and \(y(t)\) are cointegrated (i.e., \(r(t)\) doesn’t contain a unit root). Of course the difficulty is in assessing exactly how much smaller than 1 the value of \(|\rho|\) has to be; this is the task of the unit-root test. The package urca contains more general models for cointegration. Perhaps the most famous example of a unit-root test is the Augmented Dickey-Fuller (ADF) test, which considers as null hypothesis that a unit root is pressent and as alternative hypothesis that the series is stationary (so a small \(p\)-value means indication of strong stationarity).1
Another interesting quantity to measure mean-reversion is the half-life, defined as the time it takes for the spread to mean-revert half of its distance after having diverged from the mean of the spread.
We now use the package egcm to check the cointegration of several illustrative examples:
SPY and IVV are oth ETF’s that track the S&P 500 and obviously they are cointegrated:
SPY_prices <- Ad(getSymbols("SPY", from = "2013-01-01", to = "2013-12-31", auto.assign = FALSE))
IVV_prices <- Ad(getSymbols("IVV", from = "2013-01-01", to = "2013-12-31", auto.assign = FALSE))
plot(cbind(SPY_prices, IVV_prices), legend.loc = "topleft", main = "ETF prices")
res <- egcm(SPY_prices, IVV_prices)
summary(res)
#> IVV.Adjusted[i] = 1.0077 SPY.Adjusted[i] - 0.5419 + R[i], R[i] = 0.0139 R[i-1] + eps[i], eps ~ N(0, 0.0480^2)
#> (0.0003) (0.0529) (0.0624)
#>
#> R[2013-12-30] = -0.0281 (t = -0.577)
#>
#> Unit Root Tests of Residuals
#> Statistic p-value
#> Augmented Dickey Fuller (ADF) -6.042 0.00010
#> Phillips-Perron (PP) -261.890 0.00010
#> Pantula, Gonzales-Farias and Fuller (PGFF) 0.007 0.00010
#> Elliott, Rothenberg and Stock DF-GLS (ERSD) -1.259 0.45122
#> Johansen's Trace Test (JOT) -103.605 0.00010
#> Schmidt and Phillips Rho (SPR) -172.446 0.00010
#>
#> Variances
#> SD(diff(SPY.Adjusted)) = 1.004723
#> SD(diff(IVV.Adjusted)) = 1.011531
#> SD(diff(residuals)) = 0.068125
#> SD(residuals) = 0.048626
#> SD(innovations) = 0.047976
#>
#> Half life = 0.162012
#> R[last] = -0.028071 (t=-0.58)
plot(res)
GLD and IAU both track the price of gold and they are cointegrated:
GLD_prices <- Ad(getSymbols("GLD", from = "2013-01-01", to = "2013-12-31", auto.assign = FALSE))
IAU_prices <- Ad(getSymbols("IAU", from = "2013-01-01", to = "2013-12-31", auto.assign = FALSE))
plot(cbind(GLD_prices, IAU_prices), legend.loc = "topright", main = "ETF prices")
plot(cbind(GLD_prices/as.numeric(GLD_prices[1]), IAU_prices/as.numeric(IAU_prices[1])),
legend.loc = "topright", main = "ETF normalized prices")
res <- egcm(GLD_prices, IAU_prices, log = TRUE)
print(res)
#> IAU.Adjusted[i] = 0.9967 GLD.Adjusted[i] - 2.2813 + R[i], R[i] = 0.2504 R[i-1] + eps[i], eps ~ N(0, 0.0005^2)
#> (0.0003) (0.0015) (0.0614)
#>
#> R[2013-12-30] = -0.0004 (t = -0.722)
plot(res)
Coca-cola and Pepsi are often mentioned as an example of a pair of securities for which pairs trading may be fruitful. However, at least in 2013, they were not cointegrated.
KO_prices <- Ad(getSymbols("KO", from = "2013-01-01", to = "2013-12-31", auto.assign = FALSE))
PEP_prices <- Ad(getSymbols("PEP", from = "2013-01-01", to = "2013-12-31", auto.assign = FALSE))
# check cointegration of prices
plot(cbind(KO_prices/as.numeric(KO_prices[1]), PEP_prices/as.numeric(PEP_prices[1])),
legend.loc = "topleft", main = "ETF normalized prices")
print(egcm(KO_prices, PEP_prices))
#> PEP.Adjusted[i] = 2.1843 KO.Adjusted[i] - 3.6323 + R[i], R[i] = 0.9922 R[i-1] + eps[i], eps ~ N(0, 0.5935^2)
#> (0.1237) (4.1252) (0.0141)
#>
#> R[2013-12-30] = -0.5379 (t = -0.204)
#>
#> WARNING: KO.Adjusted and PEP.Adjusted do not appear to be cointegrated.
# check cointegration of log-prices
plot(log(cbind(KO_prices, PEP_prices)), legend.loc = "topleft", main = "ETF log-prices")
res <- egcm(KO_prices, PEP_prices, log = TRUE)
print(res)
#> PEP.Adjusted[i] = 1.0941 KO.Adjusted[i] + 0.3996 + R[i], R[i] = 0.9901 R[i-1] + eps[i], eps ~ N(0, 0.0088^2)
#> (0.0599) (0.2098) (0.0143)
#>
#> R[2013-12-30] = -0.0088 (t = -0.228)
#>
#> WARNING: KO.Adjusted and PEP.Adjusted do not appear to be cointegrated.
plot(res)
Ford and GM seemed to be even more tightly linked. Yet, the degree of linkage was not high enough to pass the cointegration test.
F_prices <- Ad(getSymbols("F", from = "2013-01-01", to = "2013-12-31", auto.assign = FALSE))
GM_prices <- Ad(getSymbols("GM", from = "2013-01-01", to = "2013-12-31", auto.assign = FALSE))
plot(log(cbind(F_prices, GM_prices)), legend.loc = "topleft", main = "ETF log-prices")
res <- egcm(F_prices, GM_prices, log = TRUE)
print(res)
#> GM.Adjusted[i] = 0.9117 F.Adjusted[i] + 1.0328 + R[i], R[i] = 1.0000 R[i-1] + eps[i], eps ~ N(0, 0.0118^2)
#> (0.0260) (0.0650) (0.0155)
#>
#> R[2013-12-30] = 0.1900 (t = 3.830)
#>
#> WARNING: F.Adjusted and GM.Adjusted do not appear to be cointegrated.
plot(res)
Let \(y_{1t}\) and \(y_{2t}\) denote the log-prices of two stocks. If they are cointegrated, then the spread \(z_{t}=y_{1t}-\gamma y_{2t}\) is stationary and can be written as2 \[z_{t}=y_{1t}-\gamma y_{2t}=\mu+\epsilon_{t}\] where \(\mu\) represents the equilibrium value and \(\epsilon_{t}\) is a zero-mean residual. Equivalently, it can be written as \[y_{1t}=\mu+\gamma y_{2t}+\epsilon_{t}\] which now has the typical form for linear regression.
The Least Squares (LS) regression over \(T\) observations is \[ \underset{\mu,\gamma}{\textsf{minimize}} \quad \sum_{t=1}^{T}\left(y_{1t}-\left(\mu+\gamma y_{2t}\right)\right)^{2}. \] More compactly, by stacking the \(T\) observations in the vectors \(\mathbf{y}_{1}\) and \(\mathbf{y}_{2}\), we can write: \[ \underset{\mu,\gamma}{\textsf{minimize}} \quad \left\Vert \mathbf{y}_{1}-\left(\mu\mathbf{1}+\gamma\mathbf{y}_{2}\right)\right\Vert ^{2}. \]
We start by loading some stock prices for our tests:
library(xts)
library(quantmod)
# set begin-end dates and stock namelist
begin_date <- "2000-07-01"
end_date <- "2013-12-31"
stock_namelist <- c("0388.HK", "0688.HK")
# download data from YahooFinance
prices <- xts()
for (stock_index in 1:length(stock_namelist))
prices <- cbind(prices, Ad(getSymbols(stock_namelist[stock_index],
from = begin_date, to = end_date, auto.assign = FALSE)))
colnames(prices) <- stock_namelist
indexClass(prices) <- "Date"
T <- nrow(prices) # number of days
# interpolate NAs
anyNA(prices)
#> [1] TRUE
prices <- na.approx(prices)
# some visual exploration
str(prices)
#> An 'xts' object on 2000-07-03/2013-12-31 containing:
#> Data: num [1:3399, 1:2] 6.49 6.6 6.54 6.21 6.46 ...
#> - attr(*, "dimnames")=List of 2
#> ..$ : NULL
#> ..$ : chr [1:2] "0388.HK" "0688.HK"
#> Indexed by objects of class: [Date] TZ: UTC
#> xts Attributes:
#> NULL
head(prices)
#> 0388.HK 0688.HK
#> 2000-07-03 6.485238 0.446220
#> 2000-07-04 6.596574 0.452011
#> 2000-07-05 6.540904 0.475193
#> 2000-07-06 6.206903 0.463605
#> 2000-07-07 6.457407 0.463605
#> 2000-07-10 6.735743 0.480990
tail(prices)
#> 0388.HK 0688.HK
#> 2013-12-20 113.5739 18.70984
#> 2013-12-23 113.8391 18.45640
#> 2013-12-24 115.6951 18.49871
#> 2013-12-27 115.4300 18.41417
#> 2013-12-30 114.8997 18.24527
#> 2013-12-31 114.2810 18.41417
plot(prices, legend.loc = "topleft", main = "Stock prices")
Observe that from a visual inspection of the prices, it is not clear at all that the two stocks could be cointegrated. However, once we plot the log-prices, the picture becomes much more clear:
First of all, let’s try to estimate \(\gamma\) and \(\mu\) from a training period and then use it in the test period:
T_trn <- round(0.7*T) # define the training set
T_tst <- T - T_trn
y1 <- logprices[, 1]
y2 <- logprices[, 2]
# do LS regression
ls_coeffs <- coef(lm(y1[1:T_trn] ~ y2[1:T_trn]))
ls_coeffs
#> (Intercept) y2[1:T_trn]
#> 2.435953 0.864618
mu <- ls_coeffs[1]
gamma <- ls_coeffs[2]
We can plot the log-prices of one stock and its regression based on the other:
tmp <- cbind(y1, mu + gamma*y2)
colnames(tmp) <- c(colnames(y1), paste("mu + gamma x", colnames(y2)))
{ plot(tmp, legend.loc = "topleft", main = "Regression of y1 from y2")
addEventLines(xts("training", index(y1[T_trn])), srt = 90, pos = 2, lwd = 2, col = "blue") }
More clearly, we can plot the spread to observe its mean-reversion property:
# spread
spread <- y1 - gamma*y2
{ plot(spread, main = "Spread")
addEventLines(xts("", index(y1[T_trn])), lwd = 2, col = "blue") }
Indeed, the spread seems to be mean-reverting. However, one can observe that the mean reversion is much stronger during the training period than during the test period. This is a bit dangerous. One thing is for the spread to be mean-reverting during some period and a very different thing is whether the mean-reversion is persistent enough over subsequent periods. If it’s not persistent enough, then it is very dangerous to trade the spread. Indeed, the author of the package egcm wrote the following paper exploring precisely the persistency of cointegration in the markets:
Clegg, Matthew, On the Persistence of Cointegration in Pairs Trading (January 28, 2014). Available at SSRN: https://ssrn.com/abstract=2491201 or http://dx.doi.org/10.2139/ssrn.2491201
We can now run a cointegration test:
library(egcm)
#coint_result <- egcm(y1, y2, p.value = 0.01)
coint_result <- egcm(y1, y2, p.value = 0.05)
print(coint_result)
#> 0688.HK[i] = 1.1333 0388.HK[i] - 2.7359 + R[i], R[i] = 0.9937 R[i-1] + eps[i], eps ~ N(0, 0.0287^2)
#> (0.0032) (0.0187) (0.0022)
#>
#> R[2013-12-31] = 0.2787 (t = 1.234)
#>
#> WARNING: The series seem cointegrated but the residuals are not AR(1).
plot(coint_result)
The spread formed as \(z_{t}=y_{1t}-\gamma y_{2t}\) is effectively equivalent to using the portfolio \(\mathbf{w} = \left[\begin{array}{c} 1\\ -\gamma \end{array}\right]\). In order to make fair comparisons, we will fix the leverage of the portfolio so that \(\| \mathbf{w}\|_1 = 1\). The normalized portfolio is then \(\mathbf{w} = \left[\begin{array}{c} 1/(1+\gamma)\\ -\gamma/(1+\gamma) \end{array}\right]\)
Let’s form a more realistic spread based on the normalized portfolio:
# normalized portfolio
w_ref <- c(1, -gamma)/(1+gamma)
sum(abs(w_ref))
#> [1] 1
w_spread <- matrix(w_ref, T, 2, byrow = TRUE)
w_spread <- xts(w_spread, index(y1))
colnames(w_spread) <- c("w1", "w2")
# resulting normalized spread
spread <- rowSums(cbind(y1, y2) * w_spread)
spread <- xts(spread, index(y1))
colnames(spread) <- "spread"
{ plot(spread, main = "Spread (from normalized portfolio)")
addEventLines(xts("", index(y1[T_trn])), lwd = 2, col = "blue") }
The Z-score is a normalized version of the spread, which is useful to generate the trading signal. In particular, it is defined as \[Z^{\textsf{score}}_t = \frac{z_t-\textsf{E}[z_t]}{\textsf{Std}[z_t]}.\] Now let’s generate the Z-score:
spread_mean <- mean(spread[1:T_trn], na.rm = TRUE)
spread_var <- as.numeric(var(spread[1:T_trn], na.rm = TRUE))
Z_score <- (spread-spread_mean)/sqrt(spread_var)
colnames(Z_score) <- "Z-score"
threshold_long <- threshold_short <- Z_score
threshold_short[] <- .7
threshold_long[] <- -.7
{ plot(Z_score, main = "Z-score")
lines(threshold_short, lty = 2)
lines(threshold_long, lty = 2)
addEventLines(xts("", index(y1[T_trn])), lwd = 2, col = "blue") }
We are ready to generate the trading signal (0: no position, 1: long position, -1: short position):
# we define a function for convenience and future use
generate_signal <- function(Z_score, threshold_long, threshold_short) {
signal <- Z_score
colnames(signal) <- "signal"
signal[] <- NA
#initial position
signal[1] <- 0
if (Z_score[1] <= threshold_long[1]) {
signal[1] <- 1
} else if (Z_score[1] >= threshold_short[1])
signal[1] <- -1
# loop
for (t in 2:nrow(Z_score)) {
if (signal[t-1] == 0) { #if we were in no position
if (Z_score[t] <= threshold_long[t]) {
signal[t] <- 1
} else if(Z_score[t] >= threshold_short[t]) {
signal[t] <- -1
} else signal[t] <- 0
} else if (signal[t-1] == 1) { #if we were in a long position
if (Z_score[t] >= 0) signal[t] <- 0
else signal[t] <- signal[t-1]
} else { #if we were in a short position
if (Z_score[t] <= 0) signal[t] <- 0
else signal[t] <- signal[t-1]
}
}
return(signal)
}
# now just invoke the function
signal <- generate_signal(Z_score, threshold_long, threshold_short)
{ plot(cbind(Z_score, signal), main = "Z-score and trading signal", legend.loc = "topleft")
addEventLines(xts("", index(y1[T_trn])), lwd = 2, col = "blue") }