This R session will illustrate sparse index tracking, i.e., the tracking of an index with a sparse portfolio.

The R package sparseIndexTracking, available in GitHub and CRAN, is recommended.

(Useful R links: Cookbook R, Quick-R, R documentation, CRAN, METACRAN.)

Index tracking

Fund managers follow two types of investment strategies: active and passive. Active investments assume that the market is not perfectly efficient and try to beat the market, whereas passive investments assume that the market cannot be beaten in the long run.

Index tracking is a popular form of passive portfolio management strategy. The goal is to construct a portfolio that replicates the performance of a financial index. The following are comprehensive references on which the R package sparseIndexTracking is based:

K. Benidis, Y. Feng, and D. P. Palomar, Optimization Methods for Financial Index Tracking: From Theory to Practice. Foundations and Trends in Optimization, Now Publishers, 2018.

K. Benidis, Y. Feng, and D. P. Palomar, “Sparse portfolios for high-dimensional financial index tracking,” IEEE Trans. Signal Process., vol. 66, no. 1, pp. 155–170, 2018.

In words, the goal is to minimize some error measure of the index tracking using a reduced basket of representative assets as opposed to using all the assets (which would be problematic due to transaction costs and execution logistics).

To define the problem mathematically, let us introduce some definitions. Supposed that an index is composed of \(N\) assets. We denote by \(\mathbf{r}^b=[r_1^b,\dots,r_T^b]^\top\in\mathbb{R}^T\) and \(\mathbf{X}=[\mathbf{r}_1,\dots,\mathbf{r}_T]^\top\in\mathbb{R}^{T\times N}\) the (arithmetic) net returns of the index and the \(N\) assets in the past \(T\) days, respectively, with \(\mathbf{r}_t\in\mathbb{R}^N\) denoting the net returns of the \(N\) assets at the \(t\)-th day.

The goal in an optimization form is the design of a (sparse) portfolio \(\mathbf{w}\in\mathbb{R}_+^N\), with \(\mathbf{w}^\top\mathbf{1} = 1\), that tracks closely the index, i.e., \(\mathbf{X}\mathbf{w} \approx \mathbf{r}^b\). A typical measure of tracking error is the empirical tracking error: \[\textsf{ETE}(\mathbf{w}) = \frac{1}{T}\big\|\mathbf{r}^b - \mathbf{X}\mathbf{w}\big\|_2.\] The underlying optimization problem is \[\begin{array}{ll} \underset{\mathbf{w}}{\text{minimize}} & \frac{1}{T}\big\|\mathbf{r}^b - \mathbf{X}\mathbf{w}\big\|_2^2 + \lambda\|\mathbf{w}\|_0\\ \textsf{subject to} & \mathbf{w}^\top\mathbf{1}=1\\ & \mathbf{0}\leq\mathbf{w}\leq u\mathbf{1}, \end{array}\]

where \(\lambda\) is a regularization parameter that controls the sparsity of the portfolio, and \(u\) is an upper bound on the weights of the portfolio.


To warm up, we start with some simple benchmarks.

Two-step design

A simple design for sparse index tracking is to decompose the task into two steps: asset selection (choosing the active weights) and capital allocation (optimization of weights). We will assume we know the portfolio implemented by the actual index, which will be a dense portfolio (as opposed to sparse) \(\mathbf{b}\) such that \(\mathbf{X}\mathbf{b} = \mathbf{r}^b\). In practice, it is expensive to get the information on \(\mathbf{b}\), so we will just regress it.

Step 1: Asset selection
We can define the selection of \(K\) assets through the vector \(\mathbf{s}\), defined as \[s_i=\begin{cases} 1\quad & \text{if the }j\text{-th asset is selected}\\ 0 & \text{otherwise}, \end{cases}\] with \(\mathbf{1}^T\mathbf{s}=K\). For example, we can simply choose the \(K\) largest assets of \(\mathbf{b}\).

Step 2: Capital allocation
Once the active assets are known, then the portfolio design is simple. One option is to define the portfolio in the same proportion as the index composition (but only using the active assets): \[\mathbf{w}=\frac{\mathbf{b}\odot\mathbf{s}}{\mathbf{1}^T(\mathbf{b}\odot\mathbf{s})}\] where \(\odot\) denotes Hadamard (elementwise) product. Another option is to fit the index returns subject to the sparsity pattern given by the asset selection: \[\begin{array}{ll} \underset{\mathbf{w}}{\text{minimize}} & \frac{1}{T}\big\|\mathbf{r}^b - \mathbf{X}(\mathbf{w}\odot\mathbf{s})\big\|_2^2\\ \textsf{subject to} & \mathbf{1}^\top\mathbf{w}=1\\ & \mathbf{w}\ge\mathbf{0}. \end{array}\]

Mixed Integer Programming (MIP)

MIP finds the best combination by trying all possible combinations of \(K\) active assets out of \(N\), i.e., \(\binom{N}{K}\) combinations. However, this approach is only practical for small dimensions, e.g., even choosing \(K=20\) assets out of \(N=100\) is infeasible since \(\binom{100}{20}>10^{20}\).

Genetic algorithms

Genetic algorithms is a family of meta-optimization methods that encode a candidate solution in a genome and then they let a whole population of candidate solutions to evolve from generation to generation imitating the natural selection process of life itself. They are also computationally too intensive.

Numerical results

We will illustrate the two-step approach. We start by loading the market data of the index S&P 500 and its underlying assets during 2010 (from the package sparseIndexTracking):

library(sparseIndexTracking)  #install.packages("sparseIndexTracking")
T <- nrow(INDEX_2010$X)
N <- ncol(INDEX_2010$X)
X <- INDEX_2010$X
r <- INDEX_2010$SP500

# training and test sets
T_trn <- round(0.7*T)  # define the training set
T_tst <- T - T_trn
X_trn <- X[1:T_trn, ]
r_trn <- r[1:T_trn, ]
X_tst <- X[-(1:T_trn), ]
r_tst <- r[-(1:T_trn), ]

The data INDEX_2010 contains a list with two xts objects:
1. X: A \(T\times N\) xts with the daily linear returns of the \(N\) assets that were in the index during the year 2010 (total \(T\) trading days)
2. SP500: A \(T\times 1\) xts with the daily linear returns of the index S&P 500 during the same period.

Since we don’t know the index composition \(\mathbf{b}\), we will get an estimate by solving the constrained regression \[\begin{array}{ll} \underset{\mathbf{b}}{\t