This R session will introduce the basics of R for finance.

Introduction

What is R?

R vs Python

  • The syntax of R, Python, Matlab, and Julia is really not that different (see cheatsheet).

  • R and Python are the most popular languages for data science. Some comparisons by Norm Matloff, Arslan Shahid, Martijn Theuwissen.

  • Data visualization: both R and Python have many plotting libraries, but the R package ggplot2 (based on the concept of “grammar of graphics”) is the clear winner (see gallery). Now Python also has a ggplot library.

  • Modeling libraries: both R and Python have tons of libraries. It seems that R has more in data science. For statistics, R is the clear winner.

  • Ease of learning: Python was designed in 1989 for programmers with a syntax inspired by C. R was developed around 1993 for statisticians and scientists, also with a syntax inspired by C. Some people think Python is easier while others think R is easier. Perhaps R was initially more difficult to learn than Python, but with modern IDEs like RStudio this is no longer the case. Some people may say that Python is more elegant, but that depends on what one is used to.

  • Speed: Python was initially faster than R. However, with modern packages that is no longer true. For example, the famous R package data.table for manipulating huge amounts of data is the clear winner (see benchmark). In fact, the R package Rcpp allows the combination of R with C++, leading to very fast package implementations. More recently, R has also benefited from many parallel computation packages.

  • Community support: both languages have large user bases and, hence, very active support communities.

  • Machine learning: Python is more popular for neural networks. However, the truth is that the popular deep learning libraries (e.g., TensorFlow, MXNet, etc.) are coded in C/C++ and have interfaces with Python, R, and other languages. Interestingly, random forests (one of the most popular machine learning methods) are far superior in R. The reason is that neural networks traditionally come from a computer science background whereas random forests come from a statistics background.

  • Why R?: R has been used for statistical computing for over two decades. You can get started with writing useful code in no time. It has been used extensively by data scientists and has an insane number of packages available for a lot of data science related tasks.

  • Why Python?: Python is more of a general purpose programming language. For web-based applications, Python seems to be more popular.

  • Finance: again both R and Python are heavily used in finance. You can easily find very passionate defenders and opponents of each language. From my own observations in the academic and industrial sectors, I can say that R is unbeatable for quick testing and prototype development and perhaps Python is more used for a later stage where the final product (probably web-based) has to be developed for clients.

Installation

To install, follow these simple steps:

  1. Install R from CRAN.
  2. Install the free IDE RStudio.

Now you are ready to start using R from within RStudio (note that you can also use R directly from the command line without the need for RStudio or you can use another IDE of your preference).

Packages

To see the versions of R and the installed packages just type sessionInfo():

To see the version of a specific package use packageVersion("package_name").

As time progresses, you will have to install different packages from CRAN with the command install.packages("package_name") or from GitHub with the command devtools::install_github("username/package_name"). After installing a package, it needs to be loaded before it can be used with the command library("package_name") or library(package_name):
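A minimal sketch of these commands, using the stats package (which ships with R) as a stand-in for package_name. The installation calls are left as comments because they need network access, and "username/package_name" is only a placeholder:

```r
# Versions of R and of the attached packages
info <- sessionInfo()
print(info$R.version$version.string)

# Version of one specific package ("stats" ships with every R installation)
v <- packageVersion("stats")
print(v)

# Installing from CRAN or from GitHub (commented out: requires network access)
# install.packages("xts")
# devtools::install_github("username/package_name")

# Load an installed package before using it
library(stats)
```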

Variables: vectors, matrices, data frames, and lists

In R, we can easily assign a value to a variable or object with <- (if the variable does not exist it will be created):

We can combine several elements with c():
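For instance (the variable names below are only illustrative):

```r
a <- 1                    # create the variable a
a <- a + 10               # overwrite it with a new value

x <- c(1, 2, -1, 0.5)     # combine numbers into a numeric vector
s <- c("AAPL", "GOOG")    # combine strings into a character vector
y <- c(x, a)              # c() also concatenates existing vectors
print(y)
```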

We can always see the variables in memory with ls():

My favorite command is str(variable). It gives you various information about the variable, i.e., type of variable, dimensions, contents, etc.

Another useful pair of commands is head() and tail(). They are especially useful for variables of large dimensions, showing the first and last few elements, respectively.
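A short sketch of ls(), str(), head(), and tail() on two made-up variables:

```r
x <- seq(from = 0.5, to = 50, by = 0.5)   # 100 numbers
tickers <- c("AAPL", "GOOG", "MSFT")

ls()                   # names of the variables currently in memory
str(x)                 # type, length, and the first few values
str(tickers)
first_six  <- head(x)          # first 6 elements by default
last_three <- tail(x, n = 3)   # last 3 elements
print(first_six)
print(last_three)
```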

It is important to remark that R is a functional language where almost everything is done through functions of all sorts (such as str(), print(), head(), ls(), tail(), max(), etc.).

There are a variety of functions for getting help:
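A few of the standard ones are sketched here. The interactive calls are shown as comments since they open a help viewer; the last two return ordinary R objects and also work inside scripts:

```r
# Interactive help:
# help("mean")            # documentation of a function; same as ?mean
# help.search("median")   # search the help system; same as ??median
# example("mean")         # run the examples from a help page

# Calls that return plain R objects:
found <- apropos("^mean")   # objects whose name starts with "mean"
print(found)
print(args(sd))             # argument list of a function
```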

Data types

Operators in R: arithmetic operators include +, -, *, /, ^ and comparison operators (which return logical values) include <, <=, >, >=, ==, !=.
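A quick illustration with two arbitrary numbers:

```r
x <- 7
y <- 2

total    <- x + y    # addition
quotient <- x / y    # division
power    <- x ^ y    # exponentiation

is_greater <- x > y    # comparisons return TRUE or FALSE
is_equal   <- x == y
is_diff    <- x != y
print(c(total, quotient, power))
print(c(is_greater, is_equal, is_diff))
```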

R has a wide variety of data types including scalars, vectors, matrices, data frames, and lists.

Vectors

A vector is just a collection of several variables of the same type (numerical, character, logical, etc.).

Refer to elements of a vector using subscripts:
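A small sketch with illustrative vectors; note that indexing in R starts at 1:

```r
a <- c(1, 2, 5.3, 6, -2, 4)       # numeric vector
b <- c("one", "two", "three")     # character vector
d <- c(TRUE, FALSE, TRUE)         # logical vector

a[2]          # second element
a[c(2, 4)]    # second and fourth elements
a[-1]         # everything except the first element
a[a > 3]      # elements that satisfy a condition
```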

Note that in R vectors are not column vectors or row vectors, they do not have any orientation. If one desires a column vector, then that is actually an \(n\times 1\) matrix.

It is also important to differentiate elementwise multiplication * from inner product %*% and outer product %o%:
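The following sketch, with two made-up vectors, illustrates both the lack of orientation and the three products:

```r
x <- c(1, 2, 3)
y <- c(4, 5, 6)

dim(x)                          # NULL: a plain vector has no orientation
col_x <- matrix(x, ncol = 1)    # an explicit 3 x 1 column vector

elementwise <- x * y      # elementwise product, a vector of length 3
inner       <- x %*% y    # inner product, returned as a 1 x 1 matrix
outer_prod  <- x %o% y    # outer product: entry [i, j] equals x[i] * y[j]
print(elementwise)
print(inner)
print(outer_prod)
```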

One can name the elements of a vector:
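For example, with a hypothetical vector of portfolio weights:

```r
w <- c(0.5, 0.3, 0.2)
names(w) <- c("AAPL", "GOOG", "MSFT")   # attach a name to each element
print(w)

w["GOOG"]               # access an element by name
w[c("AAPL", "MSFT")]    # access several elements by name
```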

Matrices

A matrix is a two-dimensional collection of several variables of the same type (numerical, character, logical, etc.).

We can easily create a matrix with matrix():

Identify rows, columns or elements using subscripts:
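A minimal sketch of matrix creation and subscripting (the matrix M is only an example):

```r
M <- matrix(1:20, nrow = 5, ncol = 4)   # filled column by column
print(M)

col4  <- M[, 4]          # 4th column
row3  <- M[3, ]          # 3rd row
block <- M[2:4, 1:3]     # rows 2 to 4 of columns 1 to 3
elem  <- M[3, 2]         # single element in row 3, column 2
```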

Arrays

Arrays are similar to matrices but can have more than two dimensions.
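As a small illustration, a three-dimensional array can be created with array() and indexed with one subscript per dimension:

```r
A <- array(1:24, dim = c(2, 3, 4))   # 2 rows, 3 columns, 4 "slices"
dim(A)

slice1 <- A[, , 1]     # the first 2 x 3 slice, which is a matrix
elem   <- A[1, 2, 3]   # a single element
```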

Data frames

A data frame is more general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.).

There are a variety of ways to identify the elements of a data frame:
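A minimal sketch with a made-up data frame whose columns have different types:

```r
df <- data.frame(
  id     = c(1, 2, 3, 4),
  color  = c("red", "white", "red", NA),
  passed = c(TRUE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)
str(df)

df[2:3]                  # columns 2 and 3 (still a data frame)
df[c("id", "passed")]    # columns selected by name
df$color                 # a single column as a vector
df[df$passed, ]          # rows where passed is TRUE
df[2, "color"]           # a single element
```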

Data frames in R are very powerful and versatile. They are commonly used in machine learning where each row is one observation and each column one variable (each variable can be of a different type). For financial applications, we mainly deal with multivariate time series, which can be seen as a matrix or data frame but with some particularities: each row is an observation but in a specific order (properly indexed with dates or times) and each column is of the same type (numeric). For multivariate time series we will later explore the class xts, which is more appropriate than matrices or data frames.

Plotting

rbokeh

The package rbokeh is adapted from Python and allows for interactive plotting (however, it seems that it is not being maintained anymore since the last update was in 2016):

Key packages for finance

Package xts

As previously mentioned, in finance we mainly deal with multivariate time series that can be thought of as matrices where each row is an observation in a specific order (properly indexed with dates or times) and all columns are of the same type (numeric), corresponding to different assets. One could simply use an object of class matrix or class data.frame. However, there is a very convenient class (from package xts) that has been specifically designed for that purpose: xts (actually, it is the culmination of a long history of development of other classes like ts, fts, mts, irts, tseries, timeSeries, and zoo).

Subsetting xts

The most noticeable difference in the behavior of xts objects will be apparent in the use of the “[” operator. Using special notation, one can use date-like strings to extract data based on the time-index. Using increasing levels of time-detail, it is possible to subset the object by year, week, days, or even seconds.

The i (row) argument to the subset operator “[”, in addition to accepting numeric values for indexing, can also be a character string, a time-based object, or a vector of either. The format must be left-specified with respect to the standard ISO 8601 time format “CCYY-MM-DD HH:MM:SS”. This means that for one to extract a particular month, it is necessary to fully specify the year as well. To identify a particular hour, say all observations in the eighth hour on January 1, 2007, one would likewise need to include the full year, month and day, e.g., “2007-01-01 08”.

It is also possible to explicitly request a range of times via this index-based subsetting, using the ISO-recommended “/” as the range separator. The basic form is “from/to”, where both from and to are optional. If either side is missing, it is interpreted as a request to retrieve data from the beginning, or through the end of the data object.

Another benefit to this method is that exact starting and ending times need not match the underlying data: the nearest available observation will be returned that is within the requested time period.

The following example shows how to extract the entire month of March 2007:

Now extract all the data from the beginning through January 7, 2007:
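Both extractions can be sketched as follows, assuming the example dataset sample_matrix that ships with xts (daily OHLC data for the first half of 2007) as the data; the object names are only illustrative:

```r
library(xts)
data(sample_matrix)                                # example data shipped with xts
x <- as.xts(sample_matrix, dateFormat = "Date")    # daily xts object indexed by Date

march <- x["2007-03"]            # the entire month of March 2007
early <- x["/2007-01-07"]        # from the beginning through January 7, 2007
q1    <- x["2007-01/2007-03"]    # a range in the form "from/to"

head(march, 3)
early
```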

Additional xts tools providing subsetting are the first and last functions. In the spirit of head and tail from the utils recommended package, they allow for string-based subsetting, without forcing the user to conform to the specifics of the time index. Here is the first week of the data:

… and here is the first 3 days of the last week of the data.
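A sketch of both calls, again assuming the sample_matrix dataset shipped with xts as the data:

```r
library(xts)
data(sample_matrix)
x <- as.xts(sample_matrix, dateFormat = "Date")

first_week <- first(x, "1 week")                    # first week of the data
three_days <- first(last(x, "1 week"), "3 days")    # first 3 days of the last week

first_week
three_days
```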

While the subsetting ability of the above makes exactly which time-based class you choose for your index a bit less relevant, it is nonetheless a factor that is beneficial to have control over.

To that end, xts provides facilities for indexing based on any of the current time-based classes. These include Date, POSIXct, chron, yearmon, yearqtr, and timeDate. The index itself may be accessed via the zoo generics extended to xts: index and the replacement function index<-.

It is also possible to directly query and set the index class of an xts object by using the respective functions indexClass and indexClass<- (deprecated in recent versions of xts in favor of tclass and tclass<-). Temporary conversion, resulting in a new object with the requested index class, can be accomplished via the convertIndex function.
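As a rough sketch of index access and replacement, assuming the sample_matrix data from xts and using tclass() to query the index class:

```r
library(xts)
data(sample_matrix)
x <- as.xts(sample_matrix, dateFormat = "Date")

idx <- index(x)     # extract the time index
class(idx)          # class of the index values
tclass(x)           # index class as reported by xts

# Replace the index with a POSIXct version of the same dates
x_posix <- x
index(x_posix) <- as.POSIXct(index(x))
class(index(x_posix))
```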

Of course one can also use the traditional indexing for matrices:

Finally, it is straightforward to combine different xts objects into one with multiple columns and properly aligned by the time index with merge() or simply the more standard cbind() (which calls merge()):
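Both points can be sketched with the sample_matrix data from xts; the two series a and b below are deliberately given different date ranges to show the alignment:

```r
library(xts)
data(sample_matrix)
x <- as.xts(sample_matrix, dateFormat = "Date")

# Traditional matrix-style indexing
x[1:3, ]                   # first three rows
x[1:3, "Close"]            # first three rows of one column
x[, c("Open", "Close")]    # two columns selected by name

# Two series covering different (overlapping) periods
a <- x["2007-01", "Open"]
b <- x["2007-01-15/2007-02-15", "Close"]

m  <- merge(a, b)    # aligned by date; missing values are filled with NA
m2 <- cbind(a, b)    # same result via cbind()
tail(m, 3)
```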

Plotting xts

Another advantage of using the class xts is for plotting. While the base R plot function is not very visually appealing, when plotting an xts object with plot() it is actually plot.xts() that is invoked and it is much prettier:

One can also use the awesome ggplot2 package. Recall that first we need to melt the multivariate xts object with the function ggplot2::fortify():

Alternatively, we can use the convenient function autoplot() (from the package ggfortify) that will do the melting for us:

Note: the package ggTimeSeries contains nice extensions of ggplot2 for time series (including calendar heatmaps, horizon plots, steamgraphs, waterfalls, etc.).

Additional time-based tools

Calculate periodicity: The periodicity function provides a quick summary as to the underlying periodicity of time series objects:
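For example, on the daily sample_matrix data from xts (used here as an assumed stand-in dataset):

```r
library(xts)
data(sample_matrix)
x <- as.xts(sample_matrix, dateFormat = "Date")

p <- periodicity(x)
print(p)      # summary of the periodicity and the date range
p$scale       # the detected scale as a string
```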

Find endpoints by time: Another common issue with time-series data is identifying the endpoints with respect to time. Often it is necessary to break data into hourly or monthly intervals to calculate some statistic. A simple call to endpoints offers a quick vector of values suitable for subsetting a dataset by. Note that the first element is zero, which is used to delineate the end.
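A minimal sketch on the sample_matrix data from xts (six months of daily observations):

```r
library(xts)
data(sample_matrix)
x <- as.xts(sample_matrix, dateFormat = "Date")

ep <- endpoints(x, on = "months")   # row number of the last observation of each month
print(ep)

x[ep[-1], ]    # the last observation of every month (drop the leading zero first)
```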

Change periodicity: One of the most ubiquitous types of data in finance is OHLC data (Open-High-Low-Close). Often it is necessary to change the periodicity of this data to something coarser, e.g., take daily data and aggregate to weekly or monthly. With to.period and related wrapper functions it is a simple proposition.
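As a sketch, the daily OHLC data in sample_matrix (from xts) can be aggregated like this:

```r
library(xts)
data(sample_matrix)
x <- as.xts(sample_matrix, dateFormat = "Date")

monthly <- to.period(x, period = "months")   # one OHLC bar per month
weekly  <- to.weekly(x)                      # wrapper for weekly bars
print(monthly)

# The monthly High is the maximum of the daily Highs within that month
jan_high <- max(x["2007-01", "High"])
```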

Periodically apply a function: Often it is desirable to be able to calculate a particular statistic, or evaluate a function, over a set of non-overlapping time periods. With the period.apply family of functions it is quite simple. The following examples illustrate a simple application of the max function to our example data:
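A sketch of the monthly maximum of the closing price, computed in three equivalent ways on the sample_matrix data from xts:

```r
library(xts)
data(sample_matrix)
x <- as.xts(sample_matrix, dateFormat = "Date")

ep <- endpoints(x, on = "months")

max_a <- period.apply(x[, "Close"], INDEX = ep, FUN = max)   # generic version
max_b <- apply.monthly(x[, "Close"], FUN = max)              # monthly wrapper
max_c <- period.max(x[, "Close"], INDEX = ep)                # optimized version
print(max_a)
```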

In addition to apply.monthly, there are wrappers to other common time frames including: apply.daily, apply.weekly, apply.quarterly, and apply.yearly. Current optimized functions include period.max, period.min, period.sum, and period.prod.

Package quantmod

The package quantmod is designed to assist the quantitative trader in the development, testing, and deployment of statistically based trading models.

Getting data: The most useful function in quantmod is getSymbols(), which allows one to conveniently load data from several websites like YahooFinance, GoogleFinance, FRED, etc.:

The OHLCV basics: Data commonly has the prices open, high, low, close, adjusted close, as well as volume. There are many handy functions to extract those data, e.g., Op(), Hi(), Lo(), Cl(), Ad(), Vo(), as well as to query a variety of questions such as is.OHLC(), has.Vo(), etc.
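These helpers can be tried offline. The sketch below assumes the sample_matrix dataset from xts, which has Open, High, Low, and Close columns but no volume, instead of downloaded data:

```r
library(quantmod)
data(sample_matrix, package = "xts")
x <- as.xts(sample_matrix, dateFormat = "Date")

is.OHLC(x)    # does the object contain open, high, low, and close columns?
has.Vo(x)     # does it contain a volume column?

close_px  <- Cl(x)            # closing prices
day_range <- Hi(x) - Lo(x)    # daily high-low range
head(close_px, 3)
```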

Charting with quantmod: The function chartSeries() is a nice tool to visualize financial time series in a way that many practitioners are familiar with: line charts, as well as OHLC bar and candle charts. There are convenience wrappers to these different styles (lineChart(), barChart(), and candleChart()), though chartSeries() does quite a bit to automatically handle data in the most appropriate way.

Technical analysis charting tools: One can add technical analysis studies from package TTR to the above charts:

Package TTR

The package TTR (Technical Trading Rules) is designed for traditional technical analysis and charting.

Moving averages: One can easily compute moving averages.

Bollinger Bands:

RSI – Relative Strength Index:

MACD:
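The four indicators above can be sketched on a synthetic price series (a simulated random walk, which is an assumption made here so that the example runs without downloading data). The window lengths are common illustrative choices, not recommendations:

```r
library(TTR)

set.seed(42)
price <- 100 * cumprod(1 + rnorm(300, mean = 0.0005, sd = 0.01))   # synthetic prices

sma20 <- SMA(price, n = 20)               # simple moving average
ema20 <- EMA(price, n = 20)               # exponential moving average
bb    <- BBands(price, n = 20, sd = 2)    # Bollinger Bands: lower, middle, upper, %B
rsi14 <- RSI(price, n = 14)               # RSI, a value between 0 and 100
macd  <- MACD(price, nFast = 12, nSlow = 26, nSig = 9, maType = "EMA")

tail(bb, 3)
tail(macd, 3)
```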

Package PerformanceAnalytics

The package PerformanceAnalytics contains a large list of convenient functions for plotting and evaluation of performance.

library(PerformanceAnalytics)

# compute returns
ret <- CalculateReturns(cbind(Cl(AAPL), Cl(GOOG)))  # same as Cl(AAPL)/lag(Cl(AAPL)) - 1 for each column
head(ret)
#>              AAPL.Close    GOOG.Close
#> 2013-01-02           NA            NA
#> 2013-01-03 -0.012622236  0.0005807487
#> 2013-01-04 -0.027854642  0.0197603623
#> 2013-01-07 -0.005882338 -0.0043632833
#> 2013-01-08  0.002691399 -0.0019735350
#> 2013-01-09 -0.015628904  0.0065730278

# performance measures
table.AnnualizedReturns(ret)
#>                           AAPL.Close GOOG.Close
#> Annualized Return             0.1105     0.2895
#> Annualized Std Dev            0.2575     0.2449
#> Annualized Sharpe (Rf=0%)     0.4291     1.1821
table.CalendarReturns(ret)
#>       Jan  Feb  Mar  Apr  May Jun  Jul  Aug  Sep  Oct  Nov  Dec AAPL.Close GOOG.Close
#> 2013 -0.3 -0.7 -2.1  2.9 -0.4 0.7 -0.2 -0.9 -1.2 -0.4  1.9  1.2        0.3       -0.3
#> 2014  0.2 -0.3  0.0 -0.4 -0.4 1.0 -2.6  0.2  0.6  1.0 -0.1 -1.9       -2.6        1.5
#> 2015 -1.5 -1.5 -1.5 -2.7 -1.1 0.7 -0.9 -0.5  1.1 -0.9  0.4 -1.3       -9.2       -2.8
table.Stats(ret)
#>                 AAPL.Close GOOG.Close
#> Observations      754.0000   754.0000
#> NAs                 1.0000     1.0000
#> Minimum            -0.1236    -0.0531
#> Quartile 1         -0.0076    -0.0068
#> Median              0.0001     0.0000
#> Arithmetic Mean     0.0005     0.0011
#> Geometric Mean      0.0004     0.0010
#> Quartile 3          0.0100     0.0084
#> Maximum             0.0820     0.1605
#> SE Mean             0.0006     0.0006
#> LCL Mean (0.95)    -0.0006     0.0000
#> UCL Mean (0.95)     0.0017     0.0022
#> Variance            0.0003     0.0002
#> Stdev               0.0162     0.0154
#> Skewness           -0.5988     2.6857
#> Kurtosis            6.5048    24.1815
table.DownsideRisk(ret)
#>                               AAPL.Close GOOG.Close
#> Semi Deviation                    0.0118     0.0092
#> Gain Deviation                    0.0105     0.0142
#> Loss Deviation                    0.0120     0.0082
#> Downside Deviation (MAR=210%)     0.0163     0.0138
#> Downside Deviation (Rf=0%)        0.0116     0.0086
#> Downside Deviation (0%)           0.0116     0.0086
#> Maximum Drawdown                  0.2887     0.1918
#> Historical VaR (95%)             -0.0249    -0.0196
#> Historical ES (95%)              -0.0369    -0.0273
#> Modified VaR (95%)               -0.0266    -0.0029
#> Modified ES (95%)                -0.0538    -0.0241

# plots
charts.PerformanceSummary(ret, wealth.index = TRUE, main = "Buy & Hold performance")

Example of a technical trading strategy combining the packages xts, quantmod, TTR, and PerformanceAnalytics

For illustration purposes, let’s now put into practice the basic financial packages xts, quantmod, TTR, and PerformanceAnalytics with a very simple example of a technical trading strategy.

Disclaimer: this course is not based on technical trading at all 😱; on the contrary, it is based on sound statistical modeling and portfolio optimization. 👍

As a trading strategy, we choose MACD (Moving Average Convergence Divergence) for this example. In a moving average crossover strategy, two averages are computed: a slow moving average and a fast moving average. The difference between the fast moving average and the slow moving average is called the MACD line. A third average called the signal line (a 9-day exponential moving average of the MACD line) is also computed. If the MACD line crosses above the signal line, it is a bullish sign and we go long. If the MACD line crosses below the signal line, it is a bearish sign and we go short. We use the closing price of NSE data to calculate the averages.

We define our trading signal as follows:

  • If the MACD line crosses above the signal line, we go long on NSE
  • If the MACD line crosses below the signal line, we go short on NSE

The trading signal is then applied to the closing price to obtain the returns of our strategy:
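A minimal sketch of this pipeline is given below. It assumes a synthetic random-walk price series in place of the NSE data, and the object names (close, position, strategy_ret) are hypothetical. The position is lagged by one day so that today's return is traded on yesterday's signal, which avoids look-ahead bias:

```r
library(xts)
library(TTR)

# Synthetic daily closing prices (assumption: stand-in for the NSE data)
set.seed(1)
dates <- seq(as.Date("2015-01-01"), by = "day", length.out = 500)
close <- xts(100 * cumprod(1 + rnorm(500, mean = 0, sd = 0.01)), order.by = dates)
colnames(close) <- "Close"

# MACD line and signal line
macd <- MACD(close, nFast = 12, nSlow = 26, nSig = 9, maType = "EMA")

# +1 (long) when the MACD line is above the signal line, -1 (short) otherwise
raw_pos  <- ifelse(as.numeric(macd$macd) > as.numeric(macd$signal), 1, -1)
position <- xts(raw_pos, order.by = index(macd))
position <- lag.xts(position, k = 1)    # trade on the previous day's signal

# Strategy returns: daily returns multiplied by the position held
ret          <- ROC(close, type = "discrete")
strategy_ret <- ret * position
tail(strategy_ret, 3)
```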

Finally, we can evaluate the performance:

Package portfolioBacktest

When a trader designs a portfolio strategy, the first thing to do is to backtest it. Backtesting is the process by which the portfolio strategy is put to test using the past historical market data available.

A common approach is to do a single backtest against the existing historical data and then plot graphs and draw conclusions from that. This is a big mistake. Performing a single backtest is not representative as it is just one realization and one will definitely overfit the tested strategy if there is parameter tuning involved or portfolio comparisons involved. Section 1 of this book chapter on backtesting illustrates the dangers of backtesting.

The package portfolioBacktest performs multiple backtesting of portfolios in an automated way on a rolling-window basis by taking data randomly from different markets, different time periods, and different stock universes. Here is a simple usage example with the equally weighted portfolio:

  • Step 1 - load package & dataset (you should download many more datasets, see vignette)
  • Step 2 - define your own portfolio
  • Step 3 - do backtest (dataset10 just contains 10 datasets for illustration purposes)
  • Step 4 - check your portfolio performance

Examples of the produced tables/plots include:

  • Performance table:

  • Barplot:

  • Boxplot:

R Scripts and R Markdown

R scripts

One simple way to use R is by typing the commands in the command window one by one. However, this quickly becomes inconvenient and it is necessary to write scripts. In RStudio one can simply create a new R script or open a .R file, where the commands are written in the same order as they will be later executed (this point cannot be overemphasized).

With the R script open, one can execute line by line (either clicking a button or with a keyboard shortcut) or source the whole R file (also either clicking a button or with a keyboard shortcut). Alternatively, one can also source the R file from the command line with source("filename.R"), but first one has to make sure to be in the correct folder (to see and set the current directory use the commands getwd() and setwd("folder_name")). Sourcing using the command source("filename.R") is very convenient when one has a library of useful functions or data that is needed prior to the execution of another main R script.
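The sourcing workflow can be sketched as follows. The script is written to a temporary folder so that the example is self-contained; the file name, the helper function annualize(), and the 252 trading days per year are illustrative assumptions:

```r
# Write a tiny "library" script to a temporary folder
script_path <- file.path(tempdir(), "my_functions.R")
writeLines(c(
  "# helper: annualize a daily return assuming 252 trading days",
  "annualize <- function(daily_ret) (1 + daily_ret)^252 - 1"
), con = script_path)

getwd()                # current working directory
source(script_path)    # run the script; its functions are now available

annualize(0.0005)
```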

R Markdown

Another important type of script is the R Markdown format (with file extension .Rmd). It is an extremely versatile format that allows the combination of formatted text, mathematics based on LaTeX code, R code (or code in any other language), and the automatic inclusion of the results from the execution of the code (plots, tables, or any other type of output). A similar format exists for Python, generally referred to as Jupyter Notebooks. Such formats have recently become key in the context of reproducible research (because anybody can execute the source .Rmd file and reproduce all the plots and output). This document that you are now reading is an example of an R Markdown script.

R Markdown files can be directly created or opened from within RStudio. To compile the source .Rmd file, just click the button called Knit and an HTML file will be automatically generated after executing all the chunks of code (other formats like PDF can also be generated).

The following is a simple header/body template that can be used to prepare projects/reports for this course:

---
title: "Hello R Markdown"
author: "Awesome Me"
date: "2018-09-05"
output: html_document
---

For more information on the R Markdown formatting:

To explore further

There are several CRAN Task Views relevant to financial applications, each of them encompassing many packages: