The R package rcss provides users with a tool to approximate the value functions in the Bellman recursion under certain assumptions that guarantee desirable convergence properties. This R package represents the first software implementation of these methods using matrices and nearest neighbors. This package also employs a pathwise dynamic method to gauge the quality of these value function approximations. Statistical analysis can be performed on the results to obtain other useful practical insights. This paper describes rcss version 1.6.
Sequential decision making is usually addressed under the framework of discrete-time stochastic control. The theory of Markov decision processes and dynamic programming provides a variety of methods to deal with such questions. In generic situations, obtaining analytical solutions for even some of the simplest decision processes can be cumbersome (Powell 2007; Bauerle and Rieder 2011; Pham 2009). Numerical approximations are often far more practical given the rapid improvements in available computational power, and the ability to gauge the quality of these approximations is of equal importance. This paper describes the implementation of fast and accurate algorithms for problems with a finite time horizon, a finite action set, convex reward functions, and state processes that follow linear dynamics.
Although these assumptions seem restrictive, a large class of practically important control problems can be solved using our methods. For instance, by approximating a compact action set with a sufficiently dense finite grid of representative actions, an approximate solution with excellent precision and numerical performance can be obtained. This technique is illustrated by the battery storage control example (Hinz and Yee 2017a) in the present contribution. Furthermore, various extensions to non-linear state dynamics are possible. By using appropriate transformations to linearize the state dynamics, our methodology becomes applicable to the control of partially observable systems (Hinz and Yee 2017b).
This R package implements algorithms to approximate the value functions in finite horizon Markov decision processes and also to gauge the quality of these approximations. Under the conditions discussed in (Hinz 2014), these value function approximations converge to the true unknown value functions uniformly on compact sets. The corresponding topology, defined by half-norms given as maxima over compact sets, ensures that the decision rules inferred from the value function approximations converge pointwise to the optimal decision policy.
For the underlying technical details, we refer the interested reader to (Hinz 2014). The package rcss represents the first software implementation of these algorithms and has already been used to address problems such as pricing financial options (Hinz and Yap 2015), natural resource extraction (Hinz et al. In Press), battery management (Hinz and Yee 2017a), and optimal asset allocation under hidden state dynamics (Hinz and Yee 2017b). One of the major benefits of implementing these methods within R (R Core Team 2013) is that the results can be analysed using the vast number of statistical tools available in this language.
This R package focuses on linear state dynamics for the continuous state component. While non-linear dynamics are covered in (Yee Preprint), linear dynamics allow much of the computational effort to be reduced to a series of matrix multiplications and additions, an attractive feature given the availability of well-developed linear algebra libraries. Linear state dynamics also allow for a standardized problem formulation, which is desirable for a software package. The package further focuses on problems with finite action sets, which occur frequently in real-world applications; even when the original action set is not finite, it is often discretized in practice to ensure numerical tractability.
Suppose that the state space \(\mathbf{X} = \mathbf{P} \times \mathbf{Z}\) is the product of a finite set \(\mathbf{P}\) and an open convex set \(\mathbf{Z} \subseteq \mathbb{R}^d\). Furthermore, assume that a finite set \(\mathbf{A}\) represents all possible actions. Given a finite time horizon \(\{0, 1, \dots, T\} \subset {\mathbb N}\), consider a fully observable controlled Markovian process \((X_t)_{t=0}^T := (P_t, Z_t)_{t=0}^T\) which consists of two parts. The discrete component \((P_t)_{t=0}^T\) describes the evolution of a finite-state controlled Markov chain taking values in \(\mathbf{P}\). Assume that at any time \(t=0, \dots, T-1\) the controller chooses an action \(a \in \mathbf{A}\) in order to cause the one-step transition from the mode \(p \in \mathbf{P}\) to the mode \(p' \in \mathbf{P}\) with probability \(\alpha_{p, p'}(a)\), where \((\alpha_{p, p'}(a))_{p, p' \in \mathbf{P}}\) are pre-specified stochastic matrices for all \(a \in \mathbf{A}\).
Let us now turn to the evolution of the other component \((Z_t)_{t=0}^T\) of the state process \((X_t)_{t=0}^T\). Here, we assume that it follows an uncontrolled evolution in the space \(\mathbf{Z}\) driven as \[Z_{t+1}=W_{t+1} Z_{t}, \qquad t=0, \dots, T-1\] by independent disturbance matrices \((W_{t})_{t=1}^T\). That is, the transition kernels \({\cal K}^a_t\) governing the evolution of our controlled Markov process \((X_t)_{t=0}^T := (P_t, Z_t)_{t=0}^T\) from time \(t\) to \(t+1\) are given for each \(a \in \mathbf{A}\) by \[{\cal K}^a_t v(p,z)= \sum_{p' \in \mathbf{P}} \alpha_{p, p'}(a) \mathbb E(v(p', W_{t+1}z)), \quad p \in \mathbf{P}, \, z \in \mathbf{Z}, \, t=0, \dots, T-1\] which acts on each function \(v: \mathbf{P} \times \mathbf{Z} \to {\mathbb R}\) where the above expectations are well-defined.
If the system is in the state \((p, z)\), the costs of applying action \(a \in \mathbf{A}\) at time \(t=0, \dots, T-1\) are expressed through \(r_t(p, z, a)\). Having arrived at time \(t=T\) in the state \((p, z)\), a final scrap value \(r_T(p, z)\) is collected. Thereby the reward and scrap functions \[r_t: \mathbf{P} \times \mathbf{Z} \times \mathbf{A} \to {\mathbb R}, \quad r_T: \mathbf{P} \times \mathbf{Z} \to {\mathbb R}\] are exogenously given for \(t=0, \dots, T-1\). At each time \(t=0, \dots, T\), the decision rule \(\pi_{t}\) is given by a mapping \(\pi_{t}: \mathbf{P} \times \mathbf{Z} \to \mathbf{A}\), prescribing at time \(t\) an action \(\pi_{t}(p,z) \in \mathbf{A}\) for a given state \((p, z) \in \mathbf{P} \times \mathbf{Z}\). Note that each decision rule refers to the current state of the system, representing a feedback control. A sequence \(\pi = (\pi_{t})_{t=0}^{T-1}\) of decision rules is called a policy. For each policy \(\pi = (\pi_{t})_{t=0}^{T-1}\), the so-called policy value \(v^{\pi}_{0}(p_0, z_{0})\) is defined as the total expected reward \[v^{\pi}_{0}(p_0, z_{0})=\mathbb{E}^{ (p_0, z_0), \pi } \left[ \sum_{t=0}^{T-1} r_{t}(P_t, Z_{t}, \pi_{t}(P_t, Z_t)) + r_{T}(P_T, Z_{T}) \right].\] In this formula \(\mathbb{E}^{ (p_0, z_0), \pi }\) stands for the expectation with respect to the probability distribution of \((X_t)_{t=0}^T := (P_t, Z_t)_{t=0}^T\) defined by Markov transitions from \((P_{t}, Z_{t})\) to \((P_{t+1}, Z_{t+1})\), which are induced by the kernels \({\cal K}^{\pi_t(P_t, Z_t)}_t\) for \(t=0, \dots, T-1\), started at the initial point \((P_0, Z_0)=(p_0, z_0)\).
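To make the policy value concrete, the following self-contained sketch (illustrative only; all objects here are toy examples rather than package code) estimates \(v^{\pi}_{0}(p_0, z_0)\) for a fixed policy by Monte Carlo simulation of the controlled process:

## Illustrative toy example (not package code): estimate a policy value by
## Monte Carlo simulation of the controlled process (P_t, Z_t).
set.seed(1)
T_horizon <- 10
alpha_a <- list(matrix(c(1, 0, 0.5, 0.5), nrow = 2, byrow = TRUE),  ## alpha(., .)(a = 1)
                matrix(c(0.9, 0.1, 0, 1), nrow = 2, byrow = TRUE))  ## alpha(., .)(a = 2)
policy_fn <- function(t, p, z) 1                  ## a trivial feedback decision rule
reward_fn <- function(t, p, z, a) -0.01 * a * z[2]  ## toy running reward r_t
scrap_fn  <- function(p, z) z[2]                    ## toy scrap value r_T
n_sim <- 10000
values <- numeric(n_sim)
for (i in 1:n_sim) {
    p <- 1; z <- c(1, 1); total <- 0
    for (t in 0:(T_horizon - 1)) {
        a <- policy_fn(t, p, z)
        total <- total + reward_fn(t, p, z, a)
        p <- sample(1:2, 1, prob = alpha_a[[a]][p, ])  ## mode transition alpha_{p,p'}(a)
        W <- diag(c(1, exp(rnorm(1, 0, 0.1))))         ## disturbance matrix W_{t+1}
        z <- as.vector(W %*% z)                        ## linear dynamics Z_{t+1} = W_{t+1} Z_t
    }
    values[i] <- total + scrap_fn(p, z)
}
mean(values)  ## Monte Carlo estimate of the policy value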
Now we turn to the optimization goal. A policy \(\pi^{*}=(\pi^{*}_{t})_{t=0}^{T-1}\) is called optimal if it maximizes the total expected reward \(\pi \mapsto v^{\pi}_{0}(p_0, z_0)\) over all policies. To obtain such a policy, one introduces for \(t = 0, \dots, T-1\) the so-called Bellman operator \[\label{bell} {\cal T}_{t}v(p,z)=\max_{a \in \mathbf{A}} \left[ r_{t}(p, z, a) + \sum_{p'\in\mathbf{P}} \alpha_{p,p'}(a)\mathbb{E}[v(p',W_{t+1}z)]\right] \tag{1}\] for \((p, z) \in \mathbf{P} \times \mathbf{Z}\), acting on all functions \(v\) for which the stochastic kernel is defined. Consider the Bellman recursion, also referred to as backward induction: \[\label{backward_induction} v_{T}= r_{T}, \quad v_{t}= {\cal T}_{t} v_{t+1} \qquad \text{for } t=T-1, \dots, 0. \tag{2}\] Having assumed that the reward functions are convex and globally Lipschitz and the disturbances \(W_{t+1}\) are integrable, there exists a unique recursive solution \((v^{*}_{t})_{t=0}^{T}\) to the Bellman recursion. These functions \((v^{*}_{t})_{t=0}^{T}\) are called value functions and they determine an optimal policy (possibly not unique) \(\pi^{*}=(\pi^{*}_{t})_{t=0}^{T-1}\) via \[\begin{aligned} \pi^{*}_{t}(p,z) & = \arg\max_{a \in \mathbf{A}}\left[r_{t}(p, z, a)+ \sum_{p'\in\mathbf{P}} \alpha_{p,p'}(a)\mathbb{E}[v^{*}_{t+1}(p',W_{t+1}z)] \right], \end{aligned}\] for \(t = T-1,\dots, 0\).
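For intuition, the backward induction (2) can be written down directly for a small problem by replacing the expectation with an average over a disturbance sample and evaluating \(v_{t+1}\) at the nearest grid point. The sketch below (illustrative only, dropping the discrete mode component for brevity) does this for a toy optimal stopping problem; the package replaces such brute-force evaluation with the subgradient scheme described next.

## Illustrative toy example (not package code): naive backward induction for an
## optimal stopping problem, approximating the expectation in (1) by an average
## over a disturbance sample and evaluating v_{t+1} at the nearest grid point.
set.seed(1)
grid_z <- seq(20, 60, length = 201)              ## grid for the continuous state
W_sample <- exp(rnorm(200, -0.002, 0.04))        ## sampled scalar disturbances W^{(k)}
nn <- function(x) which.min(abs(grid_z - x))     ## nearest grid point index
payoff <- function(z) pmax(40 - z, 0)            ## exercise reward
T_horizon <- 10
v <- matrix(0, nrow = length(grid_z), ncol = T_horizon + 1)
v[, T_horizon + 1] <- payoff(grid_z)             ## terminal value v_T = scrap
for (t in T_horizon:1) {                         ## backward induction (2)
    for (i in seq_along(grid_z)) {
        cont <- mean(v[sapply(grid_z[i] * W_sample, nn), t + 1])  ## E[v_{t+1}(W z)]
        v[i, t] <- max(payoff(grid_z[i]), cont)  ## Bellman operator: best of stop/continue
    }
}
v[nn(36), 1]                                     ## approximate value at z = 36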
The aim of the package rcss is to approximate the true value functions \((v^{*}_{t})_{t=0}^{T}\) and a corresponding optimal policy \(\pi^{*}=(\pi^{*}_{t})_{t=0}^{T-1}\). Given these approximations, our package also determines their distance to optimality, which allows the user to decide whether the result is within some acceptable margin or whether further fine-tuning is needed to obtain even more accurate results.
Since the reward and scrap functions are convex in the continuous variable, the value functions are also convex and can be approximated by piecewise linear and convex functions. For this, introduce the so-called subgradient envelope \({\cal S}_{\mathbf{G}^m}f\) of a convex function \(f: \mathbf{Z} \to {\mathbb R}\) on a grid \(\mathbf{G}^m \subset \mathbf{Z}\) with \(m\) points: \[({\cal S}_{\mathbf{G}^m}f) (z)=\max_{g \in \mathbf{G}^m} (\triangledown_{g}f )(z),\] for \(z\in\mathbf{Z}\) where \(\triangledown_{g}f\) is the tangent of \(f\) at grid point \(g \in \mathbf{G}^m\). Using the subgradient envelope operator, define the double-modified Bellman operator as \[{\cal T}^{m, n}_{t}v(p, z) = {{\cal S}_{\mathbf{G}^{m}}} {\max_{a \in \mathbf{A}}}\left( {r_{t}(p, z , a)+} {\sum_{p'\in\mathbf{P}} \alpha_{p,p'}^a\sum_{k=1}^{n}\nu^{(k)}_{t+1}v(p', W^{(k)}_{t+1}z)} \right),\] where the probability weights \((\nu^{(k)}_{t+1})_{k=1}^{n}\) correspond to the distribution sampling \((W_{t+1}^{(k)})_{k=1}^{n}\) of each disturbance \(W_{t+1}\). The corresponding backward induction \[\begin{aligned} \label{scheme1} v^{m, n}_{T}(p, z)&=&{\cal S}_{\mathbf{G}^{m}} r_{T}(p, z), \end{aligned} \tag{3}\]
\[\begin{aligned} \label{scheme2} v^{m, n}_{t}(p, z)&={\cal T}^{m, n}_{t}v^{m, n}_{t+1}(p, z), \qquad t=T-1, \dots, 0, \end{aligned} \tag{4}\] yields the so-called double-modified value functions \((v^{m, n}_{t})_{t=0}^{T}\). Under appropriate assumptions regarding grid density and the disturbance sampling, the double-modified value functions converge uniformly to the true value functions on compact sets (see (Hinz 2014)). To gauge the quality of the above approximations, we construct two random variables whose expectations bound the true value function, i.e., \[\label{gap} \mathbb E(\underline{\upsilon}_0(p_0,z_0)) \leq v_0(p_0,z_0) \leq \mathbb E(\overline{\upsilon}_0(p_0,z_0)), \qquad p_0 \in \mathbf{P}, \quad z_0 \in \mathbf{Z}. \tag{5}\] We refer the interested reader to (Hinz and Yap 2015) for the technical details. This process exhibits a helpful self-tuning property: the closer the value function approximations resemble their true unknown counterparts, the tighter the bounds in Equation (5) and the lower the standard errors of the bound estimates. The R package rcss represents these piecewise linear functions as matrices and utilises nearest neighbor algorithms (Mangenat and Jefferies 2018) to reduce the computational effort. Most of the computational work is done in C++ via Rcpp (Eddelbuettel and Francois 2011) and is parallelized using OpenMP (Dagum and Menon 1998). The following sections give some code demonstrations.
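To see what the subgradient envelope does in practice, the following standalone sketch (illustrative only, not package code) stores the tangents of a convex function at the grid points as a matrix of intercept and slope pairs; evaluating the envelope then reduces to a maximum over linear functions. This mirrors the subgradient representations of the reward and scrap functions constructed in the examples below, and it is what allows much of the computation to be reduced to matrix operations.

## Illustrative sketch (not package code): the subgradient envelope of a convex
## function on a grid, stored as a matrix of tangents (intercept, slope).
f  <- function(z) pmax(40 - z, 0)                ## a convex example function
df <- function(z) ifelse(z < 40, -1, 0)          ## a subgradient of f
grid_z <- seq(30, 60, length = 31)
tangents <- cbind(f(grid_z) - df(grid_z) * grid_z,  ## intercepts of the tangents
                  df(grid_z))                       ## slopes of the tangents
SubgradEnvelope <- function(z) {
    apply(tangents %*% rbind(1, z), 2, max)      ## max over tangents, evaluated at z
}
SubgradEnvelope(c(35, 45))                       ## never exceeds f at these points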
Stochastic control problems of optimal switching type arise naturally in the valuation of financial contracts. A simple example is given by the Bermudan put option. This option gives its owner the right, but not the obligation, to choose one of a finite number of time points before the maturity of the option at which to exercise, in order to receive a payment which depends on the price of the underlying asset at the exercise time. The so-called fair price of the Bermudan option is related to the solution of an optimal stopping problem (see (Glasserman 2003)). Here, the risk-neutral undiscounted asset price process \(( \tilde Z_{t})_{t=0}^{T}\) at time steps \(0, \dots, T\) is modeled as a sampled geometric Brownian motion \[\tilde Z_{t+1}= \varepsilon_{t+1} \tilde Z_{t}, \quad t=0, \dots, T-1, \quad \tilde Z_{0} \in {\mathbb R}_{+}, \label{prev} \tag{6}\] where \(( \varepsilon_{t})_{t=1}^{T}\) are independent identically distributed random variables following a log-normal distribution, usually modeled in terms of \[\varepsilon_t=\exp\left(\left(\mu-\frac{\sigma^2}{2}\right)\Delta + \sigma \sqrt{\Delta}N_t\right),\] with independent, identically standard normally distributed random variables \((N_t)_{t=1}^T\), where the parameters \(\sigma>0\) and \(\mu \in {\mathbb R}_{+}\) represent the stock price volatility and the continuously compounded interest rate respectively, measured on a yearly scale. The tenor \(\Delta\) describes the length (measured in years) of the equidistant time interval between exercise times. The fair price of such an option with strike price \(K\), interest rate parameter \(\rho=\mu \Delta\), and maturity date \(\Delta T\) is given by the solution to the optimal stopping problem \[\sup\{ \mathbb E(e^{-\rho \tau} \max( K- \tilde Z_{\tau}, 0) ): \enspace \tau \text{ is a } \{0,1, \ldots, T\}\text{-valued stopping time}\}.\] A transformation of the state space is required to represent the reward functions in a convenient way. Introducing an augmentation with 1 via \[Z_{t}=\left[\begin{array}{c} 1 \\ \tilde Z_{t} \end{array}\right], \qquad t=0, \dots, T,\] it becomes possible to represent the evolution as the linear state dynamics \[Z_{t+1}=W_{t+1}Z_{t}, \qquad t=0, \dots, T-1,\] with independent and identically distributed matrix-valued random variables \((W_{t})_{t=1}^{T}\) given by \[W_{t+1}= \left[ \begin{array}{cc} 1 & 0 \\ 0 & \varepsilon_{t+1} \end{array}\right], \quad t=0,...,T-1. \label{bm} \tag{7}\]
This switching system is defined by two positions \(\mathbf{P}=\{1, 2\}\) and two actions \(\mathbf{A}=\{1, 2\}\). Here, the positions “exercised” and “not exercised” are represented by \(p=1\) and \(p=2\) respectively, and the actions “don’t exercise” and “exercise” are denoted by \(a=1\) and \(a=2\) respectively. With this interpretation, the position change is given by deterministic transitions to specified states: \[\alpha_{p,p'}^a =\left\{ \begin{array}{cl} 1 & \text{if } p'=\alpha(p, a), \\ 0 & \text{else,} \label{deftwodim1} \end{array} \right. \tag{8}\] determined by the target positions \[\label{twodim1} (\alpha(p, a))_{p,a=1}^{2} = \left[\begin{array}{cc} \alpha(1,1) & \alpha(1,2)\\ \alpha(2,1) & \alpha(2,2) \end{array} \right] =\left[\begin{array}{cc} 1 & 1\\ 2 & 1 \end{array} \right], \tag{9}\] while the reward and scrap functions are defined as \[\begin{aligned} r_{t}(p, (z^{(1)}, z^{(2)}), a) &= e^{-\rho t}\max(K- z^{(2)}, 0)(p-\alpha(p, a)), \label{rewput} \end{aligned} \tag{10}\] for \(t=0, \dots, T-1\), and \[\begin{aligned} r_{T}(p, (z^{(1)}, z^{(2)}) ) &= e^{-\rho T} \max(K- z^{(2)}, 0)(p-\alpha(p, 2)), \label{scrapput} \end{aligned} \tag{11}\] for all \(p\in \mathbf{P}\), \(a \in \mathbf{A}\), and \((z^{(1)}, z^{(2)}) \in \mathbf{Z}\).
As a demonstration, let us consider a Bermudan put option with strike price 40 that expires in 1 year. The put option is exercisable at 51 evenly spaced time points in the year, including the start and end of the year. The following code approximates the value functions in the Bellman recursion.
library(rcss)
rate <- 0.06 ## Interest rate
step <- 0.02 ## Time step between decision epochs
vol <- 0.2 ## Volatility of stock price process
n_dec <- 51 ## Number of decision epochs
strike <- 40 ## Strike price
control <- matrix(c(c(1, 1), c(2, 1)), nrow = 2, byrow = TRUE) ## Control
grid <- as.matrix(cbind(rep(1, 301), seq(30, 60, length = 301))) ## Grid
## Disturbance sampling
u <- (rate - 0.5 * vol^2) * step
sigma <- vol * sqrt(step)
condExpected <- function(a, b) {
    aa <- (log(a) - (u + sigma^2)) / sigma
    bb <- (log(b) - (u + sigma^2)) / sigma
    return(exp(u + sigma^2 / 2) * (pnorm(bb) - pnorm(aa)))
}
weight <- rep(1 / 1000, 1000)
disturb <- array(0, dim = c(2, 2, 1000))
disturb[1, 1, ] <- 1
part <- qlnorm(seq(0, 1, length = 1000 + 1), u, sigma)
for (i in 1:1000) {
    disturb[2, 2, i] <- condExpected(part[i], part[i + 1]) /
        (plnorm(part[i + 1], u, sigma) - plnorm(part[i], u, sigma))
}
## Subgradient representation of reward
in_money <- grid[, 2] <= strike
reward <- array(0, dim = c(301, 2, 2, 2, n_dec - 1))
reward[in_money, 1, 2, 2, ] <- strike
reward[in_money, 2, 2, 2, ] <- -1
for (tt in 1:(n_dec - 1)) {
    reward[, , , , tt] <- exp(-rate * step * (tt - 1)) * reward[, , , , tt]
}
## Subgrad representation of scrap
scrap <- array(data = 0, dim = c(301, 2, 2))
scrap[in_money, 1, 2] <- strike
scrap[in_money, 2, 2] <- -1
scrap <- exp(-rate * step * (n_dec - 1)) * scrap
## Bellman
r_index <- matrix(c(2, 2), ncol = 2)
bellman <- FastBellman(grid, reward, scrap, control, disturb, weight, r_index)
The matrix grid contains the grid points, with each row representing a
point. The 3-dimensional array disturb
represents the sampling of the
disturbances where disturb[,,i]
gives the i-th sample. Here, we use
local averages on a \(1000\) component partition of the disturbance space.
The 5-dimensional array reward
represents the subgradient
approximation with reward[,,a,p,t]
representing
\({\mathcal S}_{{\mathbf G}^m} r_{t}(p,.,a)\). The object bellman
is a
list containing the approximations of the value functions and expected
value functions for all positions and decision epochs. Please refer to
the package manual for the format of the inputs and outputs. To
visualise the value function of the Bermudan put option, simply run the
following plot command:
plot(grid[,2], rowSums(bellman$value[,,2,1] * grid), type = "l", xlab = "Stock Price",
ylab = "Option Value")
The following code then performs the solution diagnostics by computing the lower and upper bound estimates for the value of the option when \(Z_0 = 36\).
## Reward function
RewardFunc <- function(state, time) {
    output <- array(data = 0, dim = c(nrow(state), 2, 2))
    output[, 2, 2] <- exp(-rate * step * (time - 1)) * pmax(40 - state[, 2], 0)
    return(output)
}
## Scrap function
ScrapFunc <- function(state) {
    output <- array(data = 0, dim = c(nrow(state), 2))
    output[, 2] <- exp(-rate * step * (n_dec - 1)) * pmax(40 - state[, 2], 0)
    return(output)
}
## Get primal-dual bounds
start <- c(1, 36)
## Path disturbances
set.seed(12345)
n_path <- 500
path_disturb <- array(0, dim = c(2, 2, n_path, n_dec - 1))
path_disturb[1, 1, , ] <- 1
rand1 <- rnorm(n_path * (n_dec - 1) / 2)
rand1 <- as.vector(rbind(rand1, -rand1))
path_disturb[2, 2, , ] <- exp((rate - 0.5 * vol^2) * step + vol * sqrt(step) * rand1)
path <- PathDisturb(start, path_disturb)
policy <- FastPathPolicy(path, grid, control, RewardFunc, bellman$expected)
## Subsim disturbances
n_subsim <- 500
subsim <- array(0, dim = c(2, 2, n_subsim, n_path, n_dec - 1))
subsim[1, 1, , , ] <- 1
rand2 <- rnorm(n_subsim * n_path * (n_dec - 1) / 2)
rand2 <- as.vector(rbind(rand2, -rand2))
subsim[2, 2, , , ] <- exp((rate - 0.5 * vol^2) * step + vol * sqrt(step) * rand2)
subsim_weight <- rep(1 / n_subsim, n_subsim)
mart <- FastAddDual(path, subsim, subsim_weight, grid, bellman$value, ScrapFunc)
bounds <- AddDualBounds(path, control, RewardFunc, ScrapFunc, mart, policy)
The above code takes the exact reward and scrap functions as inputs. The
function FastPathPolicy
computes the candidate optimal policy. The
object bounds
is a list containing the primals and duals for each
sample path \(i\) and each position \(p\) at each decision time \(t\). Again,
please refer to the package manual for the format of the inputs and
outputs. If the price of the underlying asset is \(36\), the \(99\%\)
confidence interval for the option price is given by the following.
> print(GetBounds(bounds, 0.01, 2))
[1] 4.475802 4.480533
The package rcss also allows the user to test the prescribed policy from the Bellman recursion on any supplied set of sample paths. The resulting output can then be further studied with time series analysis or other statistical work. In the following code, we will use the previously generated \(500\) sample paths to backtest our policy and generate histograms.
test <- FullTestPolicy(2, path, control, RewardFunc, ScrapFunc, policy)
## Histogram of cumulated rewards
hist(test$value, xlab = "Cumulated Rewards", main = "")
## Exercise times
ex <- apply(test$position == 1, 1, function(x) min(which(x)))
ex[ex == Inf] <- 51
ex <- ex - 1
hist(ex, xlab = "Exercise Times", main = "")
Figure 2 contains the histograms for the cumulated rewards and exercise times. These can be useful to a practitioner, for example when anticipating when the holder will exercise the option. Let us emphasise the usefulness of such scenario generation: given an approximately optimal policy, one can backtest it and perform statistical analysis on the resulting values to obtain practical insights, for instance for risk analysis purposes.
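For example, simple summary statistics of the backtested cumulated rewards already provide a crude risk profile (an illustrative snippet, not part of the package):

## Mean backtested value, its standard error, and an empirical 5% quantile
mean(test$value)
sd(test$value) / sqrt(length(test$value))
quantile(test$value, 0.05)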
Let us now consider the swing option, a financial contract popular in the energy business. In its simplest form, it gives the owner the right to obtain a certain commodity (such as gas or electricity) at a pre-specified price and volume at a number of exercise times, which can be chosen freely by the contract owner. Let us consider a specific case of such a contract, referred to as a unit-time refraction period swing option, in which only one right may be exercised at any given time. Given the discounted commodity price \((S_{t})_{t=0}^{T}\), the so-called fair price of a swing option with \(N\) rights is given by the supremum \[\sup_{0\leq\tau_1<\dots<\tau_N\leq T}\mathbb{E}\Big[\sum^N_{n=1}(S_{\tau_n}-Ke^{-\rho {\tau_n}})^+\Big]\] over all stopping times \(\tau_1,\dots,\tau_N\) with values in \(\{0, \dots, T\}\). In order to represent this control problem as a switching system, we use the position set \(\mathbf{P}=\{1,\dots,N+1\}\) to describe the number of exercise rights remaining. That is, \(p \in \mathbf{P}\) stands for the situation in which \(p-1\) rights remain to be exercised. The action set \(\mathbf{A}=\{1,2\}\) represents the choice between exercising (\(a=1\)) and not exercising (\(a=2\)). The control matrices \((\alpha_{p,p'}^{a})\) are given for the exercise action \(a=1\) by \[\alpha_{p, p'}^{1}=\left\{ \begin{array}{ll} 1 & \text{if } p'=\max\{1, p-1\}, \\ 0 & \text{else,} \end{array}\right.\] and for the not-exercise action \(a=2\) by \[\alpha_{p, p'}^{2}=\left\{ \begin{array}{ll} 1 & \text{if } p'=p, \\ 0 & \text{else,} \end{array}\right.\] for all \(p, p' \in \mathbf{P}\). In the case of the swing option, the transition between \(p\) and \(p'\) occurs deterministically, since once the controller decides to exercise a right, the number of rights remaining is diminished by one. The deterministic control of the discrete component is easier to describe in terms of the matrix \((\alpha(p,a))_{p \in \mathbf{P}, a \in \mathbf{A}}\), where \(p'=\alpha(p, a) \in \mathbf{P}\) stands for the discrete component which is reached from \(p \in \mathbf{P}\) by the action \(a \in \mathbf{A}\). For the swing option this matrix is \[(\alpha(p,a))_{p \in \mathbf{P},a \in \mathbf{A}} = \left[ \begin{array}{cc} 1 & 1 \\ 1 & 2\\ 2& 3 \\ \dots & \dots \\ N & N+1 \\ \end{array} \right]. \label{twodim} \tag{12}\] Having modelled the discounted commodity price process as an exponential mean-reverting process with a reversion parameter \(\kappa \in [0, 1[\), long-run mean \(\mu>0\), and volatility \(\sigma>0\), we obtain the logarithm of the discounted price process as \[\label{ou} \tilde Z_{t+1} = (1-\kappa) (\tilde Z_{t}-\mu)+\mu + \sigma \epsilon_{t+1}, \quad \tilde Z_{0}=\ln(S_{0}). \tag{13}\] A further transformation of the state space is required before linear state dynamics can be achieved.
If we introduce an augmentation with 1 via \[Z_{t}=\left[\begin{array}{c} 1 \\ \tilde Z_{t} \end{array}\right], \qquad t=0, \dots, T,\] then it becomes possible to represent the evolution as the linear state dynamics \[Z_{t+1}=W_{t+1}Z_{t}, \qquad t=0, \dots, T-1,\] with independent and identically distributed matrix-valued random variables \((W_{t})_{t=1}^{T}\) given by \[W_{t+1}= \left[ \begin{array}{cc} 1 & 0 \\ \kappa \mu + \sigma \epsilon_{t+1} & (1 - \kappa) \end{array}\right], \quad t=0,...,T-1.\] The reward and scrap values are given by \[\label{eq:reward2} r_t(p,(z^{(1)}, z^{(2)}),a)= (e^{z^{(2)}}-Ke^{-\rho t})^{+} \big{(} p-\alpha(p,a) \big{)} \tag{14}\] for \(t=0,\dots,T-1\) and \[\label{eq:scrap2} r_T(p,(z^{(1)}, z^{(2)}))=(e^{z^{(2)}}-Ke^{-\rho T})^{+} \big{(} p-\alpha(p,1) \big{)} \tag{15}\] respectively for all \(p \in \mathbf{P}\) and \(a \in \mathbf{A}\).
In this example, consider a swing option with \(5\) rights exercisable on \(101\) time points. As before, we begin by performing the value function approximation.
## Parameters
rho <- 0
kappa <- 0.9
mu <- 0
sigma <- 0.5
K <- 0
n_dec <- 101 ## number of time epochs
N <- 5 ## number of rights
n_pos <- N + 1 ## number of positions
grid <- cbind(rep(1, 101), seq(-2, 2, length = 101)) ## Grid
## Control matrix
control <- cbind(c(1, 1:N), 1:(N + 1))
## Reward subgradient representation
reward <- array(0, dim = c(101, 2, 2, nrow(control), n_dec - 1))
slope <- exp(grid[, 2])
for (tt in 1:(n_dec - 1)) {
    discount <- exp(-rho * (tt - 1))
    for (pp in 2:n_pos) {
        intercept <- (exp(grid[, 2]) - K * discount) - slope * grid[, 2]
        reward[, 1, 1, pp, tt] <- intercept
        reward[, 2, 1, pp, tt] <- slope
    }
}
## Scrap subgradient representation
scrap <- array(0, dim = c(101, 2, nrow(control)))
discount <- exp(-rho * (n_dec - 1))
for (pp in 2:n_pos) {
    intercept <- (exp(grid[, 2]) - K * discount) - slope * grid[, 2]
    scrap[, 1, pp] <- intercept
    scrap[, 2, pp] <- slope
}
## Disturbance sampling
weight <- rep(1 / 1000, 1000)
disturb <- array(0, dim = c(2, 2, 1000))
disturb[1, 1, ] <- 1
disturb[2, 2, ] <- 1 - kappa
CondExpected <- function(a, b) {
    return(1 / sqrt(2 * pi) * (exp(-a^2 / 2) - exp(-b^2 / 2)))
}
part <- qnorm(seq(0, 1, length = 1000 + 1))
for (i in 1:1000) {
    disturb[2, 1, i] <- kappa * mu + sigma *
        (CondExpected(part[i], part[i + 1]) / (pnorm(part[i + 1]) - pnorm(part[i])))
}
## Bellman recursion
r_index <- matrix(c(2, 1), ncol = 2)
bellman <- FastBellman(grid, reward, scrap, control, disturb, weight, r_index)
After obtaining these function approximations, the following code computes the \(99\%\) confidence intervals for the value of a swing option with \(5\) remaining rights.
## Exact reward function
RewardFunc <- function(state, time) {
    output <- array(0, dim = c(nrow(state), 2, nrow(control)))
    discount <- exp(-rho * (time - 1))
    for (i in 2:nrow(control)) {
        output[, 1, i] <- pmax(exp(state[, 2]) - K * discount, 0)
    }
    return(output)
}
## Exact scrap function
ScrapFunc <- function(state) {
    output <- array(0, dim = c(nrow(state), nrow(control)))
    discount <- exp(-rho * (n_dec - 1))
    for (i in 2:nrow(control)) {
        output[, i] <- pmax(exp(state[, 2]) - K * discount, 0)
    }
    return(output)
}
## Generate paths
set.seed(12345)
n_path <- 500
path_disturb <- array(0, dim = c(2, 2, n_path, n_dec - 1))
path_disturb[1, 1, , ] <- 1
path_disturb[2, 2, , ] <- 1 - kappa
rand1 <- rnorm(n_path * (n_dec - 1) / 2)
rand1 <- as.vector(rbind(rand1, -rand1))
path_disturb[2, 1, , ] <- kappa * mu + sigma * rand1
start <- c(1, 0)
path <- PathDisturb(start, path_disturb)
policy <- FastPathPolicy(path, grid, control, RewardFunc, bellman$expected)
## Set subsimulation disturbances
n_subsim <- 500
subsim <- array(0, dim = c(2, 2, n_subsim, n_path, n_dec - 1))
subsim[1, 1, , , ] <- 1
subsim[2, 2, , , ] <- 1 - kappa
rand2 <- rnorm(n_subsim * n_path * (n_dec - 1) / 2)
rand2 <- as.vector(rbind(rand2, -rand2))
subsim[2, 1, , , ] <- kappa * mu + sigma * rand2
subsim_weight <- rep(1 / n_subsim, n_subsim)
## Primal-dual
mart <- FastAddDual(path, subsim, subsim_weight, grid, bellman$value, ScrapFunc)
bounds <- AddDualBounds(path, control, RewardFunc, ScrapFunc, mart, policy)
> print(GetBounds(bounds, 0.01, 6))
[1] 13.42159 13.44162
The tight \(99\%\) confidence interval for the price of the swing option above indicates that the function approximations are of a good quality. With this in mind, we now proceed to backtesting the associated policy. We reuse the \(500\) sample paths generated for the solution diagnostics.
test <- FullTestPolicy(6, path, control, RewardFunc, ScrapFunc, policy)
## Histogram of cumulated rewards
hist(test$value, xlab = "Cumulated Rewards", main = "")
## Exercise times
ex <- rep(0, n_dec)
for (i in 1:n_path) {
    for (t in 1:(n_dec - 1)) {
        if ((test$position[i, t] != 1) && (test$action[i, t] == 1)) {
            ex[t] <- ex[t] + 1
        }
    }
}
plot(0:(n_dec - 1), ex, xlab = "Time", ylab = "Number of Exercises", type = "b", main = "")
Figure 3 shows the histogram of the cumulated rewards and the number of exercises at each time point over the \(500\) sample paths. As mentioned before, more sophisticated statistical analysis can be performed on these path values to gain practical insights.
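For instance, the average number of rights exercised per path and a summary of the cumulated rewards can be obtained directly (illustrative only):

## Average number of exercised rights per sample path and distribution of values
sum(ex) / n_path
summary(test$value)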
This section examines optimal energy storage for an energy retailer operating a battery. A full description of the model is beyond the scope of this paper; we refer the interested reader to (Hinz and Yee 2017a) for details. In brief, the retailer buys electricity in a day-ahead market to cover random consumer demand, choosing a safety margin above the predicted net demand; excess energy is stored in the battery, while shortfalls are drawn from the battery or balanced from the grid at less favourable prices.
The code below begins by specifying the grid, the battery specifications, the action set, and the effect of the actions on the battery level. The grid represents the continuous state component that drives the forward price in the electricity market. Here, the battery has a capacity of 100 and its level is discretized in steps of 5, starting from 0. The action set represents the safety margin (the amount of energy bought in the day-ahead market in excess of the predicted net demand) that the retailer may choose. The control array below captures the transition probabilities of the battery level, given the current battery level and the retailer’s safety margin.
## Grid
n_grid <- 501 ## number of grid points
grid_start <- -15 ## lowest state
grid_end <- 15 ## highest state
grid <- cbind(rep(1, n_grid), seq(grid_start, grid_end, length = n_grid))
## Battery specification
step <- 5 ## step between battery levels
pos_start <- 0
pos_end <- 100
position <- seq(pos_start, pos_end, by = step) ## battery levels
n_pos <- length(position)
## Standard deviation for the consumer demand
std <- 10
## Actions
n_action <- 11 ## number of safety margins
safety_start <- 0
safety_end <- 50
safety <- seq(safety_start, safety_end, length = n_action) ## safety margins
## Control array
control <- array(data = 0, dim = c(n_pos, n_action, n_pos))
for (p in 1:n_pos) {
    for (a in 1:n_action) {
        temp <- position[p] + safety[a] ## center of normal distribution
        control[p, a, 1] <- pnorm(pos_start + step / 2, temp, std)
        control[p, a, n_pos] <- 1 - pnorm(pos_end - step / 2, temp, std)
        for (pp in 2:(n_pos - 1)) {
            control[p, a, pp] <- pnorm(position[pp] + step / 2, temp, std) -
                pnorm(position[pp] - step / 2, temp, std)
        }
    }
}
Next, the code computes the expected excess and shortage of electricity faced by the retailer in each time period.
## Functions to calculate expected excess and shortage energy demand
erf <- function(x) { ## error function
    return(2 * pnorm(x * sqrt(2)) - 1)
}
Excess <- function(pos, act) {
    temp1 <- pos_end + step / 2
    temp2 <- pos + act
    result <- std / sqrt(2 * pi) * exp(-(temp1 - temp2)^2 / (2 * std^2)) +
        (temp2 - pos_end) / 2 * (1 - erf(1 / sqrt(2 * std^2) * (temp1 - temp2)))
    return(result)
}
Shortage <- function(pos, act) {
    temp1 <- pos_start - step / 2
    temp2 <- pos + act
    result <- std / sqrt(2 * pi) * exp(-(temp1 - temp2)^2 / (2 * std^2)) +
        (pos_start - temp2) / 2 * (erf(1 / sqrt(2 * std^2) * (temp1 - temp2)) + 1)
    return(result)
}
## Expected excess and shortage energy demand
excess <- matrix(data = NA, nrow = n_pos, ncol = n_action)
shortage <- matrix(data = NA, nrow = n_pos, ncol = n_action)
for (p in 1:n_pos) {
    for (a in 1:n_action) {
        excess[p, a] <- Excess(position[p], safety[a])
        shortage[p, a] <- Shortage(position[p], safety[a])
    }
}
The subgradient envelopes of the reward and scrap functions are then specified below. Essentially, the reward is given by the cost of the forward trading and the cost of balancing any expected shortage or excess of energy. At the final time, the retailer sells all stored energy in the battery at the forward price; this is captured by the scrap function.
## Subgradient representation of reward functions
n_dec <- 48 * 7 ## number of decision epochs
u_t <- 10 + cos((0:(n_dec - 1)) * 2 * pi / 48 + 3 * pi / 2)
v_t <- 1 + (sin((0:(n_dec - 1)) * 2 * pi / 48 + 3 * pi / 2)) / 2
buy <- 20 ## price to buy from grid
sell <- 0 ## price to sell to grid
reward <- array(0, dim = c(n_grid, 2, n_action, n_pos, n_dec - 1))
for (p in 1:n_pos) {
    for (a in 1:n_action) {
        for (t in 1:(n_dec - 1)) {
            reward[, 1, a, p, t] <- -safety[a] * u_t[t] - shortage[p, a] * buy + excess[p, a] * sell
            reward[, 2, a, p, t] <- -safety[a] * v_t[t]
        }
    }
}
scrap <- array(0, dim = c(n_grid, 2, n_pos))
for (p in 1:n_pos) {
    scrap[, 1, p] <- position[p] * u_t[n_dec]
    scrap[, 2, p] <- position[p] * v_t[n_dec]
}
Finally, the disturbance sampling and the resulting double-modified value functions are computed using the subgradient approach. Here, we assume that the continuous state \((Z_t)\) driving the forward price is an autoregression of order one. For the disturbance sample, we partition the disturbance space into 1000 components of equal probability measure and take the local average on each component.
## Parameters for AR(1) process (Z_t)
mu <- 0
sigma <- 0.5
phi <- 0.9
## Disturbance sampling
n_disturb <- 1000 ## size of sampling
disturb_weight <- rep(1 / n_disturb, n_disturb) ## probability weights
disturb <- array(matrix(c(1, 0, 0, phi), ncol = 2, byrow = TRUE), dim = c(2, 2, n_disturb))
CondExpected <- function(a, b) {
    return(1 / sqrt(2 * pi) * (exp(-a^2 / 2) - exp(-b^2 / 2)))
}
part <- qnorm(seq(0, 1, length = n_disturb + 1))
for (i in 1:n_disturb) {
    disturb[2, 1, i] <- mu + sigma *
        (CondExpected(part[i], part[i + 1]) / (pnorm(part[i + 1]) - pnorm(part[i])))
}
r_index <- matrix(c(2, 1), ncol = 2) ## randomness index
## Fast Bellman recursion
bellman <- FastBellman(grid, reward, scrap, control, disturb, disturb_weight, r_index)
The following code performs the solution diagnostics, computes the bound estimates, and then outputs the results (shown in Table 1). This gives the expected cumulative cost (and its \(99\%\) confidence interval) faced by the retailer as a function of the current amount of stored energy in the battery. We refer the interested reader to (Hinz and Yee 2017a) for more details regarding this problem.
## Exact reward function
RewardFunc <- function(state, time) {
    output <- array(0, dim = c(nrow(state), n_action, n_pos))
    for (p in 1:n_pos) {
        for (a in 1:n_action) {
            output[, a, p] <- -safety[a] * (u_t[time] + v_t[time] * state[, 2]) -
                shortage[p, a] * buy + excess[p, a] * sell
        }
    }
    return(output)
}
## Scrap function
ScrapFunc <- function(state) {
    output <- array(0, dim = c(nrow(state), n_pos))
    for (p in 1:n_pos) {
        output[, p] <- position[p] * (u_t[n_dec] + v_t[n_dec] * state[, 2])
    }
    return(output)
}
## Generate sample path disturbances
set.seed(12345)
n_path <- 100
path_disturb <- array(matrix(c(1, 0, 0, phi), ncol = 2, byrow = TRUE),
                      dim = c(2, 2, n_path, n_dec - 1))
rand <- rnorm(n_path * (n_dec - 1) / 2)
rand <- as.vector(rbind(rand, -rand))
path_disturb[2, 1, , ] <- mu + sigma * rand
start <- c(1, 0) ## z_0
path <- PathDisturb(start, path_disturb)
policy <- FastPathPolicy(path, grid, control, RewardFunc, bellman$expected)
## Specifying subsimulation disturbance matrices
n_subsim <- 100
subsim_weight <- rep(1 / n_subsim, n_subsim)
subsim <- array(matrix(c(1, 0, 0, phi), ncol = 2, byrow = TRUE),
                dim = c(2, 2, n_subsim, n_path, n_dec - 1))
rand <- rnorm(n_subsim * n_path * (n_dec - 1) / 2)
rand <- as.vector(rbind(rand, -rand))
subsim[2, 1, , , ] <- mu + sigma * rand
## Compute primal and dual values
mart <- FastAddDual(path, subsim, subsim_weight, grid, bellman$value, ScrapFunc)
bounds <- AddDualBounds(path, control, RewardFunc, ScrapFunc, mart, policy)
## Storing the results
results <- matrix(data = NA, nrow = n_pos, ncol = 3)
alpha <- 0.01
for (p in 1:n_pos) {
    results[p, 1] <- sum(bellman$value[251, , p, 1] * grid[251, ])
    results[p, 2:3] <- GetBounds(bounds, alpha, p)
}
print(results)
Start Level (MWh) | Subgradient Estimate | \(99\%\) CI |
---|---|---|
0 | -1679.859 | (-1679.862, -1679.612) |
5 | -1629.859 | (-1629.862, -1629.612) |
10 | -1579.859 | (-1579.862, -1579.612) |
15 | -1529.859 | (-1529.862, -1529.612) |
20 | -1480.168 | (-1480.171, -1479.922) |
25 | -1433.574 | (-1433.577, -1433.328) |
30 | -1389.685 | (-1389.688, -1389.441) |
35 | -1348.509 | (-1348.512, -1348.266) |
40 | -1310.129 | (-1310.131, -1309.887) |
45 | -1274.601 | (-1274.604, -1274.362) |
50 | -1241.952 | (-1241.955, -1241.714) |
55 | -1212.186 | (-1212.188, -1211.949) |
60 | -1185.294 | (-1185.297, -1185.059) |
65 | -1161.261 | (-1161.263, -1161.027) |
70 | -1140.063 | (-1140.066, -1139.831) |
75 | -1121.677 | (-1121.680, -1121.446) |
80 | -1106.080 | (-1106.083, -1105.850) |
85 | -1093.250 | (-1093.253, -1093.021) |
90 | -1083.161 | (-1083.164, -1082.932) |
95 | -1075.727 | (-1075.730, -1075.499) |
100 | -1070.728 | (-1070.731, -1070.500) |
As a final example, let us consider the case of optimal resource extraction treated in (Hinz et al. In Press). Suppose we have a hypothetical copper mine described by the parameters in Table 2.
Output rate: 10 million pounds/year | Inventory level: 150 million pounds |
Costs of production: $0.50/pound | Opening/closing costs: $200,000 |
Maintenance costs: $500,000/year | Inflation rate: 8%/year |
Convenience yield: 1%/year | Price variance: 8%/year |
Real estate tax: 2%/year | Income tax: 50% |
Royalty tax: 0% | Interest rate: 10%/year |
The following code approximates the value functions for the above mine. The grid represents the price of the ore, which is assumed to follow a geometric Brownian motion. The control matrix reflects changes in the amount of unextracted ore driven by the owner’s dynamic decision to keep the mine open or closed. For the disturbance sampling, the disturbance space is partitioned into 1000 components of equal probability measure and the local average on each component is taken.
## Parameters
n_grid <- 501
grid <- as.matrix(cbind(rep(1, n_grid), seq(0, 10, length = n_grid)))
rate <- 0.1
delta <- 0.01
vol <- sqrt(0.08)
step <- 0.25
## Disturbance sampling
n_disturb <- 1000
disturb_weight <- rep(1 / n_disturb, n_disturb)
disturb <- array(0, dim = c(2, 2, n_disturb))
disturb[1, 1, ] <- 1
quantile <- rep(NA, n_disturb)
u <- (rate - delta - 0.5 * vol^2) * step
sigma <- vol * sqrt(step)
part <- qlnorm(seq(0, 1, length = n_disturb + 1), u, sigma)
condExpected <- function(a, b) { ## [a, b]
    aa <- (log(a) - (u + sigma^2)) / sigma
    bb <- (log(b) - (u + sigma^2)) / sigma
    vv <- exp(u + sigma^2 / 2) * (pnorm(bb) - pnorm(aa))
    return(vv)
}
for (i in 1:n_disturb) {
    quantile[i] <- condExpected(part[i], part[i + 1]) /
        (plnorm(part[i + 1], u, sigma) - plnorm(part[i], u, sigma))
}
disturb[2, 2, ] <- quantile
r_index <- matrix(c(2, 2), ncol = 2)
## Control matrix, a = 1 (close), 2 (abandon), 3 (open)
## p = 1 (exhausted), levels + 1 (full open), 2 * levels + 1 (full closed)
levels <- 15 / step
control <- matrix(data = 1, nrow = (2 * levels + 1), ncol = 3)
control[2:(2 * levels + 1), 1] <- (levels + 2):(2 * levels + 1)
control[2:(2 * levels + 1), 3] <- 1:levels
Next, we define the subgradient representation of the reward and scrap
functions. When the mine is open, the reward is essentially the income
from selling the extracted ore minus the cost of extracting that ore.
When the mine is closed, the reward is the cost of maintenance. If the
owner switches between open and closed states, there is a switching
cost. When the owner relinquishes ownership of the mine, there is no
reward or cost. The scrap reward function is set to zero in the below
code. These objects are then passed into the FastBellman
function
which computes the value function approximations using nearest
neighbors.
## Subgrad rep of rewards
H1 <- 5 * step
H2 <- -2.5 * step
maint <- 0.5 * step
switch <- 0.2
inflation <- 0.08
tax_property <- 0.02
n_dec <- 1 + 30 / step + 1
discount <- exp(-(rate + tax_property) * step)
n_pos <- nrow(control)
n_action <- ncol(control)
n_dim <- ncol(grid)
reward <- array(data = 0, dim = c(n_grid, n_dim, n_action, n_pos, n_dec - 1))
for (p in 2:n_pos) {
    for (t in 1:(n_dec - 1)) {
        pi <- exp(inflation * (t - 1) * step)
        adjust <- discount^(t - 1)
        if (p > (levels + 1)) { ## Closed
            ## Close
            reward[, 1, 1, p, t] <- adjust * -maint * pi
            ## Open
            reward[, 1, 3, p, t] <- (H2 - switch) * pi * adjust
            reward[, 2, 3, p, t] <- H1 * adjust
        } else if (p <= (levels + 1)) { ## Opened
            ## Close
            reward[, 1, 1, p, t] <- -(maint + switch) * pi * adjust
            ## Open
            reward[, 1, 3, p, t] <- H2 * pi * adjust
            reward[, 2, 3, p, t] <- H1 * adjust
        }
    }
}
## Subgrad rep of scrap
scrap <- array(data = 0, dim = c(n_grid, n_dim, n_pos))
## Performing fast Bellman recursion
bellman <- FastBellman(grid, reward, scrap, control, disturb, disturb_weight, r_index)
Finally, running the following code gives an illustration of the value of an open mine based on the current ore price.
plot(grid[,2], rowSums(bellman$value[,,levels,1] * grid), type = "l",
xlab = "Current Ore Price ($)", ylab = "Mine Value ($ millions)")
For more details regarding this problem, we refer the interested reader to (Hinz et al. In Press).
This paper demonstrates the use of our R package rcss for solving optimal switching problems. Our package represents the first software implementation of this class of algorithms. The problem setting discussed in this paper is broad and can be used to model a wide range of problems. Using nearest neighbor algorithms, the package rcss is able to solve some real-world problems both accurately and quickly. It is also worth mentioning that this R package can be used to address Markov decision processes with a hidden Markov model, as shown in (Hinz and Yee 2017b); however, this is not demonstrated here in order to keep this paper brief.