rcss: R package for optimal convex stochastic switching

The R package rcss provides users with a tool to approximate the value functions in the Bellman recursion under certain assumptions that guarantee desirable convergence properties. This R package represents the first software implementation of these methods using matrices and nearest neighbors. The package also employs a pathwise dynamic method to gauge the quality of these value function approximations, and statistical analysis can be performed on the results to obtain further practical insights. This paper describes rcss version 1.6.

Problem setting

Consider a controlled Markov process $(X_t)_{t=0}^{T} := (P_t, Z_t)_{t=0}^{T}$ whose discrete component $(P_t)_{t=0}^{T}$ takes values in a finite set $P$ of positions. At each time $t = 0, \ldots, T-1$ the controller takes an action $a \in A$ from a finite set $A$ of all admissible actions in order to cause the one-step transition from the mode $p \in P$ to the mode $p' \in P$ with probability $\alpha_{p,p'}(a)$, where $(\alpha_{p,p'}(a))_{p,p' \in P}$ are pre-specified stochastic matrices for all $a \in A$.
Let us now turn to the evolution of the other component $(Z_t)_{t=0}^{T}$ of the state process $(X_t)_{t=0}^{T}$. Here, we assume that it follows an uncontrolled evolution in the space $Z$, driven by independent disturbance matrices $(W_t)_{t=1}^{T}$ as
$$Z_{t+1} = W_{t+1} Z_t, \qquad t = 0, \ldots, T-1.$$
That is, the transition kernels $\mathcal{K}^a_t$ governing the evolution of our controlled Markov process $(X_t)_{t=0}^{T} := (P_t, Z_t)_{t=0}^{T}$ from time $t$ to $t+1$ are given for each $a \in A$ by
$$\mathcal{K}^a_t v(p, z) = \sum_{p' \in P} \alpha_{p,p'}(a)\, E\big(v(p', W_{t+1} z)\big), \qquad p \in P,\ z \in Z,\ t = 0, \ldots, T-1,$$
which acts on each function $v : P \times Z \to \mathbb{R}$ for which the above expectations are well-defined.
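As a quick illustration of how such a kernel acts, the expectation can be approximated by averaging over sampled disturbance matrices (a self-contained R sketch with made-up numbers, not package code):

## Monte Carlo evaluation of K_t^a v(p, z) for a two-mode example (illustrative)
set.seed(1)
alpha <- matrix(c(0.9, 0.1,
                  0.2, 0.8), nrow = 2, byrow = TRUE)  ## alpha[p, p'] under one action a
v <- function(p, z) p * sum(z)         ## placeholder function v(p, z)
z <- c(1, 36)                          ## current continuous state
W <- replicate(1000, matrix(c(1, 0, 0, exp(rnorm(1, 0, 0.1))), 2, 2))
Kv <- 0
for (pp in 1:2) {                      ## sum over target modes p'
  Ev <- mean(apply(W, 3, function(w) v(pp, w %*% z)))  ## E[v(p', W z)]
  Kv <- Kv + alpha[1, pp] * Ev         ## kernel evaluated at mode p = 1
}
Kv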
If the system is in the state $(p, z)$, the costs of applying action $a \in A$ at time $t = 0, \ldots, T-1$ are expressed through $r_t(p, z, a)$. Having arrived at time $t = T$ in the state $(p, z)$, a final scrap value $r_T(p, z)$ is collected. Thereby the reward and scrap functions $r_t : P \times Z \times A \to \mathbb{R}$, $r_T : P \times Z \to \mathbb{R}$ are exogenously given for $t = 0, \ldots, T-1$. At each time $t = 0, \ldots, T-1$, the decision rule $\pi_t$ is given by a mapping $\pi_t : P \times Z \to A$, prescribing at time $t$ an action $\pi_t(p, z) \in A$ for a given state $(p, z) \in P \times Z$. Note that each decision rule refers to the current state of the system, representing a feedback control. A sequence $\pi = (\pi_t)_{t=0}^{T-1}$ of decision rules is called a policy. For each policy $\pi = (\pi_t)_{t=0}^{T-1}$, the so-called policy value $v^{\pi}_0(p_0, z_0)$ is defined as the total expected reward
$$v^{\pi}_0(p_0, z_0) = E^{(p_0, z_0), \pi}\left[\sum_{t=0}^{T-1} r_t\big(P_t, Z_t, \pi_t(P_t, Z_t)\big) + r_T(P_T, Z_T)\right].$$
In this formula, $E^{(p_0, z_0), \pi}$ stands for the expectation with respect to the probability distribution of $(X_t)_{t=0}^{T} := (P_t, Z_t)_{t=0}^{T}$ defined by the Markov transitions from $(P_t, Z_t)$ to $(P_{t+1}, Z_{t+1})$, which are induced by the kernels $\mathcal{K}^{\pi_t(P_t, Z_t)}_t$ for $t = 0, \ldots, T-1$, started at the initial point $(P_0, Z_0) = (p_0, z_0)$. Now we turn to the optimization goal. A policy $\pi^* = (\pi^*_t)_{t=0}^{T-1}$ is called optimal if it maximizes the total expected reward $\pi \mapsto v^{\pi}_0(p, z)$ over all policies. To obtain such a policy, one introduces for $t = 0, \ldots, T-1$ the so-called Bellman operator
$$\mathcal{T}_t v(p, z) = \max_{a \in A}\left[\, r_t(p, z, a) + \mathcal{K}^a_t v(p, z) \,\right], \qquad (p, z) \in P \times Z,$$
acting on all functions $v$ for which the stochastic kernel is defined. Consider the Bellman recursion, also referred to as backward induction:
$$v_T = r_T, \qquad v_t = \mathcal{T}_t v_{t+1} \quad \text{for } t = T-1, \ldots, 0.$$
Having assumed that the reward functions are convex and globally Lipschitz and the disturbances $W_{t+1}$ are integrable, there exists a unique recursive solution $(v^*_t)_{t=0}^{T}$ to the Bellman recursion. These functions $(v^*_t)_{t=0}^{T}$ are called value functions, and they determine an optimal policy (possibly not unique) via
$$\pi^*_t(p, z) = \operatorname*{arg\,max}_{a \in A}\left[\, r_t(p, z, a) + \mathcal{K}^a_t v^*_{t+1}(p, z) \,\right], \qquad t = T-1, \ldots, 0.$$
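To make the backward induction concrete, here is a toy optimal stopping recursion in plain R (a sketch with made-up dynamics and rewards; the package's own, far more efficient implementation is described in the next section):

## Toy Bellman recursion: stop (collect max(z, 0)) or continue, T = 5
set.seed(2)
z_grid <- seq(-2, 2, length = 41)      ## grid for the continuous state
noise <- rnorm(500, 0, 0.3)            ## disturbance sampling: Z_{t+1} = Z_t + noise
v <- pmax(z_grid, 0)                   ## scrap value v_T
for (t in 4:0) {
  ## expected continuation value E[v_{t+1}(Z_t + noise)] on the grid
  cont <- sapply(z_grid, function(z) mean(approx(z_grid, v, z + noise, rule = 2)$y))
  v <- pmax(pmax(z_grid, 0), cont)     ## Bellman operator: best of stopping and continuing
}
v[21]                                  ## approximate value function at z = 0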
The aim of the package rcss is to approximate the true value functions $(v^*_t)_{t=0}^{T-1}$ and a corresponding optimal policy $\pi^* = (\pi^*_t)_{t=0}^{T-1}$. Given these approximations, our package also determines their distance to optimality, which allows the user to decide whether the result is within some acceptable margin or whether further fine-tuning is needed to obtain a more accurate result.

Numerical approach
Since the reward and scrap functions are convex in the continuous variable, the value functions are also convex and can be approximated by piecewise linear and convex functions. For this, introduce the so-called subgradient envelope $\mathcal{S}_{G^m} f$ of a convex function $f : Z \to \mathbb{R}$ on a grid $G^m \subset Z$ with $m$ points:
$$(\mathcal{S}_{G^m} f)(z) = \max_{g \in G^m} (\nabla_g f)(z), \qquad z \in Z,$$
where $\nabla_g f$ is the tangent of $f$ at the grid point $g \in G^m$. Using the subgradient envelope operator, define the double-modified Bellman operator as
$$\mathcal{T}^{m,n}_t v(p, z) = \mathcal{S}_{G^m} \max_{a \in A}\left[\, r_t(p, z, a) + \sum_{p' \in P} \alpha_{p,p'}(a) \sum_{k=1}^{n} \nu^{(k)}_{t+1}\, v\big(p', W_{t+1}(k) z\big) \,\right],$$
where $(W_{t+1}(k))_{k=1}^{n}$ is a sampling of the disturbance $W_{t+1}$ with probability weights $(\nu^{(k)}_{t+1})_{k=1}^{n}$. The corresponding backward induction
$$v^{m,n}_T = \mathcal{S}_{G^m} r_T, \qquad v^{m,n}_t = \mathcal{T}^{m,n}_t v^{m,n}_{t+1}, \quad t = T-1, \ldots, 0,$$
yields the so-called double-modified value functions $(v^{m,n}_t)_{t=0}^{T}$. Under appropriate assumptions regarding grid density and the disturbance sampling, the double-modified value functions converge uniformly to the true value functions on compact sets (see Hinz (2014)). To gauge the quality of the above approximations, we construct two random variables $\underline{v}_0(p, z_0)$ and $\overline{v}_0(p, z_0)$ whose expectations bound the true value function, i.e.
$$E\big(\underline{v}_0(p, z_0)\big) \le v^*_0(p, z_0) \le E\big(\overline{v}_0(p, z_0)\big).$$
We refer interested readers to Hinz and Yap (2015) for the technical details.
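The subgradient envelope is straightforward to sketch in plain R: store one tangent (intercept and slope) per grid point and take their pointwise maximum. A matrix of tangents of exactly this kind is the representation the package manipulates (illustrative code, not the package's internal format):

## Subgradient envelope of f(z) = z^2 on a small grid (illustrative)
f  <- function(z) z^2
df <- function(z) 2 * z                         ## a subgradient of f
g  <- seq(-2, 2, by = 0.5)                      ## grid G^m
tangents <- cbind(intercept = f(g) - df(g) * g, ## tangent at g: f(g) + df(g) * (z - g)
                  slope = df(g))
SGf <- function(z) max(tangents[, "intercept"] + tangents[, "slope"] * z)
SGf(0.3)  ## 0.05 <= f(0.3) = 0.09: a lower approximation, exact at grid points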
This procedure exhibits a helpful self-tuning property: the closer the value function approximations resemble their true unknown counterparts, the tighter the bounds above become and the lower the standard errors of the bound estimates. The R package rcss represents these piecewise linear functions as matrices and utilises nearest-neighbor algorithms (from Mangenat and Jefferies (2018)) to reduce the computational effort. Most of the computational work is done in C++ via Rcpp (Eddelbuettel and Francois (2011)) and is parallelized using OpenMP (Dagum and Menon (1998)). The following sections give some code demonstrations.

Example: Bermuda put
Stochastic control problems of optimal switching type arise naturally in the valuation of financial contracts. A simple example is given by the Bermudan put option. This option gives its owner the right, but not the obligation, to choose one of finitely many time points before the maturity of the option at which to exercise it, in order to receive a payment which depends on the price of the underlying asset at the exercise time. The so-called fair price of the Bermudan option is related to the solution of an optimal stopping problem (see Glasserman (2003)). Here, the risk-neutral undiscounted asset price process $(\tilde{Z}_t)_{t=0}^{T}$ at time steps $0, \ldots, T$ is modeled as a sampled geometric Brownian motion
$$\tilde{Z}_{t+1} = \varepsilon_{t+1} \tilde{Z}_t, \qquad t = 0, \ldots, T-1,$$
where $(\varepsilon_t)_{t=1}^{T}$ are independent identically distributed random variables following a log-normal distribution, usually modeled as
$$\varepsilon_{t+1} = e^{(\mu - \frac{\sigma^2}{2})\Delta + \sigma \sqrt{\Delta}\, N_{t+1}}$$
with independent, identically standard normally distributed random variables $(N_t)_{t=1}^{T}$, where the parameters $\sigma > 0$ and $\mu \in \mathbb{R}_+$ represent the stock price volatility and the continuously compounded interest rate respectively, measured on a yearly scale. The tenor $\Delta$ describes the length (measured in years) of the equidistant time intervals between exercise times. The fair price of such an option with strike price $K$, interest rate parameter $\rho = \mu\Delta$, and maturity date $\Delta T$ is given by the solution to the optimal stopping problem
$$\sup\big\{\, E\big(e^{-\rho\tau} \max(K - \tilde{Z}_{\tau}, 0)\big) \ :\ \tau \text{ is a } \{0, 1, \ldots, T\}\text{-valued stopping time} \,\big\}.$$
A transformation of the state space is required to represent the reward functions in a convenient way. Thus, we introduce an augmentation with 1 via
$$Z_t = \begin{pmatrix} 1 \\ \tilde{Z}_t \end{pmatrix}, \qquad t = 0, \ldots, T;$$
then it becomes possible to represent the evolution as the linear state dynamics $Z_{t+1} = W_{t+1} Z_t$ with independent and identically distributed matrix-valued random variables $(W_t)_{t=1}^{T}$ given by
$$W_{t+1} = \begin{pmatrix} 1 & 0 \\ 0 & \varepsilon_{t+1} \end{pmatrix}.$$
This switching system is defined by two positions $P = \{1, 2\}$ and two actions $A = \{1, 2\}$. Here, the positions "exercised" and "not exercised" are represented by $p = 1$ and $p = 2$ respectively, and the actions "don't exercise" and "exercise" are denoted by $a = 1$ and $a = 2$ respectively. With this interpretation, the position change is given by deterministic transitions to the target positions
$$\alpha(1, 1) = \alpha(1, 2) = \alpha(2, 2) = 1, \qquad \alpha(2, 1) = 2,$$
while the rewards at times $t = 0, \ldots, T$ are defined as
$$r_t\big(p, (z^{(1)}, z^{(2)}), a\big) = e^{-\rho t} \max(K - z^{(2)}, 0)\,\big(p - \alpha(p, a)\big), \qquad r_T\big(p, (z^{(1)}, z^{(2)})\big) = e^{-\rho T} \max(K - z^{(2)}, 0)\,\big(p - \alpha(p, 1)\big),$$
for all $p \in P$, $a \in A$, $z^{(2)} \in \mathbb{R}_+$.
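In R, the augmentation and one step of the resulting linear dynamics look as follows (a small sketch using the parameter values of the code example below):

## One step of Z_{t+1} = W_{t+1} Z_t for the put example (illustrative)
rate <- 0.06; step <- 0.02; vol <- 0.2
z <- c(1, 36)                                   ## augmented state (1, Z~_t)
eps <- exp((rate - 0.5 * vol^2) * step + vol * sqrt(step) * rnorm(1))
W <- matrix(c(1, 0, 0, eps), 2, 2)              ## disturbance matrix W_{t+1}
W %*% z                                         ## next state (1, eps * Z~_t)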

Code example
As a demonstration, let us consider a Bermuda put option with strike price 40 that expires in one year. The put option is exercisable at 51 evenly spaced time points in the year, including the start and the end of the year. The following code approximates the value functions in the Bellman recursion.
library(rcss)
rate <- 0.06    ## Interest rate
step <- 0.02    ## Time step between decision epochs
vol <- 0.2      ## Volatility of stock price process
n_dec <- 51     ## Number of decision epochs
strike <- 40    ## Strike price
control <- matrix(c(1, 1, 2, 1), nrow = 2, byrow = TRUE)  ## control[p, a]: target position
## ... (construction of grid, disturb, weight, reward, scrap and r_index)
bellman <- FastBellman(grid, reward, scrap, control, disturb, weight, r_index)

The matrix grid represents the grid points, where each row represents a point. The 3-dimensional array disturb represents the sampling of the disturbances, where disturb[,,i] gives the i-th sample; here, we use local averages on a 1000 component partition of the disturbance space. The 5-dimensional array reward represents the subgradient approximation, with reward[,,a,p,t] representing $\mathcal{S}_{G^m} r_t(p, \cdot, a)$. The object bellman is a list containing the approximations of the value functions and expected value functions for all positions and decision epochs. Please refer to the package manual for the format of the inputs and outputs. To visualise the value function of the Bermuda put option, one can plot it over the grid; solution diagnostics then compute the lower and upper bound estimates for the value of the option when $Z_0 = 36$. A hedged sketch of both steps is given below. The diagnostics take the exact reward and scrap functions as inputs, and the function FastPathPolicy computes the candidate optimal policy. The object bounds is a list containing the primals and duals for each sample path i and each position p at each decision time t. Again, please refer to the package manual for the format of the inputs and outputs.
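The original listings for the visualisation and diagnostics steps did not survive extraction, so the following is only a hedged sketch of that workflow. Only FastBellman, FastPathPolicy and GetBounds are named in the surrounding text; the helpers PathDisturb, FastAddDual and AddDualBounds, the array shapes, and the argument orders are assumptions to be checked against the package manual.

## Sketch: plot the time-0 value function for the "not exercised" position,
## assuming bellman$value[, , p, t] holds the tangent matrix of v_t(p, .)
plot(grid[, 2], rowSums(bellman$value[, , 2, 1] * grid), type = "l",
     xlab = "Stock price", ylab = "Option value")

## Exact reward and scrap functions (discounted put payoff; shapes assumed)
RewardFunc <- function(state, time) {
  output <- array(0, dim = c(nrow(state), 2, 2))  ## [grid point, position, action]
  output[, 2, 2] <- exp(-rate * step * (time - 1)) * pmax(strike - state[, 2], 0)
  output
}
ScrapFunc <- function(state) {
  output <- array(0, dim = c(nrow(state), 2))     ## [grid point, position]
  output[, 2] <- exp(-rate * step * (n_dec - 1)) * pmax(strike - state[, 2], 0)
  output
}

## 500 sample paths of the disturbances, started at Z_0 = (1, 36)
set.seed(12345)
n_path <- 500
path_disturb <- array(0, dim = c(2, 2, n_path, n_dec - 1))
path_disturb[1, 1, , ] <- 1
path_disturb[2, 2, , ] <- exp((rate - 0.5 * vol^2) * step +
                                vol * sqrt(step) * rnorm(n_path * (n_dec - 1)))
path <- PathDisturb(c(1, 36), path_disturb)

## Candidate optimal policy and primal-dual bound estimates
policy <- FastPathPolicy(path, grid, control, RewardFunc, bellman$expected)
## reuse the disturbance sampling as the nested subsimulation (assumption)
mart <- FastAddDual(path, disturb, weight, grid, bellman$value, ScrapFunc)
bounds <- AddDualBounds(path, control, RewardFunc, ScrapFunc, mart, policy)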

If the price of the underlying asset is 36, the 99% confidence interval for the option price is given by the following.

> print(GetBounds(bounds, 0.01, 2))
[1] 4.475802 4.480533

The package rcss also allows the user to test the prescribed policy from the Bellman recursion on any supplied set of sample paths. The resulting output can then be studied further with time series analysis or other statistical work. Let us emphasise the usefulness of such scenario generation: given an approximately optimal policy and backtesting, one can perform statistical analysis on the backtested values to obtain practical insights, for instance for risk analysis purposes. In the following code, we use the previously generated 500 sample paths to backtest our policy and generate histograms. Figure 2 contains the histograms of the cumulated rewards and exercise times, which can be useful to a practitioner (e.g., anticipating when the holder will exercise their option).
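The backtest listing itself was also lost in extraction; the sketch below indicates its shape. The helper FullTestPolicy and the layout of its output are assumptions based on the package manual and may differ.

## Sketch: backtest the candidate policy on the 500 sample paths,
## starting in position 2 ("not exercised")
test <- FullTestPolicy(2, path, control, RewardFunc, ScrapFunc, policy)
## Histogram of cumulated rewards
hist(test$value, breaks = 50, main = "", xlab = "Cumulated rewards")
## Histogram of exercise times; position 1 = "exercised" (layout assumed),
## paths that never exercise give NA and are dropped by hist()
ex_time <- apply(test$position == 1, 1, function(x) which(x)[1])
hist(ex_time, breaks = 50, main = "", xlab = "Exercise time")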

Example: swing option
Let us now consider the swing option, a financial contract popular in the energy business. In its simplest form, it gives the owner the right to obtain a certain commodity (such as gas or electricity) at a pre-specified price and volume at a number of exercise times, which can be freely chosen by the contract owner. Let us consider a specific case of such a contract, referred to as a unit-time refraction period swing option, in which only one right may be exercised at any given time. Given the discounted commodity price $(S_t)_{t=0}^{T}$, the so-called fair price of a swing option with $N$ rights is given by the supremum
$$\sup\left\{ E\left( \sum_{i=1}^{N} \big(S_{\tau_i} - K e^{-\rho \tau_i}\big)^{+} \right) :\ \tau_1 < \tau_2 < \cdots < \tau_N \right\}$$
over all $\{0, \ldots, T\}$-valued stopping times $\tau_1, \ldots, \tau_N$. In order to represent this control problem as a switching system, we use the position set $P = \{1, \ldots, N+1\}$ to describe the number of exercise rights remaining. That is, $p \in P$ stands for the situation when there are $p - 1$ rights remaining to be exercised. The action set $A = \{1, 2\}$ represents the choice between exercising ($a = 1$) or not exercising ($a = 2$). The control matrices $(\alpha^a_{p,p'})$ are given for the exercise action $a = 1$ and the not-exercise action $a = 2$ as
$$\alpha^1_{p,p'} = \begin{cases} 1 & \text{if } p' = \max(p - 1, 1), \\ 0 & \text{otherwise}, \end{cases} \qquad \alpha^2_{p,p'} = \begin{cases} 1 & \text{if } p' = p, \\ 0 & \text{otherwise}, \end{cases}$$
for all $p, p' \in P$. In the case of the swing option, the transition between $p$ and $p'$ occurs deterministically, since once the controller decides to exercise a right, the number of rights remaining is diminished by one. Such deterministic control of the discrete component is easier to describe in terms of the matrix $(\alpha(p, a))_{p \in P, a \in A}$, where $p' = \alpha(p, a) \in P$ stands for the discrete component which is reached from $p \in P$ by the action $a \in A$. For the case of the swing option this matrix is
$$\alpha(p, 1) = \max(p - 1, 1), \qquad \alpha(p, 2) = p, \qquad p \in P.$$
Having modelled the discounted commodity price process as an exponential mean-reverting process with a reversion parameter $\kappa \in [0, 1[$, long-run mean $\mu > 0$, and volatility $\sigma > 0$, we obtain the logarithm of the discounted price process as
$$\tilde{Z}_{t+1} = (1 - \kappa)(\tilde{Z}_t - \mu) + \mu + \sigma N_{t+1},$$
with independent standard normally distributed random variables $(N_t)_{t=1}^{T}$. A further transformation of the state space is required before linear state dynamics can be achieved. If we introduce an augmentation with 1 via $Z_t = (1, \tilde{Z}_t)^{\top}$, then it becomes possible to represent the evolution as the linear state dynamics $Z_{t+1} = W_{t+1} Z_t$ with independent and identically distributed matrix-valued random variables $(W_t)_{t=1}^{T}$ given by
$$W_{t+1} = \begin{pmatrix} 1 & 0 \\ \kappa\mu + \sigma N_{t+1} & 1 - \kappa \end{pmatrix}.$$
The reward and scrap values are given by
$$r_t\big(p, (z^{(1)}, z^{(2)}), a\big) = \big(e^{z^{(2)}} - K e^{-\rho t}\big)^{+} \big(p - \alpha(p, a)\big)$$
for $t = 0, \ldots, T-1$ and
$$r_T\big(p, (z^{(1)}, z^{(2)})\big) = \big(e^{z^{(2)}} - K e^{-\rho T}\big)^{+} \big(p - \alpha(p, 1)\big)$$
respectively, for all $p \in P$ and $a \in A$.
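For concreteness, the matrix $(\alpha(p, a))$ translates directly into the package's control-matrix format, where entry $(p, a)$ gives the target position (a small R sketch with an illustrative number of rights):

## Control matrix for a unit-time refraction swing option (illustrative N)
N <- 5                                   ## number of rights
n_pos <- N + 1                           ## position p means p - 1 rights remain
control <- cbind(pmax((1:n_pos) - 1, 1), ## a = 1: exercise, lose one right
                 1:n_pos)                ## a = 2: don't exercise, keep position
control                                  ## row p gives the next position under each action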

Example: optimal energy storage
This section examines optimal energy storage for an energy retailer in the presence of a battery.The model is beyond the scope of this paper and this paper points any interested readers to Hinz and Yee (2017a) for more details.

Figure 4: Energy dispatch in the presence of renewable energy and battery storage.
The model is represented by the figure above, and its basic gist is as follows:

• At every decision epoch, the energy retailer faces an energy imbalance determined by the net energy demand and the retailer's forward position in the electricity market.
• Any real-time imbalance must be rectified by buying or selling at real-time grid prices.
• When the electricity retailer has access to a battery, any imbalance can be offset (at least partially) by stored energy. This battery has fixed capacity.
• The retailer's action is given by the size of the retailer's forward position in the electricity market. These actions determine the expected amount of energy stored in the battery at each time.
• The goal of the retailer is to reduce the amount of balancing costs and forward trading costs.
The example begins by specifying the grid, the battery specifications, the action set, and the effect of each action on the battery level; a hedged sketch of this set-up is given below. The grid is used to represent the continuous state component that drives the forward price in the electricity market. Here, the battery has capacity 100 and is discretized in steps of 5 from 0. The action set represents the safety margin (the amount of energy bought in the day-ahead market that exceeds the predicted net demand) that the retailer may choose. The control array captures the transition probabilities of the expected change in the battery level, based on the current battery level and the retailer's safety margin. The subgradient envelopes of the reward and scrap functions are then specified. Essentially, the reward is given by the cost of the forward trading and the cost of any expected shortages or excess energy that needs to be balanced. At the final time, the retailer sells all stored energy in the battery at the forward price; this is captured by the scrap reward function.
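All names and numeric choices in the sketch below are illustrative placeholders rather than the published example's values.

## Grid for the continuous component (1, Z_t) driving the forward price
n_grid <- 101
grid <- cbind(rep(1, n_grid), seq(-3, 3, length = n_grid))
## Battery: capacity 100, discretized in steps of 5 from 0
battery <- seq(0, 100, by = 5)
n_pos <- length(battery)                  ## positions = battery levels
## Actions: safety margins the retailer may choose (placeholder values)
margin <- seq(0, 25, by = 5)
n_action <- length(margin)
## Control array: control[p, a, p'] = probability that the battery moves
## from level p to level p' under safety margin a (placeholder dynamics)
control <- array(0, dim = c(n_pos, n_action, n_pos))
for (p in 1:n_pos) {
  for (a in 1:n_action) {
    target <- min(p + (a - 1), n_pos)     ## placeholder: larger margins drift the level up
    control[p, a, target] <- 1
  }
}
## (the subgradient arrays reward and scrap are built analogously to the put example)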
## Parameters for AR(1) process (Z_t)
mu <- 0
sigma <- 0.5
phi <- 0.9
## Disturbance sampling
n_disturb <- 1000  ## size of sampling
disturb_weight <- rep(1/n_disturb, n_disturb)  ## probability weights
disturb <- array(matrix(c(1, 0, 0, phi), ncol = 2, byrow = TRUE),
                 dim = c(2, 2, n_disturb))
## fill the random entry with local averages of the AR(1) noise
## (sampling via local averages, as in the put example)
part <- qnorm(seq(0, 1, length = n_disturb + 1))
for (i in 1:n_disturb) {
  disturb[2, 1, i] <- mu * (1 - phi) +
    sigma * (dnorm(part[i]) - dnorm(part[i + 1])) / (pnorm(part[i + 1]) - pnorm(part[i]))
}
r_index <- matrix(c(2, 1), ncol = 2)  ## randomness index
## Fast Bellman recursion
bellman <- FastBellman(grid, reward, scrap, control, disturb, disturb_weight, r_index)

Solution diagnostics then proceed as in the Bermuda put example: we compute the lower and upper bound estimates and output the results (shown in Table 1), which give the expected cumulative cost (and its 99% confidence interval) faced by the retailer depending on the current amount of energy stored in the battery. If interested, we point the reader to Hinz and Yee (2017a) for more details regarding this interesting problem.

Example: natural resource extraction

As a final example, consider the owner of a mine who may keep the mine open, keep it closed, or abandon it. We define the subgradient representation of the reward and scrap functions as follows. When the mine is open, the reward is essentially the income from selling the extracted ore minus the cost of extracting it. When the mine is closed, the reward is the cost of maintenance. If the owner switches between the open and closed states, a switching cost is incurred. When the owner relinquishes ownership of the mine, there is no further reward or cost, and so the scrap reward function is set to zero. These objects are then passed into the FastBellman function, which computes the value function approximations using nearest neighbors.
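For flavour, here is a tiny sketch of how the rewards just described translate into subgradient tangents; all numbers are illustrative placeholders and the actual parameters and array formats are in the package demo.

## Tangents (intercept, slope) of the mine rewards in the ore price z2 (illustrative)
output_rate <- 10     ## units of ore extracted per epoch
extract_cost <- 50    ## cost of extracting that ore
maintain_cost <- 20   ## maintenance cost while the mine is closed
## Open mine: income from selling the extracted ore minus the extraction cost;
## linear in the price, so the same tangent applies at every grid point
open_tangent <- c(intercept = -extract_cost, slope = output_rate)
## Closed mine: maintenance cost only
closed_tangent <- c(intercept = -maintain_cost, slope = 0)
## Abandoned mine: no reward or cost, and the scrap function is identically zero
scrap_tangent <- c(intercept = 0, slope = 0)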

Conclusion
This paper gives a demonstration of our R package rcss for solving optimal switching problems. Our package represents the first software implementation of this class of algorithms. The problem setting discussed in this paper is broad and can be used to model a wide range of problems. Using nearest-neighbor algorithms, the package rcss is able to solve some real-world problems in an accurate and quick manner. It is also worth mentioning that this R package can be used to address Markov decision processes with a hidden Markov model, as shown in Hinz and Yee (2017b). However, this is not demonstrated here in order to preserve the brevity of this paper.

Figure 2: Distribution of cumulated rewards and exercise times.

Figure 3: Distribution of cumulated rewards and number of exercises.