Hierarchical Analyses Applied to Computer System Performance: Review and Call for Further Studies

Alexander Thomasian
Thomasian and Associates
Pleasantville, NY
[email protected]

Abstract

We review studies based on analytic (A) and simulation (S) methods for hierarchical performance analysis of Queueing Network - QN models. A at lower and S at higher level have been applied most. The proposed methods result in an order of magnitude reduction in performance evaluation cost with respect to simulation. The computational cost at the lower level to obtain an exact solution is reduced when the computer system can be modeled as a product-form QN amenable to a low cost solution. A Continuous Time Markov Chain - CTMC or discrete-event simulation can then be used at the higher level. We first consider a multiprogrammed transaction - txn processing system with Poisson arrivals and predeclared lock requests. Txns with lock conflicts with active txns are held in a FCFS queue and txns are activated after they acquire all requested locks. Txn throughputs obtained by the analysis of multiprogrammed computer systems serve as the transition rates in a higher level CTMC to determine txn response times. We next analyze a task system where task precedence relationships are specified by a directed acyclic graph to determine its makespan. Task service demands are specified on the devices of a computer system. The composition of tasks in execution determines their processing time and throughputs, which serve as transition rates among the states of the CTMC model. To reduce memory space requirements the CTMC is built and solved one set of task completions at a time. As a third example we consider the hierarchical simulation of a timesharing system with two user classes. Txn throughputs in processing various combinations of requests are obtained by analyzing a closed product-form QN model. A discrete event simulator is provided. More detailed QN modeling parameters, such as the distribution of the number of cycles of tasks consisting of Fork/Join (F/J) requests affect performance. 
This detail can be taken into account in Schwetman’s hybrid simulation method, which counts the remaining number of cycles in a CSM-like queueing model. We discuss an extension to hybrid simulation that adjusts job service demands according to elapsed time, rather than counting cycles. A section reviewing related studies is provided. Equilibrium Point Analysis to reduce the computational cost in applying hierarchical analysis is presented in the Appendix. The discussion is applicable to performance modeling of manufacturing systems.

1 Introduction

Product-form Queueing Networks - QNs were initially restricted to single- and multi-server nodes with exponential service times and FCFS scheduling Jackson 1957 [25]. Product-form QNs were extended to Processor-Sharing - PS and Last-Come First-Served Preemptive Resume - LCFSPR Kleinrock 1976 [29] and delay servers. The latter servers allow general service times according to the BCMP theorem Baskett et al. 1975 [5]. PS is an extreme form of round-robin CPU scheduling, where each job is allowed a quantum $q\rightarrow 0$ time units before preemption [29].

The Buzen Convolution Algorithm - BCA Buzen 1973 [8] was a first step in efficiently solving product-form closed QNs, where completed jobs are immediately replaced by a new job. BCA was applied to the Central Server Model - CSM described below.

Central Server Model - CSM

CSM is a closed QN model of a multiprogrammed computer system Buzen 1973 [8], which consists of a CPU and multiple disks. Jobs alternate between CPU and disk processing until they are completed. Completed jobs are replaced by another job immediately in closed systems, or after think times modeled as delay servers in time-sharing systems.

The CPU is designated as the central station ${\cal S}_{1}$ and the $N-1$ disks as peripheral stations ${\cal S}_{n},2\leq n\leq N$ . Given the state transition probabilities ${\cal S}_{i}\xrightarrow{p_{i,j}}{\cal S}_{j}$ the following transitions are applicable to CSM.

p_{1,n},\quad 2\leq n\leq N,\qquad p_{n,1}=1,\quad 2\leq n\leq N.

The self-transition $p_{1,1}=1-\sum_{n=2}^{N}p_{1,n}$ implies the completion of a job in a closed QN (or a job that leaves the system in an open QN). The number of visits to the CPU ( $\bar{v}_{1}$ ) is given by the geometric distribution Trivedi 2001 [66].

q_{k}=p_{1,1}(1-p_{1,1})^{k-1},\quad k\geq 1,\qquad\bar{k}=v_{1}=1/p_{1,1}.

The relative number of visits to the stations is obtained by solving

\underline{v}=\underline{v}{\bf P}.

It follows that $v_{n}=p_{1,n}v_{1}=p_{1,n}/p_{1,1},2\leq n\leq N$ .

Given that the mean service time per visit at ${\cal S}_{n}$ is $\bar{x}_{n}$ , the mean loading per job is $X_{n}=v_{n}\bar{x}_{n},1\leq n\leq N$ .
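The visit ratios and loadings defined above can be illustrated with a short Python sketch. The branching probabilities and per-visit service times below are made-up values chosen for illustration; they are not taken from any of the cited studies.

```python
def csm_loadings(p1, xbar):
    """Visit ratios and loadings of a Central Server Model.

    p1[0] is the CPU self-transition probability p_{1,1} (job completion);
    p1[n] for n >= 1 is the branching probability from the CPU to disk n.
    xbar[n] is the mean service time per visit at station n (0 = CPU).
    """
    p11 = p1[0]
    # v_1 = 1/p_{1,1} for the CPU and v_n = p_{1,n}/p_{1,1} for the disks.
    visits = [1.0 / p11] + [p / p11 for p in p1[1:]]
    # Loading per job: X_n = v_n * xbar_n.
    loadings = [v * x for v, x in zip(visits, xbar)]
    return visits, loadings


# Hypothetical CSM with one CPU and two disks.
visits, loadings = csm_loadings([0.2, 0.5, 0.3], [10.0, 30.0, 40.0])
```

With these assumed inputs a job visits the CPU five times on average, and the loadings can be passed directly to a convolution or MVA solver.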

Sauer and Chandy 1975 [47] consider the analysis of a CSM whose CPU has FCFS or priority (nonpreemptive and preemptive) scheduling and nonexponential service times.

CSM is a single job class QN, but BCA was extended to multiple job classes by Buzen and coworkers at BGS for inclusion in the BEST/1 capacity planning tool for MVS OS, renamed z/OS [9]. This extension was also done independently at IBM by Reiser and Kobayashi in 1975 [41]. The tutorial by Williams and Bhandiwad 1976 [70] on the use of generating functions in developing the convolution algorithm for multiple job classes was extended in Thomasian and Nadji 1981 [54].

The Mean Value Analysis - MVA method developed by Reiser and Lavenberg 1980 [42, 43] has the same computational cost as BCA, but higher memory requirements. It has numerical problems in dealing with state-dependent servers whose service rate varies with the number of jobs, such as multiserver queues. MVA on the other hand has led to several low-cost, iterative solution methods, such as Bard-Schweitzer, see e.g., Lazowska et al. 1984 [34], and Linearizer Chandy and Neuse 1982 [12]. Efficient approximate computational methods were later developed by extending MVA to non-product-form QNs, such as FCFS scheduling with general service times Lazowska et al. 1984 [34].

Analysis of open (resp. closed) QNs requires the arrival rate (resp. degree of concurrency or MultiProgramming Level - MPL) and service demands or loadings, which are the product of the mean number of job visits to the devices of a computer and the mean service time per visit [8].
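As an illustration of how the MPL and the loadings suffice to obtain the throughput characteristic of a closed QN, the following is a minimal sketch of exact single-class MVA. It assumes fixed-rate queueing stations only (PS, or FCFS with exponential service), so it does not cover the state-dependent servers mentioned above; the function name and the example loadings are hypothetical.

```python
def mva(demands, max_jobs):
    """Exact single-class MVA for fixed-rate queueing stations.

    demands[n] is the loading X_n at station n.
    Returns the list [T(1), ..., T(max_jobs)] of system throughputs.
    """
    queue = [0.0] * len(demands)      # mean queue lengths with zero jobs
    throughputs = []
    for m in range(1, max_jobs + 1):
        # Arrival theorem: an arriving job sees the queue of a system with m-1 jobs.
        residence = [x * (1.0 + q) for x, q in zip(demands, queue)]
        t = m / sum(residence)        # Little's law applied to the whole network
        queue = [t * r for r in residence]
        throughputs.append(t)
    return throughputs


# Hypothetical loadings for a CPU and two disks, MPL from 1 to 4.
T = mva([50.0, 75.0, 60.0], 4)
```

The resulting list can serve as the throughput characteristic $T(k)$ of a flow-equivalent service center.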

IBM’s Software Measurement Facility - SMF measures the mean time computer devices (CPU and disk) are busy serving tasks. The service demands differ according to job class, e.g., batch versus online transactions - txns. Given the MultiProgramming Level - MPL and service demands, the BEST/1 Buzen et al. 1978 [9, 10] and MAP Lazowska et al. 1984 [34] capacity planning tools use QN analysis to obtain performance metrics of interest such as job throughput, device utilizations, response times, and queue lengths [33, 34, 6, 31].

When tasks are to be processed at heterogeneous computer systems, e.g., with different CPU speeds or different storage systems (Hard Disk Drives - HDDs versus Solid State Disks - SSDs), task processing requirements should be specified in a device-independent manner, e.g., program path lengths, which can be converted to CPU time based on the CPU's MIPS rating.

Processing Time of Fork/Join Requests

As an example of hierarchical modeling consider the time it takes to execute the tasks of a single k-way F/J request on a multiprogrammed computer system. The approximate hierarchical analysis method based on decomposition Courtois 1975 [16] (see e.g., Section 9.3.1 in Lazowska et al. 1984 [34]) uses a Flow-Equivalent Service Center - FESC, whose throughput characteristic is obtained by analyzing the underlying QN model.

The $K$ tasks are assumed to have identical service demands and can be activated concurrently by a computer system with maximum MPL $M_{max}\geq K$ . Task completion rates can be determined at low cost, yielding $T(k),1\leq k\leq K$ . In hierarchical modeling the times between task completions are assumed to be exponentially distributed and the completion time of $K$ tasks can be determined by a death process [28].

S_{k}\xrightarrow{T(k)}S_{k-1},\quad K\geq k\geq 1.

The completion time of F/J requests is:

R_{F/J}(K)=\sum_{k=1}^{K}R(k)\mbox{ where }R(k)=[T(k)]^{-1}.
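The death-process formula above amounts to summing the mean holding times of the states $S_{K},\ldots,S_{1}$. A minimal Python sketch, assuming the throughputs $T(k)$ are already available as a list (the example rates are hypothetical):

```python
def fork_join_time(throughputs):
    """Mean completion time of a K-way F/J request via a death process.

    throughputs[k-1] = T(k), the task completion rate with k tasks active.
    Returns R_{F/J}(K) = sum over k of 1/T(k).
    """
    return sum(1.0 / t for t in throughputs)


# Hypothetical case without interference: T(k) = k * mu with mu = 1,
# which gives the harmonic number H_3 = 1 + 1/2 + 1/3.
r_fj = fork_join_time([1.0, 2.0, 3.0])
```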

1.1 Degree of Concurrency Constraints

Product-form QN models of computer systems with Poisson arrivals with rate $\lambda$ are not amenable to a direct solution when the maximum degree of concurrency or MPL, say $M_{max}$ , is taken into account, because the number of jobs at the QN may exceed $M_{max}$ .

Assuming that the throughput characteristic $T(M)$ is a nondecreasing function of $M$ , the maximum sustainable arrival rate satisfies $\lambda_{max}<T(M_{max})$ , since otherwise the system becomes saturated, i.e., a queue of infinite length is formed Kleinrock 1975 [28].

The MPL constraint is taken into account in applying a birth-death queueing model with arrival rate $\lambda$ and service rate $\mu_{k}$ by “flattening” the throughput characteristic beyond $T(M_{max})$ for the FESC as given by Eq. (1):

\mu_{k}=\begin{cases}T(k),&1\leq k\leq M_{max}\\ T(M_{max}),&k\geq M_{max}\end{cases}

(1)

The mean number of tasks in the system and the memory queue (MemQ) are obtained as follows.

\bar{N}=\sum_{k\geq 1}kp_{k},\quad\bar{N}_{MemQ}=\sum_{k\geq M_{max}}(k-M_{max})p_{k}.

(2)

The mean response time in the system and the mean waiting time in the memory queue are obtained by applying Little’s result, i.e., dividing by the task arrival rate $\lambda$ [28]:

R_{system}=\bar{N}/\lambda\mbox{ and }W_{MemQ}=\bar{N}_{MemQ}/\lambda.

State probabilities of the birth-death process with arrival rate $\lambda$ and processing rate $\mu_{k},k\geq 1$ are obtained by setting $S=p_{0}=1$ , $\bar{N}=0$ and iterating for $k\geq 1$ until $p_{k}\leq\epsilon$ :

p_{k}=(\lambda/\mu_{k})p_{k-1},\quad S\mathrel{+}=p_{k},\quad\bar{N}\mathrel{+}=kp_{k},

(3)

followed by the normalization $\bar{N}=\bar{N}/S$ and $R=\bar{N}/\lambda$ .
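The recursion of Eq. (3), combined with the flattened throughput characteristic of Eq. (1), can be sketched in Python as follows. The function name and return convention are illustrative. One detail is an assumption added here rather than part of the text: the loop is not allowed to stop before $k$ reaches $M_{max}$, since $p_{k}$ is only guaranteed to decrease geometrically beyond that point.

```python
def birth_death(lam, throughputs, eps=1e-12):
    """Solve the birth-death model with mu_k = T(min(k, M_max)).

    throughputs = [T(1), ..., T(M_max)]; lam is the Poisson arrival rate.
    Returns (mean number in system, mean response time,
             mean waiting time in the memory queue).
    """
    m_max = len(throughputs)
    if lam >= throughputs[-1]:
        raise ValueError("unstable: lam must be below T(M_max)")
    p, total, n_bar, n_memq, k = 1.0, 1.0, 0.0, 0.0, 0   # unnormalized p_0 = 1
    while True:
        k += 1
        p *= lam / throughputs[min(k, m_max) - 1]        # p_k = (lam/mu_k) p_{k-1}
        total += p
        n_bar += k * p
        n_memq += max(k - m_max, 0) * p
        if k >= m_max and p <= eps:                      # assumed stopping guard
            break
    n_bar /= total
    n_memq /= total
    return n_bar, n_bar / lam, n_memq / lam              # Little's result


# Sanity check against M/M/1 with utilization 0.5: N = 1, R = 2, W = 1.
n_bar, resp, wait = birth_death(0.5, [1.0])
```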

Example I: Transactions with predeclared lock requests: Txns with predeclared lock requests arrive according to a Poisson process, with class ${\cal C}_{j}$ txns constituting a fraction $f_{j},1\leq j\leq J$ of arrivals, and can execute concurrently if they have no lock conflicts Thomasian 1985 [58]. Txn response time is the sum of the queueing delay in a FCFS queue, where the txn waits until it acquires all of its locks and is activated, and its execution time at the computer system. It is assumed that the maximum MPL is not a constraint.

An approximate solution is also presented by analyzing the QN for various degrees of concurrency and using the resulting throughput as if there is a single job class.

Example II: Tasks with precedence relationships: Task precedence relationships are specified by a directed acyclic graph - dag which is referred to as a task system in Coffman and Denning 1973 [15]. Task processing times are specified by their execution time on the devices of a single computer. An optimal scheduling algorithm for two processors is presented in this book, while scheduling with more processors is explored in Adam et al. 1973 [2]. The results were compared against the bound by Fernandez and Bussell 1973 [20].

A Continuous Time Markov Chain - CTMC as the higher level model and a product-form QN model for task execution on a multiprogrammed computer system are considered in Thomasian and Bay 1986 [59].

Example III: Timesharing system: Simulation is a flexible approach for the higher level and its use is illustrated in the context of a timesharing system with two job classes Sauer 1981 [49]. At the lower modeling level task throughputs are obtained by analyzing a product-form closed QN model. Section 4 specifies a discrete event simulation for higher level analysis of a timesharing system, whose tasks are processed in a multiprogrammed computer system.

Example IV: Fork/Join Analysis: A detailed QN model is required in evaluating the performance of Fork/Join - F/J systems Thomasian 2014 [64]. This is because the completion time of several tasks started concurrently is affected by the distribution of the number of processing cycles. Detailed modeling can be better handled by hybrid simulation Schwetman 1978 [51].

The paper is organized as follows. Section 2 discusses a hierarchical model for analyzing a txn processing system with predeclared lock requests. Section 3 determines the makespan of a task system, whose tasks execute at a computer system. Section 4 describes a simulation model to estimate the mean response times of timesharing requests. The effect of transition probabilities on completion times is discussed in Section 5. Section 6 describes the hybrid simulation method and proposes extensions to it, which were discussed earlier in [56] and require further investigation and validation. Related work is presented in Section 7. Conclusions and further work are provided in Section 8. Equilibrium Point Analysis - EPA applied to reducing the cost of solving a txn processing system is presented in the Appendix.

2 Transaction Processing with Predeclared Lock Requests

The effect of granularity of locking on txn response time is investigated in Thomasian 1985 [58]. Txns are activated after acquiring all locks, while txns with lock conflicts with currently active txns are held in a queue until requested locks are released by completed txns. Txns are processed in FCFS order.

Txn response time is the sum of the queueing delay to acquire all locks and the txn execution time at the computer system, which is represented by a product-form QN model. We consider $J=5$ txn classes and a maximum degree of concurrency $K=2$ , since only txns in classes ${\cal C}_{1}$ and ${\cal C}_{2}$ can be processed concurrently. Txns in ${\cal C}_{j}$ constitute a fraction $f_{j}$ of the Poisson arrival stream.

Using the hierarchical decomposition method, the txn throughputs obtained from the QN model are incorporated into the higher-level model, which is a 2-dimensional CTMC. One dimension is the composition of txns in execution

{\cal S}_{j},1\leq j\leq J\mbox{ and }{\cal S}_{1,2},

and the other dimension is the number of txns in the system.

Wallace and Rosenberg’s 1966 Recursive Queueing Analyzer - RQA [68] was used to succinctly specify the sparse, regularly structured state transition matrix ( ${\bf Q}$ ). The number of states in the second dimension is set to be sufficiently large so that the fraction of txns lost due to the finite capacity is negligibly small for the given arrival rate. The state probabilities are obtained by an iterative method of the form

\underline{\pi}(k+1)=\underline{\pi}(k)(c{\bf Q}+{\bf I}),

where $c$ is a constant and ${\bf I}$ is the identity matrix Kleinrock 1975 [28], Bolch et al. 2006 [6].
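The iteration above can be sketched in a few lines of Python. The text only says that $c$ is a constant; the choice $c=0.99/\max_{i}|q_{i,i}|$ below is an assumption made here so that $c{\bf Q}+{\bf I}$ is a stochastic matrix with positive diagonal entries. The dense list-of-lists representation and the two-state generator are likewise illustrative, not the sparse RQA specification.

```python
def ctmc_steady_state(Q, tol=1e-12, max_iter=100000):
    """Iterate pi(k+1) = pi(k) (c Q + I) until successive vectors agree.

    Q is the generator matrix as a list of rows, each row summing to zero.
    """
    n = len(Q)
    c = 0.99 / max(-Q[i][i] for i in range(n))   # assumed choice of the constant c
    pi = [1.0 / n] * n                           # uniform starting vector
    for _ in range(max_iter):
        new = [pi[j] + c * sum(pi[i] * Q[i][j] for i in range(n))
               for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi


# Hypothetical two-state chain: rate 1 from state 0 to 1, rate 2 back.
pi = ctmc_steady_state([[-1.0, 1.0], [2.0, -2.0]])
```

Because the rows of ${\bf Q}$ sum to zero, each iteration preserves the sum of the probability vector, so no renormalization is needed.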

Given the state probabilities we can obtain the mean number of txns in different classes $\overline{N}_{j},1\leq j\leq J$ . The mean txn response times follow as $R_{j}=\overline{N}_{j}/\lambda_{j}$ .

The analysis can be extended to FCFS with skipping, as in the analysis of static locking in Thomasian and Ryu 1983 [55]. The latter analysis postulates a fine granularity of locking and that lock requests are uniformly distributed over database granules.

Aggregating Multiple Transaction Classes

The resulting system can be specified by the throughput characteristic

T(k),1\leq k\leq K_{max}\mbox{ and }T(k)=T(K_{max}),k\geq K_{max},

which is then incorporated into a higher-level birth-death model.

Txn throughput with a single class is a weighted sum according to txn frequencies. Note that txns in the same class are not compatible with each other and only txns in ${\cal C}_{1}$ and ${\cal C}_{2}$ can be executed together. With at most two txns in execution and an infinite backlog of txns processed in FCFS order we have the transition rate matrix $T$ among the execution states:

T_{j,j}=-\sum_{i\neq j}T_{j,i},\quad 1\leq j\leq 6.

Solving the set of linear equations yields the state probabilities.

\underline{p}{\bf T(2)}=0\mbox{ and }\sum_{\forall(i)}p_{i}=1.

The matrix for a closed task system with two tasks and compatible ${\cal C}_{1}$ and ${\cal C}_{2}$ classes is as follows.

\begin{pmatrix}
T_{1,1}&0&\frac{f_{3}}{1-f_{2}}T_{1}&\frac{f_{4}}{1-f_{2}}T_{1}&\frac{f_{5}}{1-f_{2}}T_{1}&\frac{f_{1}f_{2}}{1-f_{2}}T_{1}\\
0&T_{2,2}&\frac{f_{3}}{1-f_{1}}T_{2}&\frac{f_{4}}{1-f_{1}}T_{2}&\frac{f_{5}}{1-f_{1}}T_{2}&\frac{f_{1}f_{2}}{1-f_{1}}T_{2}\\
f_{1}(1-f_{2})T_{3}&f_{2}(1-f_{1})T_{3}&T_{3,3}&f_{4}T_{3}&f_{5}T_{3}&2f_{1}f_{2}T_{3}\\
f_{1}(1-f_{2})T_{4}&f_{2}(1-f_{1})T_{4}&f_{3}T_{4}&T_{4,4}&f_{5}T_{4}&2f_{1}f_{2}T_{4}\\
f_{1}(1-f_{2})T_{5}&f_{2}(1-f_{1})T_{5}&f_{3}T_{5}&f_{4}T_{5}&T_{5,5}&2f_{1}f_{2}T_{5}\\
(1-f_{2}){T^{\prime}}_{2}&(1-f_{1}){T^{\prime}}_{1}&0&0&0&T_{6,6}
\end{pmatrix}

(10)

After solving the set of linear equations to obtain the probabilities $P[E_{i}]$ of the execution states $E_{i},\forall{i}$ , we can obtain the throughputs for class ${\cal C}_{j}$ with $k$ txns in execution.

\displaystyle T_{j}(k)=\sum_{\forall{i}}P[E_{i}]T_{j}(E_{i})

(11)

The overall txn throughput is

\displaystyle T(k)=\sum_{\forall{j}}T_{j}(k)

(12)

The mean number of txns in ${\cal C}_{j}$ with $k$ txns in the system is:

\displaystyle\bar{N}_{j}(k)=\sum_{\forall{i}}P[E_{i}]|E_{i}|_{j}

(13)

where $|E_{i}|_{j}$ is the number of txns in class ${\cal C}_{j}$ executing in that state (zero or one).

Txn throughputs for the five classes with $k=2$ txns are:

T_{j}(2)=P({\cal S}_{j})T_{j}({\cal S}_{j})+P({\cal S}_{1,2})T_{j}({\cal S}_{1,2}),\quad j=1,2.

T_{j}(2)=P({\cal S}_{j})T_{j}({\cal S}_{j}),\quad 3\leq j\leq 5.

With $k$ txns in the system the class throughputs are proportional to the txn frequencies:

T_{i}(k)/T_{j}(k)=f_{i}/f_{j}\mbox{ and }T_{i}(k)=f_{i}T(k),

where $T(k)=\sum_{\forall{i}}T_{i}(k)$ . For degree of concurrency $k=2$ the mean number of txns in different classes is:

\overline{N}_{1}(2)=P[{\cal S}_{1}]\left(1+\frac{f_{1}}{1-f_{2}}\right)+P[{\cal S}_{1,2}]+f_{1}\sum_{j=3}^{5}P[{\cal S}_{j}]

\overline{N}_{2}(2)=P[{\cal S}_{2}]\left(1+\frac{f_{2}}{1-f_{1}}\right)+P[{\cal S}_{1,2}]+f_{2}\sum_{j=3}^{5}P[{\cal S}_{j}]

\overline{N}_{j}(2)=P[{\cal S}_{j}](1+f_{j})+P[{\cal S}_{1}]\frac{f_{j}}{1-f_{2}}+P[{\cal S}_{2}]\frac{f_{j}}{1-f_{1}},\quad 3\leq j\leq 5.

3 Makespan of Task System with Multiprogrammed Tasks

A task system is specified by a dag with precedence relationships among tasks. Tasks in Coffman and Denning 1973 [15] have fixed execution times. We consider a task system whose tasks are specified by their service demands at the devices of a multiprogrammed computer system. In Thomasian and Bay 1986 [59] we develop a hierarchical analysis to determine the makespan, i.e., the completion time of the task system.

We consider a simple task system ${\cal\bf T}$ with six tasks. Two complementary tasks are added: $\tau_{0}$ , which precedes all tasks and $\tau_{\infty}$ which succeeds tasks with no successors otherwise. These two tasks are processed instantaneously.

{\cal\bf T}=\{\tau_{0},\tau_{1},\tau_{2},\tau_{3},\tau_{4},\tau_{5},\tau_{6},\tau_{\infty}\}

with the following precedence relationships:
$\tau_{0}\prec\{\tau_{1},\tau_{2}\}\prec\tau_{3}$ ,
$\tau_{0}\prec\tau_{4}\prec\{\tau_{5},\tau_{6}\}$ ,
$\{\tau_{3},\tau_{5},\tau_{6}\}\prec\tau_{\infty}$ .

The task system makespan is $C=\mbox{Init}_{\infty}$ . We are also interested in the initiation ( $Init_{i}$ ), completion ( $Comp_{i}$ ) and execution $Exec_{i}=Comp_{i}-Init_{i}$ time of the $i^{th}$ task. The execution time of a task is the time the task spends in the system.

The task system ${\cal\bf T}$ leads to the CTMC for task execution states given in Table 1. Task combinations executed together known as tasksets are given in a list. An implicit instant transition from state $\{\tau_{\infty}\}$ to state $\{\tau_{0}\}$ can be postulated so the execution of the task system is repeated.

$L_{0}$	$\{\tau_{0}\}$
$L_{1}$	$\{\tau_{1},\tau_{2},\tau_{4}\}$
$L_{2}$	$\{\tau_{1},\tau_{4}\}$	$\{\tau_{2},\tau_{4}\}$	$\{\tau_{1},\tau_{2},\tau_{5},\tau_{6}\}$
$L_{3}$	$\{\tau_{1},\tau_{2},\tau_{5}\}$	$\{\tau_{1},\tau_{2},\tau_{6}\}$	$\{\tau_{1},\tau_{5},\tau_{6}\}$	$\{\tau_{2},\tau_{5},\tau_{6}\}$	$\{\tau_{3},\tau_{4}\}$
$L_{4}$	$\{\tau_{1},\tau_{2}\}$	$\{\tau_{1},\tau_{5}\}$	$\{\tau_{1},\tau_{6}\}$	$\{\tau_{2},\tau_{5}\}$	$\{\tau_{2},\tau_{6}\}$	$\{\tau_{3},\tau_{5},\tau_{6}\}$	$\{\tau_{4}\}$
$L_{5}$	$\{\tau_{1}\}$	$\{\tau_{2}\}$	$\{\tau_{3},\tau_{5}\}$	$\{\tau_{3},\tau_{6}\}$	$\{\tau_{5},\tau_{6}\}$
$L_{6}$	$\{\tau_{3}\}$	$\{\tau_{5}\}$	$\{\tau_{6}\}$
$L_{7}$	$\{\tau_{\infty}\}$

Table 1: CTMC has been built taking into account precedence relationships top to bottom. Given that one task completes per level the number of levels equals the number of tasks. Tasksets at lower levels are either a subset of tasksets at the higher level or additional tasks being activated when precedence relationships are satisfied.
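The level-by-level construction of the tasksets in Table 1 can be reproduced with a short Python sketch. It assumes, as in the table, that every task whose predecessors have completed starts immediately (the MPL is not binding), and it omits the dummy tasks $\tau_{0}$ and $\tau_{\infty}$. The function name and the dictionary encoding of the dag are hypothetical.

```python
def enumerate_levels(preds, tasks):
    """Enumerate the tasksets of a task system one completion at a time.

    preds[t] is the set of immediate predecessors of task t.
    Returns a list of levels; each level is a set of frozensets of
    concurrently executing tasks.
    """
    def executing(done):
        return frozenset(t for t in tasks
                         if t not in done and preds.get(t, set()) <= done)

    level = {frozenset()}                 # sets of completed tasks at this level
    levels = []
    while level:
        levels.append({executing(done) for done in level})
        # One task completes per level: extend each completed set by one task.
        level = {done | {t} for done in level for t in executing(done)}
    return levels


# Precedence relationships of the six-task example.
preds = {3: {1, 2}, 5: {4}, 6: {4}}
levels = enumerate_levels(preds, [1, 2, 3, 4, 5, 6])
sizes = [len(level) for level in levels]
```

Here `levels[0]` corresponds to $L_{1}$ of Table 1 and the final empty taskset corresponds to $\{\tau_{\infty}\}$, so `sizes` lists the number of states in $L_{1}$ through $L_{7}$.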

The completion of $\tau_{4}$ leads to the following transition

S=\{\tau_{1},\tau_{2},\tau_{4}\}\rightarrow\{\tau_{1},\tau_{2},\tau_{5},\tau_{% 6}\}

The state holding time is the inverse of the sum of task throughputs, which determine the rates of an exponential distribution according to the decomposition principle Courtois 1975 [16], which is discussed informally in Lazowska et al. 1984 [34]. The notation used in this section is as follows:

$I$ : Number of tasks including two dummy tasks, which complete instantaneously.
$L$ : Number of CTMC levels with $L=I$ , since one task completes per level.
$S$ : State representation.
$S_{\ell}$ : Set of states at level $\ell$ .
$|S|=\{\tau_{i},\tau_{j},\ldots\}$ : Set of tasks in execution at state $S$ .
${S}_{i}$ : Set of states at which task $\tau_{i}$ is executed.
${S}^{+}$ : Immediate successors to state $S$ .
${S^{-}}$ : Immediate predecessors to state $S$ .
$P(S)$ : Steady state probability of being in state $S$ .
$T_{i}(S)$ : Completion rate or throughput of task $\tau_{i}\in|S|$ .
$T(S)=\sum_{\tau_{i}\in|S|}T_{i}(S)$ : Sum of completion rates at ${S}$ .
$H(S)=[T(S)]^{-1}$ : Mean holding time in $S$ .
$b_{R}(S)$ : Branching probability from $S$ to $R$ .
$p(R)=\sum_{S\in R^{-}}p(S)b_{R}(S)$ : Path probability to $R$ .
$D(R)=\sum_{S\in R^{-}}p(S)b_{R}(S)D(S)+H(R)$ : Mean delay to complete state $R$ .
$C$ : Completion time of all tasks.

Path probability to reach state $S$ :

\displaystyle p(S)=\sum_{R\in S^{-}}p(R)b_{S}(R).

(14)

The mean delay to the completion of state $S$ , weighted by the path probabilities:

\displaystyle D(S)=H(S)+\sum_{R\in S^{+}}p(R)b_{S}(R)D(R).

(15)

Initiation time of $\tau_{i}$ is a weighted sum of the delays of the states $S$ leading to a state $R$ at which it is activated:

\displaystyle\mbox{Init}_{i}=\sum_{(S\in R^{-})\land(\tau_{i}\notin S)\land(\tau_{i}\in R)}p(S)D(S).

(16)

The completion time of $\tau_{i}$ , which completes at $S$ leading to a state $R$ that does not include $\tau_{i}$ , is:

\displaystyle Comp_{i}=\sum_{(S\in R^{-})\land(\tau_{i}\in S)\land(\tau_{i}\notin R)}p(S)D(S).

(17)

Unnormalized state probabilities are computed level by level by setting the probability of the initial state $S=\{\tau_{0}\}$ to one.

\displaystyle P(R)=\sum_{S\in R^{-}}T(S)b_{R}(S)P(S).

(18)

The state probabilities are normalized by

\mbox{NormConstant}=\sum_{\forall S}P(S).

The execution time of $\tau_{i}$ is $Exec_{i}=Comp_{i}-Init_{i}$ . Alternatively, it is the completion time $C$ times the sum of the probabilities of the states in which the task is executing:

\displaystyle E_{i}=C\sum_{S\in{S}_{i}}P(S).

(19)

Each entry in the CTMC is represented as:

\left[P(S);p(S);D(S);T_{i}(S),\tau_{i}\in|S|;T(S);H(S);b_{R}(S),R\in S^{+}\right]

Procedure for Performance Analysis of Task System

Input: Set of $I$ tasks, precedence relationships and service demands at the $N$ devices of a multiprogrammed computer system:

X_{i,n},1\leq i\leq I,1\leq n\leq N.

Given $S=\{\tau_{0}\}$ set $H(S)=0$ , $P(S)=1$ , $p(S)=1$

for levels $\ell=0$ to $L+1$ do

for states $S\in{\cal S}_{\ell}$ do

Given that the completion rate of $\tau_{i}\in|S|$ is $T_{i}(S)$

T(S)=\sum_{\tau_{i}\in|S|}T_{i}(S)\mbox{ hence }H(S)=1/T(S)

Determine all successor states to ${\cal S}_{\ell}$ and merge with the set of previously created states at $L_{\ell+1}$ .

{\cal R}_{\ell+1}={\cal R}_{\ell+1}\cup R

Obtain probability of reaching state $R$ via $S$ : $p(R)=p(S)\times b_{R}(S)$

Completion of $\tau_{i}$ at $S$ leads to $R$ with probability: $b_{R}(S)=T_{i}(S)/T(S)$ .

Path probability: $p(R)=p(R)+\sum_{S\in R^{-}}p(S)b_{R}(S)$

D(R)=D(R)+\sum_{R\in S^{-}}p_{R}(S)\times D(S)

Add to $\mbox{Init}_{i}$ tasks $\tau_{i}$ activated at this level using Eq. 16.

Update the completion time of a task $\tau_{i}$ at this level using Eq. 17.

Obtain the steady state probability of $P(R)$ using Eq. (18).

Update normalization constant for state probabilities $\mbox{Norm\_Constant}=\mbox{Norm\_Constant}+P(R)$

end /* all states $S$ in level $\ell$ */

end /* level $L_{\ell}$ */

Normalize state probabilities
$P(S)=P(S)/\mbox{Norm\_Constant }\hskip 5.69054pt\forall{S}$ .
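A simplified Python sketch of the level-by-level computation is given below. It is not the full procedure above: it keeps only the path probabilities $p(S)$ and holding times $H(S)$ and returns the makespan as $\sum_{S}p(S)H(S)$, which is the expected time to absorption of an acyclic CTMC, without the bookkeeping for $Init_{i}$, $Comp_{i}$ and $P(S)$. The callback `rate(i, S)` is a hypothetical stand-in for the throughputs $T_{i}(S)$ obtained from the lower-level QN model, and only one level is held in memory at a time.

```python
def makespan(preds, tasks, rate):
    """Mean makespan of a task system, one level of completions at a time.

    preds[t]: set of immediate predecessors of task t (dummy tasks omitted).
    rate(i, S): completion rate T_i(S) of task i when taskset S executes.
    """
    def executing(done):
        return frozenset(t for t in tasks
                         if t not in done and preds.get(t, set()) <= done)

    level = {frozenset(): 1.0}          # completed set -> path probability p(S)
    total = 0.0
    while level:
        nxt = {}
        for done, prob in level.items():
            S = executing(done)
            if not S:
                continue                # all tasks completed: absorbing state
            T = {i: rate(i, S) for i in S}
            T_S = sum(T.values())
            total += prob / T_S         # contribution p(S) * H(S)
            for i, T_i in T.items():    # branching probability T_i(S) / T(S)
                R = done | {i}
                nxt[R] = nxt.get(R, 0.0) + prob * T_i / T_S
        level = nxt
    return total


# Hypothetical check: two independent tasks, each completing at rate 1,
# give E[max of two Exp(1)] = 1/2 + 1 = 1.5.
c_par = makespan({}, [1, 2], lambda i, S: 1.0)
```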

Given the solution of the computer system model, state probabilities can be used to determine the mean device utilizations across all states.

Two numerical examples validated by simulation are provided in [59].

4 Simulation at Higher and Analysis at Lower Level

Hierarchical simulation is a more flexible method than building and solving a higher level CTMC for the analysis of task system performance. It is computationally more expensive, since using the batch method the simulation has to be repeated to obtain confidence intervals at an acceptably high level Welch 1983 [69].

The method is specified in the context of performance analysis of a timesharing system with two sets of users generating requests Sauer 1981 [49]. The analysis is repeated in Thomasian and Gargeya 1984 [57].

The first (resp. second) set of users are at $L_{1}$ (resp. $L_{2}$ ) terminals, which generate small class $C_{1}$ (resp. large class $C_{2}$ ) requests. The think times at the terminals are exponentially distributed with means $Z_{1}$ and $Z_{2}$ . The maximum MPLs for processing the $C_{1}$ and $C_{2}$ job classes are $M_{1}$ and $M_{2}$ (denoted $K_{1}$ and $K_{2}$ in Table 2). The parameter settings used in experiments are based on Table 2 in Sauer 1981 [49], which is repeated in Table 2.

Case	1	2	3	4	5	6	7	8	9
$L_{1}$	20	20	20	30	30	30	40	40	40
$L_{2}$	2	2	2	3	3	3	4	4	4
$K_{1}$	4	3	1	7	5	2	14	9	5
$K_{2}$	2	1	1	2	1	1	4	3	1

Table 2: The nine cases considered in this study.

Requests are processed at the CPU with the PS discipline and access four FCFS disks with exponential service times, selected with uniform probabilities. Think times and device service times are given in milliseconds as:

Z_{1}=5{,}000,\quad X_{CPU}^{1}=100,\quad X^{1}_{Disk_{i}}=87.5,\ 1\leq i\leq 4

Z_{2}=100{,}000,\quad X_{CPU}^{2}=2{,}000,\quad X^{2}_{Disk_{i}}=175,\ 1\leq i\leq 4

The closed QN, including the terminals, would be product-form if $M_{j}\geq L_{j},j=1,2$ , so that there would be no blocking due to MPL constraints. A hierarchical solution method is required since realistically $M_{j}<L_{j}$ , which implies an extra delay in the memory queue.

The computer system is substituted with an FESC with the throughput characteristic for two classes:

T_{j}(k_{1},k_{2}),\hskip 14.22636pt0\leq k_{j}\leq K_{j},\hskip 14.22636ptj=1,2

There are $\prod_{j=1}^{2}(N_{j}+1)$ states in the two-dimensional CTMC, since $0\leq k_{j}\leq N_{j},j=1,2$ . At state $(k_{1},k_{2})$ class $C_{j}$ job requests are issued at rate

\Lambda_{j}=(L_{j}-k_{j})/Z_{j},\quad j=1,2

and the throughputs obtained by solving the QN of the computer system are:

T_{j}(k_{1},k_{2}),\hskip 5.69054pt0\leq k_{j}\leq K_{j},\hskip 5.69054ptj=1,2.

The CTMC can also be built for an infinite source model, i.e., by specifying the fraction of classes in the Poisson arrival stream. The number of states in the CTMC should be set to be sufficiently large, so that the probability of blocking due to finite capacity for the given arrival rate is negligibly small. The set of linear equations to obtain the CTMC’s steady state probabilities can be solved using the Gauss-Seidel iterative method Stewart 2009 [52].
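The throughputs $T_{j}(k_{1},k_{2})$ used above could be obtained, for example, with exact multiclass MVA. The following Python sketch assumes fixed-rate queueing stations (PS, or FCFS with class-independent exponential service times per visit) and treats the per-device values listed earlier as loadings per job; both are assumptions made for the example, and the function name is hypothetical.

```python
from itertools import product


def multiclass_mva(demands, K):
    """Exact multiclass MVA for fixed-rate queueing stations.

    demands[j][n]: loading of class j at station n; K[j]: maximum MPL of class j.
    Returns {(k_1, ..., k_J): [T_1, ..., T_J]} for all 0 <= k_j <= K[j].
    """
    J, N = len(demands), len(demands[0])
    zero = tuple([0] * J)
    queue = {zero: [0.0] * N}         # mean queue lengths per population vector
    thr = {zero: [0.0] * J}
    # Lexicographic order guarantees that k - e_j is solved before k.
    for k in product(*(range(kj + 1) for kj in K)):
        if k == zero:
            continue
        X, q = [0.0] * J, [0.0] * N
        for j in range(J):
            if k[j] == 0:
                continue
            prev = queue[k[:j] + (k[j] - 1,) + k[j + 1:]]
            r = [demands[j][n] * (1.0 + prev[n]) for n in range(N)]
            X[j] = k[j] / sum(r)      # class-j throughput at population k
            for n in range(N):
                q[n] += X[j] * r[n]
        queue[k], thr[k] = q, X
    return thr


# Case 1 of Table 2 (K_1 = 4, K_2 = 2), values in milliseconds.
demands = [[100.0, 87.5, 87.5, 87.5, 87.5],
           [2000.0, 175.0, 175.0, 175.0, 175.0]]
thr = multiclass_mva(demands, [4, 2])
```

The dictionary `thr` then plays the role of the FESC throughput characteristic for every composition of active requests.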

Procedure: Hierarchical Simulation of Timesharing System

1: Input simulation parameters.

Number of classes of timesharing users: $J=2$ .
Number of users or terminals in each class: $L_{j},j=1,J$ .
Maximum MPL in each class $K_{j},j=1,J$ .
Number of active terminals/users in two classes $N_{j}=L_{j},j=1,J$ .
Think times and job service demands at the CPU and four disks.
Percent Confidence Interval - CI (say $\pm 5\%$ ) of job response times desired about the mean at a given Confidence Level - CL (say 95%) using the batch means method [69].
Settings such as $NCompTarget_{j}=10{,}000,j=1,2$ and $NumBatches=10$ need to be adjusted to meet the CL and CI targets.

2: Solve lower level model.

Obtain throughputs by solving closed QN for all request compositions in two classes.

T_{j}(k_{1},k_{2})\hskip 5.69054pt0\leq k_{j}\leq K_{j}\hskip 5.69054ptj=1,2

3: Higher Level Discrete-Event Simulation.

(a) Initialization.

$Clock=0$ . /* Simulation Clock */
$BatchCtr=0$ ; /* Batch means method */
for class $C_{j},j=1,2$ do
$Count_{j}=0$ /* #arrivals per class */
$N_{j}=L_{j}$ /* initialize # of thinking users */
Sample $IntArvlTime$ from Exp( $N_{j}/Z_{j}$ )
/* Set arrival time */
$ArvlTime_{j}=Clock+IntArvlTime$
/* Set departure times for all requests */
$DepartTime_{j}^{k}=\infty,1\leq k\leq K_{j}$
/* Sum class $C_{j}$ response times */
$SumResp_{j}=0$
/* # of completed jobs in $C_{j}$ */
$NComp_{j}=0$
end do /* do j */

(b) Scheduling the next event.

1. Determine the most imminent event from the ( $N_{1}+N_{2}+J\times K$ ) possibilities:
$Next=\min[ArvlTime_{j},DepartTime_{j}^{k}],\quad 1\leq k\leq K_{j},\quad j=1,2.$
If the event is a departure, record the request class and the request’s id: $jn=j$ and $kn=k$ ; otherwise, if it is an arrival, set the arriving job class $jn$ based on the arrival stream.
2. Advance the simulation time: $Clock=Next$ .
3. If the event is an arrival goto (c), else goto (d).

(c) Process an arrival.

1. $Count_{jn}=Count_{jn}+1$ ;
2. /* one request generated at a time */ Sample $IntArvlT_{jn}$ for $C_{jn}$ from Exp( $N_{jn}/Z_{jn}$ ).
3. /* next arrival time */ $ArvlTime_{jn}=Clock+IntArvlT_{jn}$ .
4. If $k_{jn}\geq K_{jn}$ enqueue
$Queue[Count_{jn}]=ArvlTime_{jn}$ ;
else goto (e)

(d) Task completed at $DepartTime^{kn}_{jn}$ .

• $NComp_{jn}++$ /* increment completions */
• $SumResp_{jn}\mathrel{+}=(Clock-Arrival_{kn,jn})$
• if $[(NComp_{1}\geq NCompTarget_{1})\land(NComp_{2}\geq NCompTarget_{2})]$ goto (g), /* run complete */
• $k_{jn}=k_{jn}-1$ /* degree of concurrency */
• $N_{jn}=N_{jn}+1$ . /* active users */
• Obtain $IntArvlT$ from Exp( $N_{jn}/Z_{jn}$ ).
• $ArvlTime_{jn}=Clock+IntArvlT$ .
• If $Queue[jn]$ is nonempty goto (e), else goto (b).

(e) Activate an arriving or waiting task.
Activate $C_{jn}$ task, $k_{jn}++$
$Arrival_{kn,jn}=TaskArvl[Count_{jn}]$
/* Set arrival time */
Sample execution time $E_{jn}^{kn}$ from Exp( $T_{jn}(k_{1},k_{2})$ ).
$DepartTime_{jn}^{kn}=Clock+E_{jn}^{kn}$
goto (b)

(g) BatchCtr++;
$Resp_{j}^{BatchCtr}=\frac{SumResp_{j}}{NComp_{j}},j=1,2$
Exit if $BatchCtr\geq NumBatches$ ,
else goto (a)

The batch means method was used to obtain the mean response times and their CIs at the given CL.
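A compact Python sketch of a higher-level simulation in this spirit follows. It is a simplified variant, not the event-list procedure above: because think times and FESC completions are taken to be exponential, the next event is chosen by a race among exponential rates, the completing job is picked at random among the active jobs of its class, and a single run is made without batch means or confidence intervals. The function and parameter names are hypothetical, and `T(j, k)` stands for the class throughput $T_{j}(k_{1},k_{2})$ supplied by the lower-level model.

```python
import random


def simulate(L, Z, K, T, n_target, seed=1):
    """Higher-level simulation of a timesharing system with MPL constraints.

    L[j]: terminals, Z[j]: mean think time, K[j]: maximum MPL of class j.
    T(j, k): class-j throughput when k = (k_1, ..., k_J) jobs are active.
    Runs until every class has at least n_target[j] completions and
    returns the mean response time per class.
    """
    rng = random.Random(seed)
    J = len(L)
    thinking = list(L)                       # users currently thinking
    active = [[] for _ in range(J)]          # arrival times of active jobs
    waiting = [[] for _ in range(J)]         # arrival times in the memory queue
    done, total, clock = [0] * J, [0.0] * J, 0.0
    while any(done[j] < n_target[j] for j in range(J)):
        k = tuple(len(a) for a in active)
        rates = [thinking[j] / Z[j] for j in range(J)]            # arrivals
        rates += [T(j, k) if k[j] else 0.0 for j in range(J)]     # completions
        clock += rng.expovariate(sum(rates))
        event = rng.choices(range(2 * J), weights=rates)[0]
        j = event % J
        if event < J:                        # a terminal issues a request
            thinking[j] -= 1
            (active if len(active[j]) < K[j] else waiting)[j].append(clock)
        else:                                # a class-j job completes
            arrived = active[j].pop(rng.randrange(len(active[j])))
            total[j] += clock - arrived
            done[j] += 1
            thinking[j] += 1
            if waiting[j]:
                active[j].append(waiting[j].pop(0))   # FCFS admission
    return [total[j] / done[j] for j in range(J)]


# Hypothetical single-class check: no contention, each job completes at rate 1,
# so the mean response time should be close to 1.
resp = simulate(L=[5], Z=[10.0], K=[5], T=lambda j, k: float(k[j]),
                n_target=[20000])
```

Wrapping the call in a loop over independent seeds, or splitting one long run into batches, would be needed to attach confidence intervals to the estimates.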

5 Effect of Transition Probabilities on Task Completion Times

We elaborate on the observation in Section 7.2 of Thomasian 2014 [64] that the distribution of the number of task cycles between CPU and disk processing affects the completion time of a task system with parallelism.

In Section 3.3.3 in Kobayashi 1978 [30] it is stated that routing in QNs need not be governed by a homogeneous first-order Markov chain and that it is the mean number of visits to QN nodes that determines the usual performance metrics; the routing does, however, affect a task’s sojourn time distribution.

The mean completion time of a 2-way F/J task system in a closed QN model is sensitive to the distribution of the number of cycles. A cyclic server model with two devices, a processor and a disk, is postulated. The disk is accessed following CPU processing and, according to the cyclic server model, after disk processing tasks may require additional CPU time or leave the system. Note the difference from the CSM, where jobs complete their processing at the CPU.

Consider the concurrent processing of two tasks whose number of cycles follows a geometric distribution:

p_{n}=(1-p)p^{n-1},n\geq 1\mbox{ with a mean }\bar{n}=1/(1-p).

The number of jobs completed in a time interval approaches a Poisson process when $p\rightarrow 1$ , signifying a large number of cycles. Poisson inter-departure times imply exponentially distributed residence times. It can be shown that the geometrically distributed sum of exponential random variables is also exponentially distributed. The argument based on thinning point processes in Salza and Lavenberg 1981 [46] does not require the per cycle residence times to be exponentially distributed or even i.i.d.
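The claim about geometric sums can be checked with Laplace-Stieltjes transforms. The sketch below assumes that the per-cycle residence times are i.i.d. exponential with rate $\mu$ (a symbol introduced here for illustration) and independent of the number of cycles, and $R$ denotes the total residence time:

```latex
E[e^{-sR}]
 =\sum_{n\geq 1}(1-p)p^{n-1}\left(\frac{\mu}{\mu+s}\right)^{n}
 =\frac{(1-p)\mu/(\mu+s)}{1-p\mu/(\mu+s)}
 =\frac{(1-p)\mu}{s+(1-p)\mu}.
```

This is the transform of an exponential distribution with rate $(1-p)\mu$ , i.e., with mean $\bar{n}/\mu$ .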

Consider the processing of two possibly heterogeneous tasks, $\tau_{1}$ and $\tau_{2}$ , which are processed concurrently at a computer system at state $S_{1,2}(\tau_{1},\tau_{2})$ . Given that $T_{i}(S_{1,2})$ is the completion rate of $\tau_{i},i=1,2$ , the mean holding time for state $S_{1,2}$ is:

H(S_{1,2})=[T_{1}(S_{1,2})+T_{2}(S_{1,2})]^{-1}.

The completion of $\tau_{i}$ in $S_{1,2}$ leads to $S_{j}=\{\tau_{j\neq i}\}$ , and the subsequent completion of $\tau_{j}$ completes the 2-way F/J task system.

Assuming that the state holding times are exponentially distributed, the probability that $\tau_{i}$ completes first is:

p_{i}=T_{i}(S_{1,2})H(S_{1,2}),i=1,2.

The mean time to complete the two tasks is then:

C_{geo}=H(S_{1,2})+p_{1}H(S_{2})+p_{2}H(S_{1}).

Consider two tasks whose number of cycles is given as follows. Case 1: geometrically distributed with mean $\bar{n}=5$ . Case 2: fixed with $n=5$ . The service demand per cycle is $\bar{x}_{c}=\bar{x}_{d}=\bar{x}=1$ at the CPU and the disk in both cases.

Given balanced service demands, the mean task residence time per cycle with $N=2$ devices and $K=2$ tasks is:

r(K)=(N+K-1)\bar{x}=3

according to Lazowska et al. [34] and Thomasian 2023 [65]. Due to the memoryless property of the geometric distribution the completion time for the two tasks in the two cases is:

C_{geo}=\bar{n}\times r(2)+\bar{n}\times r(1)=5\times 3+5\times 2=25.

C_{fixed}=n\times 3=15.

We next consider the parallel processing of two tasks $\tau_{1}$ and $\tau_{2}$ whose numbers of cycles are geometrically distributed with different means, $\bar{n}_{1}=5$ and $\bar{n}_{2}=10$ . Setting the time per visit at the processor $\bar{x}_{c}=1$ and the disk $\bar{x}_{d}=1$ yields the service demands $X_{C}^{1}=X_{D}^{1}=5$ for $\tau_{1}$ and $X_{C}^{2}=X_{D}^{2}=10$ for $\tau_{2}$ . Task throughputs when the tasks are processed together can be obtained by solving the corresponding QN model:

T_{1}(S_{1,2})=1/15\mbox{ and }T_{2}(S_{1,2})=1/30.

The mean holding time is then

H(S_{1,2})=[T_{1}(S_{1,2})+T_{2}(S_{1,2})]^{-1}=10.

The probability that $\tau_{1}$ or $\tau_{2}$ finishes first is given as:

\begin{cases}p_{1}=T_{1}(S_{1,2})H(S_{1,2})=(1/15)10=2/3\\ p_{2}=T_{2}(S_{1,2})H(S_{1,2})=(1/30)10=1/3.\end{cases}

The mean residence times of the tasks when each is processed alone, which serve as the holding times $H(S_{1})$ and $H(S_{2})$ , are:

R(S_{1})=X_{C}^{1}+X_{D}^{1}=10\mbox{ and }R(S_{2})=X_{C}^{2}+X_{D}^{2}=20.

It follows that the completion time for the geometric distribution is:

C_{geo}=10+\frac{2}{3}(20)+\frac{1}{3}(10)\approx 26.67.
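The two-state computation for geometrically distributed cycles can be sketched in a few lines. The function name `two_task_completion_time` is hypothetical; the inputs are the joint completion rates $T_{1}(S_{1,2}),T_{2}(S_{1,2})$ and the stand-alone residence times $R(S_{1}),R(S_{2})$ given in the text, and exponentially distributed holding times are assumed.

```python
def two_task_completion_time(t1, t2, r1_alone, r2_alone):
    """Mean completion time of a 2-way F/J task system.

    t1, t2: completion rates of tau_1 and tau_2 when processed together.
    r1_alone, r2_alone: mean residence times when each task runs alone.
    Assumes exponentially distributed state holding times.
    """
    hold = 1.0 / (t1 + t2)   # H(S_{1,2})
    p1 = t1 * hold           # probability tau_1 finishes first
    p2 = t2 * hold           # probability tau_2 finishes first
    # If tau_1 finishes first, tau_2 continues alone, and vice versa.
    return hold + p1 * r2_alone + p2 * r1_alone


if __name__ == "__main__":
    c_geo = two_task_completion_time(1 / 15, 1 / 30, 10.0, 20.0)
    print(round(c_geo, 2))  # 26.67
```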

When the number of cycles is fixed at $n_{1}=5$ and $n_{2}=10$ , the two tasks share the CPU and disk for $\mbox{min}(n_{1},n_{2})=5$ cycles. The mean residence time per cycle for the balanced QN with $K=2$ tasks and $N=2$ devices is given in [34]; see also Thomasian 2023 [65]:

r(K)=(K+N-1)\overline{x}=3.

After $\tau_{1}$ completes, $\tau_{2}$ has five remaining cycles, where each cycle takes $r(1)=2$ time units. The completion time is:

C_{fixed}=H(S_{1,2})+H(S_{2})=n_{1}\times 3+(n_{2}-n_{1})\times 2=25.

In counting the number of remaining cycles we have in effect used the hybrid simulation method. The fact that the completion time of F/J and parallel task systems in general is sensitive to the distribution of the number of cycles raises the need for alternative methods to estimate the completion time of task systems.

6 Hybrid Simulation Method

Hybrid simulation is a hierarchical simulation method proposed by Schwetman 1978 [51], which was applied to the CSM QN model with a single job class. Tasks are specified by the number of required cycles and the loadings per cycle. The degree of concurrency is required for the analysis.

Tasks are specified by service demands per cycle and the initial number of required cycles. The number of remaining cycles of the $i^{th}$ task $\tau_{i}$ , denoted $CR[i]$ , is updated based on the elapsed time $X=Next-Clock$ , after which $Clock=Next$ , where

Next=\mbox{min}\{ArvlTime,Clock+CT[n]\times\min_{1\leq i\leq n}CR[i]\}.

$CT[n]$ is the mean cycle time at degree of concurrency $n$ , computed by solving the underlying QN model, and $CR[i]=CR[i]-X/CT[n]$ . As txns arrive and depart, the degree of txn concurrency is updated as $n=n\pm 1$ and $CT[n]$ for the active tasks is recomputed.
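The update rule can be sketched as a small event loop. This is a hedged illustration, not Schwetman's implementation: the function name `hybrid_simulation` is hypothetical, and the balanced-network cycle time $CT[n]=(N+n-1)\bar{x}$ is assumed in place of solving a general QN model.

```python
def hybrid_simulation(arrivals, num_devices=2, x_per_visit=1.0):
    """arrivals: list of (arrival_time, cycles), sorted by arrival time.

    Returns the completion time of each task, in arrival order.
    """
    def cycle_time(n):
        # Assumption: balanced QN, so CT[n] = (N + n - 1) * x per cycle.
        return (num_devices + n - 1) * x_per_visit

    clock = 0.0
    pending = list(enumerate(arrivals))  # tasks that have not yet arrived
    remaining = {}                       # task index -> remaining cycles CR[i]
    done = {}
    while pending or remaining:
        n = len(remaining)
        next_arrival = pending[0][1][0] if pending else float("inf")
        if n > 0:
            ct = cycle_time(n)
            first = min(remaining, key=remaining.get)
            next_completion = clock + ct * remaining[first]
        else:
            next_completion = float("inf")
        nxt = min(next_arrival, next_completion)
        elapsed = nxt - clock            # X = Next - Clock
        if n > 0:
            for i in remaining:          # CR[i] = CR[i] - X / CT[n]
                remaining[i] -= elapsed / ct
        clock = nxt
        if next_completion <= next_arrival:
            del remaining[first]         # departure: n = n - 1
            done[first] = clock
        else:
            i, (_, cycles) = pending.pop(0)
            remaining[i] = float(cycles)  # arrival: n = n + 1
    return [done[i] for i in range(len(arrivals))]


if __name__ == "__main__":
    # Two tasks with 5 and 10 fixed cycles, both present at time zero.
    print(hybrid_simulation([(0.0, 5), (0.0, 10)]))  # [15.0, 25.0]
```

With both tasks present at time zero the second completion time agrees with the fixed-cycle calculation of Section 5.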

The hybrid simulation method was extended to multiple job classes in Thomasian 1987 [60], which deals with dynamic load balancing in a distributed system. Simply stated, it is better to process I/O bound jobs together with CPU bound jobs, and it is not just the remaining number of job cycles that matters.

The generalization of hybrid simulation proposed in Thomasian and Bay 1983 [56] does not require cyclic processing and can be applied to multiple job classes. Rather than quantizing task processing times we may modify service demands according to Eq. (20). Given that $\tau_{i}$ starts its execution with loading $X_{d}^{i}$ at device $d$ and that its mean residence time obtained by solving the closed QN model for a given task composition is $R_{i}$ , the residual service demand after elapsed time $t$ is:

\displaystyle\boxed{X_{d}^{i}=X_{d}^{i}(1-t/R_{i}).}

(20)

Task systems as in [59] can be dealt with by hybrid simulation as follows. Referring back to the example in Section 3, consider tasks $\tau_{1},\tau_{2},\tau_{4}$ , which start processing concurrently. The service demands at the CPU and Disks 1 and 2 are $(420,400,400)$ for $\tau_{1},\tau_{2},\tau_{5},\tau_{6}$ and $(620,600,600)$ for $\tau_{3},\tau_{4}$ , and the number of cycles made by each task is fixed. Analysis of the QN model processing $\tau_{1}$ , $\tau_{2}$ , $\tau_{4}$ concurrently yields the mean residence times $R_{1}=R_{2}<R_{4}$ . When $\tau_{1}$ and $\tau_{2}$ complete, $\tau_{4}$ ’s residual loadings are obtained by multiplying its loadings by $(1-R_{2}/R_{4})$ .
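The scaling in Eq. (20) can be sketched as follows. The function name `residual_demands` is hypothetical, and the residence times used in the example are made-up placeholders, since the actual values of $R_{2}$ and $R_{4}$ come from solving the QN model.

```python
def residual_demands(demands, elapsed, residence_time):
    """Scale per-device service demands by (1 - t/R), as in Eq. (20)."""
    if not 0.0 <= elapsed <= residence_time:
        raise ValueError("elapsed time must lie in [0, residence_time]")
    factor = 1.0 - elapsed / residence_time
    return tuple(x * factor for x in demands)


if __name__ == "__main__":
    # tau_4 loadings at (CPU, Disk 1, Disk 2) from the text.
    tau4 = (620.0, 600.0, 600.0)
    # Hypothetical residence times, NOT results from the paper.
    r2, r4 = 3000.0, 4000.0
    print(residual_demands(tau4, r2, r4))  # (155.0, 150.0, 150.0)
```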

The modified hybrid simulation method is easier to implement and less costly than the method described in Section 3. In fact this method may be more accurate than the method based on decomposition that postulates exponentially distributed completion times. This is especially so when the number of job cycles is not geometrically distributed.

With $n$ identical tasks whose number of cycles is uniformly distributed over $(0,x)$ , the expected minimum number of cycles to completion is: $r_{min}=x/(n+1)$ .
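The value $x/(n+1)$ follows from the tail-probability formula for the expectation, assuming that the $n$ cycle counts are independent and treated as continuous random variables:

```latex
E[r_{min}]=\int_{0}^{x}P(r_{min}>u)\,du
 =\int_{0}^{x}\left(1-\frac{u}{x}\right)^{n}du
 =\frac{x}{n+1}.
```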

Batch jobs undergo different processing phases with different loadings from phase to phase, e.g., (i) loading and preprocessing of data, (ii) computation, (iii) visualization processing. Rather than dealing with a single set of task loadings, we assume that the loadings associated with the three phases, designated as $\tau_{1}\rightarrow\tau_{2}\rightarrow\tau_{3}$ , are known. The residence time of the batch job is the sum of the residence times of the three tasks. The discussion is simplified by assuming that the computer also processes transactions whose intensity remains fixed during the execution of the batch job. Programs should be instrumented to signal changes of phase for measurement purposes, i.e., to report per-phase loadings based on variation in resource consumption. Phases accessing bottleneck resources will have increased residence times, while other phases will have shorter residence times. Experiments are required to determine whether considering phases yields significant differences. There is also the possibility of parallel processing via multitasking.

7 Related Work

Given that many computations exhibit parallelism, several parallel models of computation have been developed since the early days of computing. The early task system model based on a dag was described in Coffman and Denning 1973 [15].

More complex relations in parallel processing can be represented by Petri nets, Peterson 1981 [40]. The so-called UCLA graph model proposed in Martin 1966 [35] can be transformed into a Petri net, but the proof of the equivalence of the two models by Kim Gostelow is flawed according to [40].

A major extension to Petri nets was the Timed Petri Net - TPN model developed by several researchers including Molloy 1982 [39], which led to workshops on TPNs starting in 1985. Generalized Stochastic Petri Nets - GSPNs allow immediate and exponentially distributed transitions, Ajmone-Marsan et al. 1984 [3]. GSPNs are translatable to CTMCs, which can then be solved using the usual methods, Stewart 2009 [52]. Further examples of this modeling approach are given in Ajmone-Marsan et al. 1995 [4].

Research Queueing Package - RESQ is a software tool for constructing and solving QN models. RESQ under a different name was imported to IBM Research from Univ. of Texas at Austin Sauer and MacNair 1983 [50]. It is also described in Sauer et al. 1977 [48]. RESQ provides a high level language for describing models also in hierarchical fashion.

Job execution in closed QN models allowing parallelism is considered in Heidelberger and Trivedi 1982 [22]. A job spawns two or more tasks at some point during its execution, which execute independently of one another and do not require synchronization. An approximate solution method is developed and the results of the approximation are compared to those of simulations. Bounds on the performance improvement due to overlap are derived.

The same authors in 1983 [23] consider parallel tasks, which wait at the end of their execution for all of their siblings to finish execution. Two approximate solution methods are developed and compared with simulations. The approximations are computationally efficient and highly accurate. A single instance of a task system executing in a multiprogrammed computer is considered in Thomasian and Bay 1986 [59].

Concurrency in parallel processing systems is the topic of Kung 1984 [32]. Jobs are modeled as dags whose nodes represent separate tasks. Four variations are considered: (1) jobs available at time zero or Poisson arrivals; (2) fixed or random dags; (3) constant or exponentially distributed task service times; (4) a fixed or infinite number of jobs. Algorithm 1 minimizes the expected time to complete all jobs, while Algorithm 2 maximizes processor utilization.

An algorithm to search for module assignments and replications to reduce task response times is explored by Chu and Leung [13]. The objective function is the sum of task response time and a delay penalty for violations of thread response time requirements. The PS queueing discipline is used in this study, so that with Poisson arrivals the output stream is Poisson and the nodes can be analyzed separately when there are no synchronization delays.

Chu et al. [14] propose two submodels for estimating task response times in distributed systems with resource contention. The first submodel is an extended QN to obtain module response times, which is solved by a decomposition technique to reduce computational cost by 2-3 orders of magnitude with respect to a direct approach.

The second submodel is a weighted control-flow graph model from which task response time can be obtained by aggregating module response times in accordance with precedence relationships. Task response times estimated by the analytic model compare closely with simulation results. The model can be used to study the tradeoffs among module assignments, scheduling policies, interprocessor communications, and resource contention in distributed processing systems.

Response time is affected by interprocessor communications, precedence relationships, module assignments, hardware resource and data resource contention, and processor scheduling policies. A task response time model that considers all of these factors is proposed. A Petri net is used to represent resource contention, and the task control flow graph represents precedence relationships. A QN with resource contention is used to estimate the response time of each module. Module response time consists of delays at the processors and resource queues and is estimated by approximating the extended QN as independent finite capacity QNs. The module response time is mapped onto a control flow graph, and task response time is obtained by aggregating the module response times in accordance with their precedence relationships in the control flow graph. The task response times derived from the analytical model were validated against simulation results.

A modeling methodology for evaluating the execution of parallel programs containing looping constructs by estimating the average execution time of such a program in a distributed, multicomputer environment is proposed in Kapelnikov et al. [26, 27]. A combination of QN analysis and graph models of program behavior is considered in these studies. Complex programs are first decomposed into program segments, which are analyzed independently. Combined results produce an approximate solution for the whole program.

Task graphs represent parallel programs with dags, which are specified as a 4-tuple {T, P, A, E} by Menasce and Barroso 1992 [37].

• ${\cal\bf T}=\{\tau_{1},\tau_{2},\dots\}$ is the set of tasks of a parallel program.
• ${\cal\bf P}$ is the precedence relationship among tasks: ${\tau_{j}}$ can be activated when all of its predecessors $\tau_{i}$ in ${\cal\bf P}$ complete their execution.
• A is an allocation function for tasks to processors, $\tau_{i}\rightarrow P_{j}$ .
• E determines the execution time based on processor speeds. Similarly to [59], tasks are specified by their service demands on a computer system. The execution time depends on the task mix and hence can be determined by solving the QN.

A CTMC based technique to obtain the execution time of a task graph in a multiprogrammed computer system based on Thomasian and Bay 1986 [59] is reported by Menasce et al. in [38]. The use of the static processor assignment policy called Largest Task First Minimum Finish Time - LTFMFT shows that it is very sensitive to the degree of heterogeneity of the architecture, and that it outperforms all other policies analyzed.

Three dynamic assignment disciplines are compared and it is shown that in heterogeneous environments, the disciplines that perform better are those that consider the structure of the task graph and not only the service demands of the individual tasks. The performance of heterogeneous architectures is compared with cost-equivalent homogeneous ones taking into account different scheduling policies. Static and dynamic processor assignment disciplines are compared in terms of performance.

8 Conclusion and Further Work

Hierarchical modeling is a useful tool in developing approximate analyses when the system does not lend itself to a direct solution, or in reducing the cost of analysis or simulation by replacing a detailed model of a computer system with an FESC.

Several instances of hierarchical analysis are discussed in this paper with the goal of reducing the solution cost: an efficient solution of a CTMC for a task system and a simulation of a timesharing system with two job classes with MPL constraints. In both cases tasks are processed on a multiprogrammed computer system representable as a product-form QN.

When the tasks of a subtask system execute at the devices of independent computer systems, the completion times of the subtask systems may be used to determine the overall completion time. The data transmission delay among nodes is the ratio of message length to data transmission rate, assuming queueing delays are negligible.

In addition to the mean completion time the variance of completion times is of interest. With the assumption that the holding time in each state of the CTMC is exponentially distributed the variance of completion time is the variance of first passage time from the initial to final state in the CTMC. The variance of first passage times in discrete-time Markov chains is derived in Hunter 2006 [24].
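For an acyclic CTMC, as arises when tasks only complete and never restart, the mean and variance of the first passage time can be obtained by a recursion over the states. The sketch below is an illustration under that acyclicity assumption; the function name `completion_time_moments` and the state labels are hypothetical, and the rates are those of the two-task example of Section 5.

```python
from functools import lru_cache


def completion_time_moments(rates, start, final):
    """Mean and variance of the first passage time in an acyclic CTMC.

    rates: dict mapping state -> {next_state: transition rate}.
    Uses E[(H + T')^2] = E[H^2] + 2 E[H] E[T'] + E[T'^2], where the
    exponential holding time H is independent of the remaining time T'.
    """
    @lru_cache(maxsize=None)
    def moments(state):
        if state == final:
            return 0.0, 0.0
        out = rates[state]
        q = sum(out.values())   # total exit rate of this state
        m1 = 1.0 / q            # E[H]
        m2 = 2.0 / q ** 2       # E[H^2]
        for nxt, rate in out.items():
            p = rate / q        # probability of this transition
            n1, n2 = moments(nxt)
            m1 += p * n1
            m2 += p * (n2 + 2.0 * n1 / q)
        return m1, m2

    mean, second = moments(start)
    return mean, second - mean ** 2


if __name__ == "__main__":
    # S12: both tasks active; S1 (S2): only tau_1 (tau_2) remains.
    rates = {
        "S12": {"S2": 1 / 15, "S1": 1 / 30},
        "S1": {"done": 1 / 10},
        "S2": {"done": 1 / 20},
    }
    mean, var = completion_time_moments(rates, "S12", "done")
    print(round(mean, 2), round(var, 2))  # 26.67 422.22
```

A CTMC with cycles would instead require solving linear equations for the first two moments.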

Rather than considering all tasks in the distributed computer system at once, as done in [59], the mean and variance of the completion time of the task subsystems can be determined separately for each computer system and then used to determine the overall completion time.

Rather than the six-task system in Section 3, consider two subtask systems with three tasks each:
$\{\tau_{1},\tau_{2},\tau_{3}\}$ and $\{\tau_{4},\tau_{5},\tau_{6}\}$ .
The two subtask systems are executed independently at two identical computer systems. The mean makespan of the task system can be determined by obtaining the mean and variance of the completion time of each subtask system.

Assuming a normal distribution, a formula for the expected value of the maximum of three such random variables is given in Dasgupta 2023 [17]. In the case of nine random variables a two-step computation can be used, e.g., by computing the maximum of $X_{1:3}^{max},X_{4:6}^{max},X_{7:9}^{max}$ .

According to David and Nagaraja 2003 [18], the expected value of the maximum of $n$ i.i.d. random variables with mean $\mu_{X}$ and standard deviation $\sigma_{X}$ , here the components of an F/J request, is given as:

\displaystyle\overline{X}_{n}^{max}\approx\mu_{X}+\sigma_{X}G(n).

(21)
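As one concrete instance of Eq. (21), consider exponentially distributed components, for which $\mu_{X}=\sigma_{X}$ and the expected maximum of $n$ i.i.d. variables is $\mu_{X}H_{n}$ , with $H_{n}$ the $n$ th harmonic number, so that $G(n)=H_{n}-1$ holds exactly. The sketch below covers only this exponential case, which is an assumption made for illustration; the function name is hypothetical.

```python
def expected_max_exponential(mean, n):
    """Expected maximum of n i.i.d. exponential variables with the given mean.

    Uses Eq. (21) with G(n) = H_n - 1, which is exact for the exponential
    distribution, where the standard deviation equals the mean.
    """
    harmonic = sum(1.0 / k for k in range(1, n + 1))  # H_n
    g = harmonic - 1.0
    sigma = mean  # exponential distribution: sigma_X = mu_X
    return mean + sigma * g


if __name__ == "__main__":
    print(expected_max_exponential(1.0, 2))  # 1.5
```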

A simulation-based method to replace disks using the Shortest Access Time First - SATF scheduling method with an FESC is discussed in [63]. Given $p$ pending requests the mean service time is reduced by a factor of $p^{1/5}$ , i.e., the service time is halved for $p=32$ random requests. This is an example of using simulation at the lower level to develop an analytic formula for disk service time.

The method developed in [59] is incorporated in the SHARPE reliability and performance modeling package at Duke University.
https://trivedi.pratt.duke.edu/

Hierarchical modeling in the context of reliability and availability engineering is discussed in Chapter 16 in Trivedi and Bobbio 2017 [67].

The discussion is applicable to queueing analysis of communication networks and manufacturing systems Buzacott and Shanthikumar 1993 [7].

Appendix: Equilibrium Point Approximation

A multilevel analysis method of dynamic locking is used to determine txn response times in Ryu and Thomasian 1990 [44]. The analysis takes into account hardware resource contention and data contention, the latter due to lock contention. Txns encountering a lock conflict are blocked and those whose lock requests lead to a deadlock are aborted and restarted. Realistically the txn holding the fewest resources should be aborted, as in the case of the Wait-depth Limited - WDL policy [21]. In realistic models the probability of deadlock is negligibly small.

Txns arrive according to a Poisson process. The number of activated txns ( $V$ ) is restricted by the maximum multiprogramming level $(W)$ , i.e., $V=\mbox{min}(A,W)$ , where $A$ is the number of txns in the system. Txns making successful lock requests continue their execution, while txns making conflicting lock requests are blocked, so that $J$ txns are active and $V-J$ txns are blocked.

Txns causing a deadlock are aborted and restarted so that the number of active txns remains the same. In fact deadlocks are rare [61] and have a negligible effect on performance, so their effect was ignored in further studies, Thomasian 1993 [62]. Blocked txns are activated when a txn completes or is aborted, releasing all of its locks. The lock and hardware resource contention models are used to determine the txn throughput $\mu(V),1\leq V\leq W$ , which can then be used in conjunction with the arrival process to determine the mean txn response time.

To compute the effective system throughput $\mu(V)$ we need to compute the steady-state probabilities of a Markov chain $P_{T}(J),1\leq J\leq V$ . The transition from state $J$ to $I$ is determined by the events upon the completion of a txn step. The probabilities for these events are determined at the completion time of txn steps.

\mu(V)=\sum_{J\leq V}P_{S}(J|V)t(J)\mbox{ for }1\leq V\leq W.

An alternative solution method obviates the need to compute the state probabilities for $1\leq J\leq V$ and reduces the solution cost of lower levels by a factor of $\approx V/\mbox{log}_{2}(V)$ .

$A(J)$ which is the mean of the difference $I-J$ at completion instants can be computed as follows:

\displaystyle A(J)=\sum_{I=J-1}^{V}(I-J)P_{tran}(J,I|V)

(22)

With $\bar{J}=\sum_{J=1}^{V}JP_{S}(J|V)$ denoting the mean number of active txns when the system is in equilibrium, we have

\displaystyle A(J)=0\mbox{ for }J=\bar{J}

(23)

$A(J)$ is positive (resp. negative) when $J<\bar{J}$ (resp. $J>\bar{J}$ ).

Eq. (23) can be solved using the bisection method, since $A(J)$ is a monotonically decreasing function of $J$ . The number of iterations is bounded by $\mbox{ceil}(\mbox{log}_{2}(V))$ .
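A bisection search over integer values of $J$ can be sketched as follows. The function name `equilibrium_point` and the drift function in the example are hypothetical stand-ins for $A(J)$ , which in the actual analysis is computed from the lock contention model via Eq. (22); only monotonicity of the drift is assumed.

```python
def equilibrium_point(drift, v):
    """Integer J in [1, v] closest to the zero of a decreasing drift A(J)."""
    lo, hi = 1, v
    if drift(lo) <= 0:
        return lo
    if drift(hi) >= 0:
        return hi
    # Invariant: drift(lo) > 0 > drift(hi); the interval halves each step.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        a = drift(mid)
        if a == 0:
            return mid
        if a > 0:
            lo = mid
        else:
            hi = mid
    # Return the neighbor whose drift is smaller in magnitude.
    return lo if drift(lo) <= -drift(hi) else hi


if __name__ == "__main__":
    # Hypothetical decreasing drift with a zero at J = 6.3.
    print(equilibrium_point(lambda j: 6.3 - j, 16))  # 6
```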

The interpretation of this relationship is that $\bar{J}$ is the system’s balance point, such that the system tends to stay there [16]. That study dealt with thrashing in overloaded virtual memory systems, where the system throughput increases as the MultiProgramming Level - MPL is increased, but drops beyond a certain MPL. This phenomenon is explored in the context of 2-Phase Locking - 2PL in [62], using a simple model in which the degree of txn concurrency is increased.

The above analysis is motivated by Equilibrium Point Analysis - EPA, which was applied to the analysis of multiaccess protocols in Tasaka [53]: “EPA is a fluid-type approximation which is only applied to the steady state. It assumes that the system is always at an equilibrium point. Therefore, EPA does not necessitate calculating state transition probabilities. An equilibrium point can easily be obtained by numerically solving a set of simultaneous nonlinear equations.”

An application of EPA is illustrated by the example in Figure 20 in [19], where there are $M=18$ users and $M/Z=0.9$ , so that the think time is $Z=18/0.9=20$ . The arrival rate of requests to the computer system is $a(N)=(M-N)/Z$ . The mean number of requests at the computer system is given approximately by the intersection of the throughput characteristic $T(N)$ and $a(N)$ , i.e., the calculation is simplified by using the intersection of the two graphs to determine $N_{intersect}\approx\bar{N}$ . Otherwise we have to solve the following set of balance equations for $p(N)$ , initially setting $p(0)=1$ and then normalizing so that the probabilities add to one.

[(M-N+1)/Z]p(N-1)=T(N)p(N),1\leq N\leq M

such that:

p(0)=[1+\sum_{N=1}^{M}p(N)]^{-1}.
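The two approaches can be compared with a short sketch. It assumes the birth-death balance $a(N-1)p(N-1)=T(N)p(N)$ with $a(N)=(M-N)/Z$ , and it uses a made-up throughput characteristic $T(N)=N/(N+2)$ , which is not the curve of Figure 20 in [19]; the function names are hypothetical.

```python
def population_distribution(m, z, throughput):
    """Solve a(N-1) p(N-1) = T(N) p(N) for N = 1..m and normalize."""
    weights = [1.0]                        # unnormalized p(0) = 1
    for n in range(1, m + 1):
        arrival_rate = (m - (n - 1)) / z   # a(N-1) = (M - N + 1) / Z
        weights.append(weights[-1] * arrival_rate / throughput(n))
    total = sum(weights)
    return [w / total for w in weights]


def intersection_point(m, z, throughput, tol=1e-9):
    """Bisection for the N where a(N) = (M - N)/Z meets T(N).

    Assumes a(N) - T(N) is decreasing in N and changes sign on [0, m].
    """
    lo, hi = 0.0, float(m)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if (m - mid) / z > throughput(mid):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    def t(n):
        # Hypothetical throughput characteristic, for illustration only.
        return n / (n + 2.0)

    probs = population_distribution(18, 20.0, t)
    mean_n = sum(n * p for n, p in enumerate(probs))
    print(f"mean N = {mean_n:.2f}, "
          f"intersection = {intersection_point(18, 20.0, t):.2f}")
```

The intersection requires only a few evaluations of $T(N)$ , whereas the exact distribution requires $T(N)$ for all $1\leq N\leq M$ .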

Acknowledgements

This paper is partially based on papers the author coauthored with PhD student Paul Bay, most notably Thomasian and Bay [59]. The Appendix is based on Ryu and Thomasian [44].

References

[2] T. L. Adam, K. M. Chandy, and J. R. Dickson. A comparison of list schedules for parallel processing systems. Commun. ACM 17, 12 (1974), 685-690.
[3] M. Ajmone Marsan, G. Conte, and G. Balbo: A class of generalized stochastic Petri nets for the performance evaluation of multiprocessor systems. ACM Trans. on Computer Systems 2, 2 (May 1984), 93-122.
[4] M. Ajmone Marsan, G. Conte, G. Balbo, S. Donatelli, and G. Franceschinis. Modelling with Generalised Stochastic Petri Nets. John Wiley & Sons, 1995.
[5] F. Baskett, K. M. Chandy, R. R. Muntz, and F. G. Palacios. Open, closed, and mixed networks of queues with different classes of customers. J. ACM 22, 2 (1975), 248-260.
[6] G. Bolch, S. Greiner, H. de Meer, and K. S. Trivedi. Queueing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications, 2nd ed. Wiley-Interscience, 2006.
[7] J. A. Buzacott and J. G. Shanthikumar. Stochastic Models of Manufacturing Systems. Prentice Hall, 1993.
[8] J. P. Buzen. Computational algorithms for closed queueing networks with exponential servers. Commun. ACM 16(9): 527-531 (1973).
[9] J. P. Buzen, R. P. Goldberg, A. M. Langer, E. S. Lentz, H. S. Schwenk, D. A. Sheetz, and A. W. Shum. BEST/1 - Design of a tool for computer system capacity planning. In Proc. AFIPS National Computer Conf. - NCC 1978, 447-455.
[10] J. P. Buzen. A Queueing Network Model of MVS. ACM Computing Surveys 10(3): 319-331 (1978).
[11] K. M. Chandy and C. H. Sauer. Computational algorithms for product form queueing networks. Commun. ACM 23, 10 (Oct. 1980), 573-583.
[12] K. Mani Chandy and D. Neuse. Linearizer: A heuristic algorithm for queueing network models of computing systems. Commun. ACM 25, 2 (Feb. 1982), 126-134.
[13] W. W. Chu and K. K. Leung, Module replication and assignment for real-time distributed processing systems. Proc. IEEE 75, 5 (May 1987), pp. 547-562.
[14] W. W. Chu, C. Sit, and K. K. Leung. Estimating task response time for real-time distributed systems with resource contentions. IEEE Trans. on Software Engineering 17(10): 1076-1092 (October 1991).
[15] E. G. Coffman Jr. and P. J. Denning. Operating Systems Theory. Prentice-Hall 1973.
[16] P.-J. Courtois. Decomposability, instabilities, and saturation in multiprogramming systems. Commun. ACM 18(7): 371-377 (1975).
[17] A. Dasgupta. A formula for the expected value of the maximum of three independent normals and a sparse high dimensional case. Statistics Dept. at Purdue Univ. downloaded 2023
https://www.stat.purdue.edu/~dasgupta/orderstat.pdf
[18] H. A. David and H. N. Nagaraja. Order Statistics, 3rd edition, Wiley-Interscience 2003.
[19] P. J. Denning and J. P. Buzen: The operational analysis of queueing network models. ACM Computing Surveys 10, 3 (1978), 225-261.
[20] E. B. Fernandez and B. Bussell. Bounds on the number of processors and time for multiprocessor optimal schedules. IEEE Trans. Computers 22, 8 (1973), 745-751.
[21] P. A. Franaszek, J. T. Robinson, and A. Thomasian. Concurrency control for high contention environments. ACM Trans. Database Systems 17, 2 (1992), 304-345.
[22] P. Heidelberger and K. S. Trivedi: Queueing Network Models for Parallel Processing with Asynchronous Tasks. IEEE Trans. Computers 31, 11 (Nov. 1982), 1099-1109.
[23] P. Heidelberger and K. S. Trivedi: Analytic Queueing Models for Programs with Internal Concurrency. IEEE Trans. Computers 32, 1 (Jan. 1983), 73-82.
[24] J. J. Hunter. Variances of first passage times in a Markov Chain with applications to mixing times Res. Lett. Inf. Math. Sci., 10 (2006), 17-48.
[25] J. R. Jackson. Networks of Waiting Lines. Operations Research 5, 4 (1957), 516-521.
[26] A. Kapelnikov, R. R. Muntz, and M. D. Ercegovac. A Modeling Methodology for the Analysis of Concurrent Systems and Computations. J. Parallel Distributed Computing 6, 3 (1989), 568-597.
[27] A. Kapelnikov, R. R. Muntz, and M. D. Ercegovac. A methodology for performance analysis of parallel computations with looping constructs. J. Parallel Distributed Computing 14, 2 (1992), 105-120.
[28] L. Kleinrock. Queueing Systems, Vol I: Theory. Wiley-Interscience 1975.
[29] L. Kleinrock. Queueing Systems, Vol. II: Computer Applications, Wiley-Interscience 1976.
[30] H. Kobayashi. System Design and Performance Analysis Using Analytic Models. Chapter 3 in K. M. Chandy and R. T. Yeh. Current Trends in Programming Methodology, Vol. III: Software Modeling, Prentice-Hall 1978, 72-114.
[31] H. Kobayashi and B. L. Mark. System Modeling and Analysis: Foundations of System Performance Evaluation. Pearson, 2009.
[32] K. C.-Y. Kung. Concurrency in Parallel Processing Systems. Ph.D. Dissertation. Computer Science Department, UCLA, 1984.
[33] S. S. Lavenberg. Computer Performance Modeling Handbook. Academic Press 1983.
[34] E. D. Lazowska, J. Zahorjan, G. Scott Graham, and K. C. Sevcik: Quantitative System Performance: Computer System Analysis Using Queueing Network Models Prentice-Hall 1984.
[35] D. F. Martin. The Automatic Assignment and Sequencing of Computations on Parallel Processor Systems. Ph.D. Thesis, U. of California, Los Angeles, Jan. 1966.
[36] D. A. Menasce and V. A. F. Almeida. Analytic Models of Supercomputer Performance in Multiprogramming Environments. Int’l J. High Performance Computing Applications 3, 2 (1989), 71-91.
[37] D. A. Menasce and L. A. Barroso. A methodology for performance evaluation of parallel applications on multiprocessors. J. Parallel Distributed Computing - JPDC 14, 1 (1992), 1-14.
[38] D. A. Menasce, D. Saha, S. C. S. Porto, V. Almeida, and S. K. Tripathi. Static and dynamic processor scheduling disciplines in heterogeneous parallel architectures. J. Parallel Distributed Computing - JPDC 28, 1 (Jan. 1995), 1-18.
[39] M. K. Molloy: Performance analysis using stochastic Petri nets. IEEE Trans. Computers 31(9): 913-917 (1982)
[40] J. L. Peterson. Petri Net Theory and the Modeling of Systems. Prentice-Hall 1981.
[41] M. Reiser and H. Kobayashi. Queuing Networks with Multiple Closed Chains: Theory and Computational Algorithms. IBM J. Research & Development 19, 3 (1975), 283-294.
[42] M. Reiser and S. S. Lavenberg. Mean-value Analysis of Closed Multi-chain Queuing Networks. J. ACM 27, 2 (1980), 313-322.
[43] M. Reiser: Mean value analysis: A personal account. Performance Evaluation 2000, 491-504
[44] I. K. Ryu and A. Thomasian. Analysis of database performance with dynamic locking. J. ACM 37, 3 (1990), 491-523.
[45] R. A. Sahner, K. S. Trivedi, and A. Puliafito. Performance and Reliability Analysis of Computer Systems: An Example-Based Approach Using the SHARPE Software Package. Kluwer 1996.
[46] S. Salza and S. S. Lavenberg. Approximating response time distributions in closed queueing network models of computer performance. In Proc. 8th Int’l Symp. on Computer Performance Modelling, Measurement and Evaluation, 1981, F. J. Kylstra, Ed., 133-144.
[47] C. H. Sauer and K. M. Chandy. Approximate analysis of central server models. IBM J. Research & Development 19, 3 (1975), 301-313.
[48] C. H. Sauer, M. Reiser, and E. A. MacNair. RESQ — A package for solution of generalized queueing networks. In Proc. Nat’l Computer Conf. 1977, 978-986.
[49] C. H. Sauer. Approximate solution of queueing networks with simultaneous resource possession. IBM J. Research & Development 25, 6 (Nov.-Dec. 1981), 894-903.
[50] C. H. Sauer and E. A. MacNair. Extended Queueing Network Models. Chapter 8 in Computer Performance Modeling Handbook, S. S. Lavenberg (ed.), 1983.
[51] H. D. Schwetman. Hybrid simulation models of computer systems. Commun. ACM 21, 9 (Sept. 1978), 718-723.
[52] W. J. Stewart. Probability, Markov Chains, Queues, and Simulation: The Mathematical Basis of Performance Modeling Princeton Univ. Press. 2009.
[53] S. Tasaka. Performance Analysis of Multiple Access Protocols. The MIT Press 1986.
[54] A. Thomasian and B. Nadji. Algorithms for queueing network models of multiprogrammed computer systems. Computer Performance 2, 3 (Sept. 1981), 100-123.
[55] A. Thomasian and I. K. Ryu. A decomposition solution to the queueing network model of the centralized DBMS with static locking. In Proc. ACM SIGMETRICS on Measurement and Modeling of Computer Systems - SIGMETRICS 1983, 82-92.
[56] A. Thomasian and P. F. Bay. Queueing network models for parallel processing of task systems. In Proc. Int’l Conf. on Parallel Processing - ICPP 1983, 421-428
[57] A. Thomasian and K. Gargeya. Speeding up computer system simulations using hierarchical modeling. ACM SIGMETRICS Perform. Evaluation Review 12, 4 (1984), 34-39.
[58] A. Thomasian. Performance evaluation of centralized databases with static locking. IEEE Trans. Software Eng. TSE-11, 4 (April 1985), 346-355.
[59] A. Thomasian and P. F. Bay. Analytic queueing network models for parallel processing of task systems. IEEE Trans. Computers 35, 12 (Dec. 1986), 1045-1054.
[60] A. Thomasian. A performance study of dynamic load balancing in distributed systems. In Proc. Int’l Conf. on Distributed Computing Systems - ICDCS 1987: 178-184
[61] A. Thomasian and I. K. Ryu. Performance analysis of two-phase locking. IEEE Trans. Software Eng. 17, 5 (May 1991), 386-402.
[62] A. Thomasian. Two-phase locking performance and its thrashing behavior. ACM Trans. Database Systems 18, 4 (1993), 579-625.
[63] A. Thomasian. Survey and analysis of disk scheduling methods. ACM SIGARCH Computer Architecture Newsletter 39, 2 (2011), 8-25.
[64] A. Thomasian: Analysis of fork/join and related queueing systems. ACM Computing Surveys 47, 2 (Aug. 2014), 17:1-17:71.
[65] A. Thomasian. Unbalanced job approximation using Taylor series expansion and review of performance bounds. https://doi.org/10.48550/arXiv.2309.15172
[66] K. S. Trivedi. Probability and Statistics with Reliability, Queueing and Computer Science Applications, 2nd ed. Wiley 2001.
[67] K. S. Trivedi and A. Bobbio. Reliability and Availability Engineering: Modeling, Analysis, and Applications. Cambridge Univ. Press, 2017.
[68] V. L. Wallace and R. S. Rosenberg. Markovian models and numerical analysis of computer system behavior. In Proc. AFIPS Spring Joint Computer Conf. - SJCC 1966, Vol. 27, 141-148.
[69] P. D. Welch. Statistical Analysis of Simulation Results. Chapter 6 in Computer Performance Modeling Handbook, S. S. Lavenberg (ed.), 1983.
[70] A. C. Williams and R. A. Bhandiwad. A generating function approach to queueing network analysis of multiprogrammed computer systems. Networks 6, 1 (1976), 1-22.