Guidelines for Preparing a Paper for the
European Conference on Artificial Intelligence

First Author (Corresponding Author. Email: [email protected]), Second Author, Third Author
Short Affiliation of First Author
Short Affiliation of Second Author and Third Author
Abstract

The purpose of this paper is to show a contributor the required style for a paper submitted to the ECAI conference. Authors should realize that once a paper is accepted, the final manuscript submitted by the author will be almost identical to the final, published version that appears in the book, except for pagination and the insertion of running headlines. Authors of accepted papers should submit one final zip/rar file containing only one, final version of the paper, including all files necessary to compile the paper as well as a compiled PDF output. Proofreading as regards technical content and English usage is the responsibility of the author. ECAI does not accept submissions in MS Word. If you have any questions regarding the instructions, please contact the IOS Press Book Department through iospress.com/contact. The abstract should contain no more than 200 words.


1 Page limit

The page limit for ECAI scientific papers is 7 pages, plus one (1) additional page for references only. Scientific papers should report on substantial novel results. The reference list may start earlier than page 8, but only references are allowed on this additional eighth page. The page limit for ECAI highlights is 2 pages. They are intended for disseminating recent technical work (published elsewhere), position statements, or open problems with clear and concise formulations of current challenges.

Please consult the most recent Call For Papers (CFP) for the most up-to-date detailed instructions.

Page limits are strict. Overlength submissions will be rejected without review.

2 General specifications

The following details should allow contributors to set up the general page description for their paper:

  1. The paper is set in two columns each 20.5 picas (86 mm) wide with a column separator of 1.5 picas (6 mm).

  2. The typeface is Times Modern Roman.

  3. The body text size is 9 point (3.15 mm) on a body of 11 point (3.85 mm) (i.e., 61 lines of text).

  4. The effective text height for each page is 56 picas (237 mm). The first page has less text height. It requires an additional footer space of 3.5 picas (14.8 mm) for the copyright inserted by the publisher and 1.5 picas (6 mm) of space before the title. The effective text height of the first page is 51 picas (216 mm).

  5. There are no running feet for the final camera-ready version of the paper. The submission paper should have page numbers in the running feet.
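As a quick consistency check of the figures in this list, the line count and the first-page text height can be recomputed by hand. The sketch below assumes the usual convention of 12 points to the pica; the rounding down to whole lines is our reading of "61 lines of text".

```latex
\begin{align*}
56~\text{picas} \times 12~\text{pt/pica} &= 672~\text{pt} \\
672~\text{pt} \div 11~\text{pt per line} &\approx 61.1 \quad\Rightarrow\quad 61~\text{lines} \\
56 - 3.5 - 1.5 &= 51~\text{picas (first page)}
\end{align*}
```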

3 Title, author, affiliation, copyright and running feet

3.1 Title

The title is set in 20 point (7 mm) bold with leading of 22 point (7.7 mm), centered over the full text measure, with 1.5 picas (6 mm) of space before and after.

3.2 Author

The author’s name is set in 11 point (3.85 mm) bold with leading of 12 point (4.2 mm), centered over full text measure, with 1.5 picas (6 mm) of space below. A footnote indicator is set in 11 point (3.85 mm) medium and positioned as a superscript character.

3.3 Affiliation

The affiliation is set as a footnote to the first column. This is set in 8 point (2.8 mm) medium with leading of 8.6 point (3.1 mm), with a 1 point (0.35 mm) footnote rule to column width.

3.4 Copyright

The copyright details will be inserted by the publisher.

3.5 Running feet

The running feet are inserted by the publisher. For submission you may insert page numbers in the middle of the running feet. Do not, however, insert page numbers for the camera-ready version of the paper.

4 Abstract

The abstract for the paper is set in 9 point (3.15 mm) medium, on a body of 10 point (3.5 mm). The word Abstract is set in bold, followed by a full point and a 0.5 pica space.

5 Headings

Three heading levels have been specified:

  1. A level headings

    • The first level of heading is set in 11 point (3.85 mm) bold, on a body of 12 point (4.2 mm), 1.5 lines of space above and 0.5 lines of space below.

    • The heading is numbered to one digit with a 1 pica space separating it from the text.

    • The text is keyed in capitals and is unjustified.

  2. B level headings

    • The second level of heading is set in 11 point (3.85 mm) bold, on a body of 12 point (4.2 mm), 1.5 lines of space above and 0.5 lines of space below.

    • The heading is numbered to two digits separated with a full point, with a 1 pica space separating it from the text.

    • The text is keyed in upper and lower case with an initial capital for first word only, and is unjustified.

  3. C level headings

    • The third level of heading is set in 10 point (3.5 mm) italic, on a body of 11 point (3.85 mm), 1.5 lines of space above and 0.5 lines of space below.

    • The heading is numbered to three digits separated with a full point, with a 1 pica space separating it from the text.

    • The text is keyed in upper and lower case with an initial capital for first word only, and is unjustified.

  4. Acknowledgements

    This heading is the same style as an A level heading but is not numbered.

6 Text

The first paragraph of text following any heading is set to the complete measure (i.e., do not indent the first line).

Subsequent paragraphs are set with the first line indented by 1 pica (4.2 mm).

There is no inter-paragraph spacing.

7 Lists

The list identifier may be an arabic number, a bullet, an em rule or a roman numeral.

The items in a list are set in text size and indented by 1 pica (4.2 mm) from the left margin. Half a line of space is set above and below the list to separate it from surrounding text.

See layout of Section 5 on headings to see the results of the list macros.

8 Tables

Tables are set in 8 point (2.8 mm) on a body of 10 point (3.5 mm). The table caption is set centered at the start of the table, with the word Table and the number in bold. The caption is set in medium with a 1 pica (4.2 mm) space separating it from the table number.

A one line space separates the table from surrounding text.

Table 1: The table caption is centered on the table measure. If it extends to two lines each is centered.

                              Processors
            1             2                        4
Window      ◇       ◇       □       △       ◇       □       △
  1       1273     110    21.79    89%    6717    22.42    61%
  2       2145     116    10.99    50%    5386    10.77    19%
  3       3014     117    41.77    89%    7783    42.31    58%
  4       4753     151    71.55    77%    7477    61.97    49%
  5       5576     148    61.60    80%    7551    91.80    45%

◇ execution time in ticks   □ speed-up values   △ efficiency values

9 Figures

A figure caption is set centered in 8 point (2.8 mm) medium on a leading of 10 point (3.5 mm). It is set under the figure, with the word Figure and the number in bold and with a 1 pica (4.2 mm) space separating the caption text from the figure number.

One line of space separates the figure from the caption. A one line space separates the figure from surrounding text.

Figure 1: Network of transputers and the structure of individual processes

10 Equations

A display equation is numbered, using arabic numbers in parentheses. It is centered and set with one line of space above and below to separate it from surrounding text. The following example is a simple single line equation:

\[ Ax = b \tag{1} \]

The next example is a multi-line equation:

\begin{align}
(x+y)(x-y) &= x^{2} - xy + xy - y^{2} \tag{2} \\
(x+y)^{2} &= x^{2} + 2xy + y^{2} \tag{3}
\end{align}

The equal signs are aligned in a multi-line equation.

11 Program listings

Program listings are set in 9 point (3.15 mm) Courier on a leading of 11 point (3.85 mm). That is to say, a non-proportional font is used to ensure the correct alignment.

A one line space separates the program listing from surrounding text.

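#include <stdio.h>

/* Increment the integer that x points to. */
void inc(int *x)
{
    (*x)++;  /* parentheses matter: *x++ would advance the pointer instead */
}

int main(void)
{
    int y = 1;
    inc(&y);
    printf("%d\n", y);
    return 0;
}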

12 Theorems

The text of a theorem is set in 9 point (3.15 mm) italic on a leading of 11 point (3.85 mm). The word Theorem and its number are set in 9 point (3.15 mm) bold.

A one line space separates the theorem from surrounding text.

Theorem 12.1.

Let us assume this is a valid theorem. In reality it is a piece of text set in the theorem environment.

13 Footnotes

Footnotes are set in 8 point (2.8 mm) medium with leading of 8.6 point (3.1 mm), with a 1 point (0.35 mm) footnote rule to column width.¹

¹ This is an example of a footnote that occurs in the text. If the text runs to two lines the second line aligns with the start of text in the first line.

14 References

The reference identifier in the text is set as a sequential number in square brackets. The reference entry itself is set in 8 point (2.8 mm) with a leading of 10 point (3.5 mm), and the list of references is sorted alphabetically.

15 Sample coding

The remainder of this paper contains examples of the specifications detailed above and can be used for reference if required.

16 Programming model

Our algorithms were implemented using the single program, multiple data model (SPMD). SPMD involves writing a single program that runs on all the processors co-operating on a task. The data are partitioned among the processors, each of which knows which portion of the data it will work on [kn:Golub89].

16.1 Structure of processes and processors

The grid has $P = P_{\rm r} \times P_{\rm c}$ processors, where $P_{\rm r}$ is the number of rows of processors and $P_{\rm c}$ is the number of columns of processors.

16.1.1 Routing information on the grid

A message may be either broadcast or specific. A broadcast message originates on a processor and is relayed through the network until it reaches all other processors. A specific message is one that is directed to a particular target processor.

Broadcast messages originate from a processor called central which is situated in the ‘middle’ of the grid. This processor has co-ordinates $(\lfloor P_{\rm r}/2 \rfloor, \lfloor P_{\rm c}/2 \rfloor)$. Messages are broadcast using the row–column broadcast algorithm (RCB), which uses the following strategy. The number of steps required to complete the RCB algorithm (i.e., until all processors have received the broadcast value) is given by $\lfloor P_{\rm r}/2 \rfloor + \lfloor P_{\rm c}/2 \rfloor$.
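The two formulas above translate directly into integer arithmetic. The following is a minimal sketch only: the names `GridPos`, `central_processor` and `rcb_steps` are hypothetical, and the co-ordinates are taken to be non-negative integers so that C integer division coincides with the floor.

```c
#include <assert.h>

/* Position of a processor on the grid (row, column). Hypothetical type. */
typedef struct {
    int row;
    int col;
} GridPos;

/* Co-ordinates of the central processor: (floor(Pr/2), floor(Pc/2)). */
GridPos central_processor(int pr, int pc)
{
    assert(pr > 0 && pc > 0);
    /* For positive operands, integer division in C equals the floor. */
    GridPos c = { pr / 2, pc / 2 };
    return c;
}

/* Step count quoted in the text for RCB: floor(Pr/2) + floor(Pc/2). */
int rcb_steps(int pr, int pc)
{
    assert(pr > 0 && pc > 0);
    return pr / 2 + pc / 2;
}
```

For a 4 × 6 grid, for instance, `central_processor(4, 6)` gives (2, 3) and `rcb_steps(4, 6)` gives 5.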

A specific message is routed through the processors using the find-row–find-column algorithm (FRFC) detailed in [kn:deCarlini91]. The message is sent from the originator processor vertically until it reaches a processor sitting in the same row as the target processor. The message is then moved horizontally across the processors in that row until it reaches the target processor. An accumulation based on the recursive doubling technique [kn:Modi88, pp. 56–61] would require the same number of steps as the RCB requires. If the originator and target processors share the same row or the same column, the message travels only in a horizontal or vertical direction, respectively (see [kn:Smith85]).
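The routing rule just described (vertical first, then horizontal) can be sketched as a next-hop decision made at each processor. This is an illustration under assumptions, not the implementation from [kn:deCarlini91]: the names `Hop`, `frfc_next_hop` and `frfc_hops` are hypothetical, and row indices are assumed to grow downwards and column indices to the right.

```c
#include <assert.h>

/* Direction of the next hop; HOP_NONE means the message has arrived. */
typedef enum { HOP_NONE, HOP_UP, HOP_DOWN, HOP_LEFT, HOP_RIGHT } Hop;

/* Find-row first: move vertically until the target row is reached,
 * then find-column: move horizontally along that row. */
Hop frfc_next_hop(int row, int col, int target_row, int target_col)
{
    if (row < target_row) return HOP_DOWN;
    if (row > target_row) return HOP_UP;
    if (col < target_col) return HOP_RIGHT;
    if (col > target_col) return HOP_LEFT;
    return HOP_NONE;
}

/* Number of hops a specific message takes from originator to target. */
int frfc_hops(int row, int col, int target_row, int target_col)
{
    assert(row >= 0 && col >= 0 && target_row >= 0 && target_col >= 0);
    int hops = 0;
    Hop hop;
    while ((hop = frfc_next_hop(row, col, target_row, target_col)) != HOP_NONE) {
        switch (hop) {
        case HOP_DOWN:  row++; break;
        case HOP_UP:    row--; break;
        case HOP_RIGHT: col++; break;
        case HOP_LEFT:  col--; break;
        default:        break;
        }
        hops++;
    }
    return hops;
}
```

When originator and target share a row, only horizontal hops are produced; when they share a column, only vertical ones.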

17 Data partitioning

We use data partitioning by contiguity, defined in the following way. To partition the data (i.e., vectors and matrices) among the processors, we divide the set of variables $V = \{\,i\,\}_{i=1}^{N}$ into $P$ subsets $\{\,W_p\,\}_{p=1}^{P}$ of $s = N/P$ elements each. We assume without loss of generality that $N$ is an integer multiple of $P$. We define each subset as $W_p = \{(p-1)s + j\}_{j=1}^{s}$ (see [kn:Schofield89], [kn:daCunha92a] and [kn:Atkin] for details).

Each processor $p$ is responsible for performing the computations over the variables contained in $W_p$. In the case of vector operations, each processor will hold segments of $s$ variables. The data partitioning for operations involving matrices is discussed in Section 18.3.
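Partitioning by contiguity amounts to simple index arithmetic. The sketch below keeps the 1-based indices of the definition of the subsets; the names `IndexRange`, `partition_range` and `owner_of` are hypothetical and introduced for illustration only.

```c
#include <assert.h>

/* Inclusive, 1-based range of variable indices held by one processor. */
typedef struct {
    int first;
    int last;
} IndexRange;

/* Range of the subset for processor p: (p-1)s + 1 up to (p-1)s + s. */
IndexRange partition_range(int n, int num_procs, int p)
{
    assert(n > 0 && num_procs > 0);
    assert(n % num_procs == 0);        /* N is an integer multiple of P */
    assert(p >= 1 && p <= num_procs);
    int s = n / num_procs;             /* elements per processor */
    IndexRange r = { (p - 1) * s + 1, p * s };
    return r;
}

/* Processor (1-based) responsible for variable i (1-based). */
int owner_of(int n, int num_procs, int i)
{
    assert(n > 0 && num_procs > 0 && n % num_procs == 0);
    assert(i >= 1 && i <= n);
    int s = n / num_procs;
    return (i - 1) / s + 1;
}
```

With 12 variables on 4 processors, processor 3 holds variables 7 to 9, and `owner_of(12, 4, 7)` returns 3.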

18 Linear algebra operations

18.1 Saxpy

The saxpy $w = u + \alpha v$ operation, where $u$, $v$ and $w$ are vectors and $\alpha$ is a scalar value, has the characteristic that its computation is disjoint elementwise with respect to $u$, $v$ and $w$. This means that we can compute a saxpy without any communication between processors; the resulting vector $w$ does not need to be distributed among the processors. Parallelism is exploited in the saxpy by the fact that $P$ processors will compute the same operation with a substantially smaller amount of data. The saxpy is computed as

\[ w_i = u_i + \alpha v_i, \quad \forall i \in \{W_p\}_{p=1}^{P} \tag{4} \]
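Because the saxpy is elementwise, each processor can run the same loop over its own segment of $s$ elements with no communication. A minimal sketch of that local kernel follows; the name `saxpy_local` and the use of `double` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Local saxpy over one processor's segment: w[i] = u[i] + alpha * v[i].
 * u, v and w each point to the s locally stored elements. */
void saxpy_local(size_t s, double alpha,
                 const double *u, const double *v, double *w)
{
    assert(u != NULL && v != NULL && w != NULL);
    for (size_t i = 0; i < s; i++) {
        w[i] = u[i] + alpha * v[i];
    }
}
```

Every processor calls the same routine on different data, which is the SPMD pattern in its simplest form.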

18.2 Inner-product and vector 2-norm

The inner-product $\alpha = u^{T} v = \sum_{i=1}^{N} u_i v_i$ is an operation that involves accumulation of data, implying a high level of communication between all processors. The mesh topology and the processes architecture used allowed a more efficient use of the processors than, for instance, a ring topology, reducing the time that processors are idle waiting for the computed inner-product value to arrive, but the problem still remains. The use of the SPMD paradigm also implies the global broadcast of the final computed value to all processors.

The inner-product is computed in three distinct phases. Phase 1 is the computation of partial sums of the form

\[ \alpha_p = \sum_{\forall i \in \{W_p\}} u_i \times v_i, \quad p = 1, \ldots, P \tag{5} \]

The accumulation phase of the inner-product using the RCA algorithm is completed in the same number of steps as the RCB algorithm (Section 16.1.1). This is because of the need to relay partial values between processors without any accumulation taking place, owing to the connectivity of the grid topology.

The vector 2-norm $\alpha = ||\,u\,||_2 = \sqrt{u^{T} u}$ is computed using the inner-product algorithm described above. Once the inner-product value is received by a processor during the final broadcast phase, it computes the square root of that value giving the required 2-norm value.
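The phases of the inner-product can be illustrated with a serial stand-in in which a loop over $p$ plays the role of the $P$ processors. This sketch does not model the grid communication at all: the accumulation is a plain sum, and the names `partial_inner_product` and `inner_product` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Phase 1: partial sum for processor p (1-based) over its s elements. */
double partial_inner_product(const double *u, const double *v,
                             size_t s, size_t p)
{
    assert(p >= 1);
    size_t first = (p - 1) * s;   /* 0-based offset of this segment */
    double alpha_p = 0.0;
    for (size_t j = 0; j < s; j++) {
        alpha_p += u[first + j] * v[first + j];
    }
    return alpha_p;
}

/* Accumulation phase, simulated serially: sum the P partial values.
 * On the real grid the result would then be broadcast to all processors. */
double inner_product(const double *u, const double *v,
                     size_t n, size_t num_procs)
{
    assert(num_procs > 0 && n % num_procs == 0);
    size_t s = n / num_procs;
    double alpha = 0.0;
    for (size_t p = 1; p <= num_procs; p++) {
        alpha += partial_inner_product(u, v, s, p);
    }
    return alpha;
}
```

The 2-norm then follows by applying `sqrt` from `<math.h>` to `inner_product(u, u, n, num_procs)`.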

18.3 Matrix–vector product

For the matrix–vector product $v = Au$, we use a column partitioning of $A$. Each processor holds a set $W_p$ (see Section 17) of $s$ columns each of $N$ elements of $A$ and $s$ elements of $u$. The $s$ elements of $u$ stored locally have a one-to-one correspondence to the $s$ columns of $A$ (e.g. a processor holding element $u_j$ also holds the $j$-th column of $A$). Note that whereas we have $A$ partitioned by columns among the processors, the matrix–vector product is to be computed by rows.

The algorithm for computing the matrix–vector product using column partitioning is a generalization of the inner-product algorithm described in Section 18.2 (without the need for a final broadcast phase). At a given time during the execution of the algorithm, each one of $P-1$ processors is computing a vector $w$ of $s$ elements containing partial sums required for the segment of the vector $v$ in the remaining ‘target’ processor. After this computation is complete, each of the $P$ processors stores a vector $w$. The resulting segment of the matrix–vector product vector which is to be stored in the target processor is obtained by summing together the $P$ vectors $w$, as described below.

Each processor other than the target processor sends its $w$ vector to one of its neighboring processors. A processor decides whether to send the vector in either the row or column direction to reach the target processor based on the FRFC algorithm (see Section 16.1.1). If a vector passes through further processors in its route to the target processor the $w$ vectors are accumulated. Thus the target processor will receive at most four $w$ vectors which, when summed to its own $w$ vector, yield the desired set of $s$ elements of $v$.
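The arithmetic behind the column-partitioned product can be checked with a serial stand-in: for each target processor $q$, every processor $p$ contributes a partial vector $w$ built from its own $s$ columns, and the contributions are summed. The sketch ignores the routing and the order of accumulation on the grid; the row-major storage of $A$ and the names `add_partial_w` and `matvec_by_columns` are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Add processor p's partial vector w, computed for target processor q,
 * into v_seg (the s elements of v owned by q). a is an n-by-n matrix
 * stored in row-major order; p and q are 1-based. */
void add_partial_w(const double *a, const double *u, size_t n, size_t s,
                   size_t p, size_t q, double *v_seg)
{
    size_t col0 = (p - 1) * s;   /* first column held by processor p */
    size_t row0 = (q - 1) * s;   /* first row of the segment owned by q */
    for (size_t i = 0; i < s; i++) {
        double w_i = 0.0;
        for (size_t j = 0; j < s; j++) {
            w_i += a[(row0 + i) * n + (col0 + j)] * u[col0 + j];
        }
        v_seg[i] += w_i;         /* accumulation of the w vectors */
    }
}

/* Serial stand-in for v = A u with A partitioned by columns. */
void matvec_by_columns(const double *a, const double *u, size_t n,
                       size_t num_procs, double *v)
{
    assert(num_procs > 0 && n % num_procs == 0);
    size_t s = n / num_procs;
    for (size_t q = 1; q <= num_procs; q++) {
        double *v_seg = v + (q - 1) * s;
        for (size_t i = 0; i < s; i++) {
            v_seg[i] = 0.0;
        }
        for (size_t p = 1; p <= num_procs; p++) {
            add_partial_w(a, u, n, s, p, q, v_seg);
        }
    }
}
```

Each call to `add_partial_w` touches only the columns of $A$ and the elements of $u$ held by processor $p$, which is why no element of $u$ needs to be exchanged.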

18.4 Matrix–vector product—finite-difference approximation

We now consider a preconditioned version of the conjugate-gradients method [kn:Golub89]. Note that we do not need to form $A$ explicitly. This implies a very low degree of information exchange between the processors which can be effectively exploited with transputers, since the required values of $u$ can be exchanged independently through each link.

The preconditioning used in our implementations is the polynomial preconditioning (see [kn:Saad85], [kn:Eisenstat81], [kn:Adams85] and [kn:Johnson83]), which can be implemented very efficiently in a parallel architecture since it is expressed as a sequence of saxpys and matrix–vector products.

We have $l$ rows and columns in the discretization grid, which we want to partition among a $P_{\rm r} \times P_{\rm c}$ mesh of processors. Each processor will then carry out the computations associated with a block of $\lfloor l/P_{\rm r} \rfloor + \mathrm{sign}(l \bmod P_{\rm r})$ rows and $\lfloor l/P_{\rm c} \rfloor + \mathrm{sign}(l \bmod P_{\rm c})$ columns of the interior points of the grid.
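The block dimensions above can be evaluated with integer arithmetic. In this sketch the sign function is only ever applied to a non-negative remainder, so it reduces to 0 or 1; the names `sign_nonneg` and `block_extent` are hypothetical.

```c
#include <assert.h>

/* sign(x) restricted to x >= 0: 0 when x is zero, 1 otherwise. */
int sign_nonneg(int x)
{
    assert(x >= 0);
    return x > 0 ? 1 : 0;
}

/* Rows (or columns) per block: floor(l / procs) + sign(l mod procs).
 * Pass Pr to obtain the row count and Pc to obtain the column count. */
int block_extent(int l, int procs)
{
    assert(l > 0 && procs > 0);
    return l / procs + sign_nonneg(l % procs);
}
```

For positive arguments this equals the ceiling of `l / procs`: a grid of 10 rows on 4 processor rows gives blocks of 3 rows, while 12 rows divide evenly into blocks of 3.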

The matrix–vector product using the column partitioning is highly parallel. Since there is no broadcast operation involved, as soon as a processor on the boundary of the grid (either rows or columns) has computed and sent a $w_p$ vector destined to a processor ‘A’, it can compute and (possibly) send a $w_p$ vector to processor ‘B’, at which time its neighboring processors may also have started computing and sending their own $w$ vectors to processor ‘B’.

At a given point in the matrix–vector product computation, the processors are computing $w$ vectors destined to processor A. When these vectors have been accumulated in the row of that processor (step 1), the processors in the top and bottom rows compute and send the $w$ vectors for processor B, while the processors on the left and right columns of the row of processor A send the accumulated $r$ vectors to processor A (step 2). Processor A now stores its set of the resulting $v$ vector (which is the accumulation of the $w$ vectors). In step 3, the processors in the bottom row compute and send the $w$ vectors for processor C while the processor at the left-hand end of the row of processor B sends the accumulated $w$ vectors of that column towards processor B. The next steps are similar to the above.

In our implementation, we exploit the geometry associated with the regular grid of points used to approximate the PDE. A geometric partitioning is used to match the topology and connectivity present in the grid of transputers (Section 16.1).

The discretization of the PDE is obtained by specifying a grid size $l$ defining an associated grid of $N = l^{2}$ interior points (note that this is the order of the linear system to be solved). With each interior point, we associate a set of values, namely the coefficients $C$, $N$, $S$, $E$ and $W$.

19 Conclusion

We have shown that an iterative method such as the preconditioned conjugate-gradients method may be successfully parallelized by using highly efficient parallel implementations of the linear algebra operations involved. We have used the same approach to parallelize other iterative methods with similar degrees of efficiency (see [kn:daCunha92a] and [kn:daCunha92b]).

Acknowledgements

We would like to thank the referees for their comments, which helped improve this paper considerably.