This note describes some operations on arrays in R. These operations have been implemented to facilitate implementation of graphical models and Bayesian networks in R.
The documentation of R states the following about arrays: An array in R can have one, two or more dimensions. It is simply a vector which is stored with additional attributes giving the dimensions (attribute “dim”) and optionally names for those dimensions (attribute “dimnames”). A two-dimensional array is the same thing as a matrix. One-dimensional arrays often look like vectors, but may be handled differently by some functions.
Arrays appear for example in connection with cross classified data. The array hec below is an excerpt of the HairEyeColor array in R:
hec <- c(32, 53, 11, 50, 10, 25, 36, 66, 9, 34, 5, 29)
dim(hec) <- c(2, 3, 2)
dimnames(hec) <- list(Hair = c("Black", "Brown"),
Eye = c("Brown", "Blue", "Hazel"),
Sex = c("Male", "Female"))
hec
#> , , Sex = Male
#>
#> Eye
#> Hair Brown Blue Hazel
#> Black 32 11 10
#> Brown 53 50 25
#>
#> , , Sex = Female
#>
#> Eye
#> Hair Brown Blue Hazel
#> Black 36 9 5
#> Brown 66 34 29
Above, hec is an array because it has a dim attribute. Moreover, also has a dimnames attribute naming the levels of each dimension. Notice that each dimension is given a name.
Printing arrays can take up a lot of space. Alternative views on an array can be obtained with ftable() or by converting the array to a dataframe with as.data.frame.table(). We shall do so in the following.
##flat <- function(x) {ftable(x, row.vars=1)}
flat <- function(x, n=4) {as.data.frame.table(x) |> head(n)}
hec |> flat()
#> Hair Eye Sex Freq
#> 1 Black Brown Male 32
#> 2 Brown Brown Male 53
#> 3 Black Blue Male 11
#> 4 Brown Blue Male 50
An array with named dimensions is in this package called a named array. The functionality described below relies heavily on arrays having named dimensions. A check for an object being a named array is provided by is_named_array()
Another way is to use tab_new() from gRbase. This function is flexible wrt the input; for example:
dn <- list(Hair=c("Black", "Brown"), Eye=~Brown:Blue:Hazel, Sex=~Male:Female)
counts <- c(32, 53, 11, 50, 10, 25, 36, 66, 9, 34, 5, 29)
z3 <- tab_new(~Hair:Eye:Sex, levels=dn, value=counts)
z4 <- tab_new(c("Hair", "Eye", "Sex"), levels=dn, values=counts)
Default dimnames are generated with
z5 <- tab_new(~Hair:Eye:Sex, levels=c(2, 3, 2), values = counts)
dimnames(z5) |> str()
#> List of 3
#> $ Hair: chr [1:2] "1" "2"
#> $ Eye : chr [1:3] "1" "2" "3"
#> $ Sex : chr [1:2] "1" "2"
Using tab_new, arrays can be normalized to sum to one in two ways: 1) Normalization can be over the first variable for each configuration of all other variables and 2) over all configurations. For example:
In the following we shall denote the dimnames (or variables) of the array by H, E and S and we let (h, e, s) denote a configuration of these variables. The contingency table above shall be denoted by THES and we shall refer to the (h, e, s)-entry of THES as THES(h, e, s).
Normalize an array with Entries of an array can be normalized to sum to one in two ways: 1) Normalization can be over the first variable for each configuration of all other variables and 2) over all configurations. For example:
We can subset arrays (this will also be called ``slicing’’) in different ways. Notice that the result is not necessarily an array. Slicing can be done using standard R code or using . The virtue of tab_slice() comes from the flexibility when specifying the slice:
The following leads from the original 2 × 3 × 2 array to a 2 × 2 array by cutting away the Sex=Male and Eye=Brown slice of the array:
tab_slice(hec, slice=list(Eye=c("Blue", "Hazel"), Sex="Female"))
#> Eye
#> Hair Blue Hazel
#> Black 9 5
#> Brown 34 29
## Notice: levels can be written as numerics
## tab_slice(hec, slice=list(Eye=2:3, Sex="Female"))
We may also regard the result above as a 2 × 2 × 1 array:
tab_slice(hec, slice=list(Eye=c("Blue", "Hazel"), Sex="Female"), drop=FALSE)
#> , , Sex = Female
#>
#> Eye
#> Hair Blue Hazel
#> Black 9 5
#> Brown 34 29
If slicing leads to a one dimensional array, the output will by default not be an array but a vector (without a dim attribute). However, the result can be forced to be a 1-dimensional array:
## A vector:
t1 <- tab_slice(hec, slice=list(Hair=1, Sex="Female")); t1
#> Brown Blue Hazel
#> 36 9 5
## A 1-dimensional array:
t2 <- tab_slice(hec, slice=list(Hair=1, Sex="Female"), as.array=TRUE); t2
#> Eye
#> Brown Blue Hazel
#> 36 9 5
## A higher dimensional array (in which some dimensions only have one level)
t3 <- tab_slice(hec, slice=list(Hair=1, Sex="Female"), drop=FALSE); t3
#> , , Sex = Female
#>
#> Eye
#> Hair Brown Blue Hazel
#> Black 36 9 5
The difference between the last two forms can be clarified:
Collapsing: The HE–marginal array THE of THES is the array with values Inflating: The ``opposite’’ operation is to extend an array. For example, we can extend THE to have a third dimension, e.g. . That is so T̃SHE(s, h, e) is constant as a function of s.
With gRbase we can collapse arrays with:
he <- tab_marg(hec, c("Hair", "Eye"))
he
#> Eye
#> Hair Brown Blue Hazel
#> Black 68 20 15
#> Brown 119 84 54
## Alternatives
tab_marg(hec, ~Hair:Eye)
#> Eye
#> Hair Brown Blue Hazel
#> Black 68 20 15
#> Brown 119 84 54
tab_marg(hec, c(1, 2))
#> Eye
#> Hair Brown Blue Hazel
#> Black 68 20 15
#> Brown 119 84 54
hec %a_% ~Hair:Eye
#> Eye
#> Hair Brown Blue Hazel
#> Black 68 20 15
#> Brown 119 84 54
Notice that collapsing is a projection in the sense that applying the operation again does not change anything (except possibly changing the order of variables):
he1 <- tab_marg(hec, c("Hair", "Eye"))
he2 <- tab_marg(he1, c("Eye", "Hair"))
tab_equal(he1, he2)
#> [1] TRUE
Expand an array by adding additional dimensions with tab_expand():
extra.dim <- list(Sex=c("Male", "Female"))
tab_expand(he, extra.dim)
#> , , Sex = Male
#>
#> Hair
#> Eye Black Brown
#> Brown 68 119
#> Blue 20 84
#> Hazel 15 54
#>
#> , , Sex = Female
#>
#> Hair
#> Eye Black Brown
#> Brown 68 119
#> Blue 20 84
#> Hazel 15 54
## Alternatives
he %a^% extra.dim
#> , , Sex = Male
#>
#> Hair
#> Eye Black Brown
#> Brown 68 119
#> Blue 20 84
#> Hazel 15 54
#>
#> , , Sex = Female
#>
#> Hair
#> Eye Black Brown
#> Brown 68 119
#> Blue 20 84
#> Hazel 15 54
Notice that expanding and collapsing brings us back to where we started:
A reorganization of the table can be made with tab_perm() (similar to aperm()), but tab_perm() allows for a formula and for variable abbreviation:
tab_perm(hec, ~Eye:Sex:Hair) |> flat()
#> Eye Sex Hair Freq
#> 1 Brown Male Black 32
#> 2 Blue Male Black 11
#> 3 Hazel Male Black 10
#> 4 Brown Female Black 36
Alternative forms (the first two also works for ):
tab_perm(hec, c("Eye", "Sex", "Hair"))
#> , , Hair = Black
#>
#> Sex
#> Eye Male Female
#> Brown 32 36
#> Blue 11 9
#> Hazel 10 5
#>
#> , , Hair = Brown
#>
#> Sex
#> Eye Male Female
#> Brown 53 66
#> Blue 50 34
#> Hazel 25 29
tab_perm(hec, c(2, 3, 1))
#> , , Hair = Black
#>
#> Sex
#> Eye Male Female
#> Brown 32 36
#> Blue 11 9
#> Hazel 10 5
#>
#> , , Hair = Brown
#>
#> Sex
#> Eye Male Female
#> Brown 53 66
#> Blue 50 34
#> Hazel 25 29
tab_perm(hec, ~Ey:Se:Ha)
#> , , Hair = Black
#>
#> Sex
#> Eye Male Female
#> Brown 32 36
#> Blue 11 9
#> Hazel 10 5
#>
#> , , Hair = Brown
#>
#> Sex
#> Eye Male Female
#> Brown 53 66
#> Blue 50 34
#> Hazel 25 29
tab_perm(hec, c("Ey", "Se", "Ha"))
#> , , Hair = Black
#>
#> Sex
#> Eye Male Female
#> Brown 32 36
#> Blue 11 9
#> Hazel 10 5
#>
#> , , Hair = Brown
#>
#> Sex
#> Eye Male Female
#> Brown 53 66
#> Blue 50 34
#> Hazel 25 29
Two arrays are defined to be identical 1) if they have the same dimnames and 2) if, possibly after a permutation, all values are identical (up to a small numerical difference):
We can align one array according to the ordering of another:
The product of two arrays THE and THS is defined to be the array T̃HES with entries T̃HES(h, e, s) = THE(h, e) + THS(h, s)
The sum, difference and quotient is defined similarly: This is done with tab_rod(), tab_add(), tab_diff() and tab_div():
hs <- tab_marg(hec, ~Hair:Eye)
tab_mult(he, hs)
#> Eye
#> Hair Brown Blue Hazel
#> Black 4624 400 225
#> Brown 14161 7056 2916
Available operations:
tab_add(he, hs)
#> Eye
#> Hair Brown Blue Hazel
#> Black 136 40 30
#> Brown 238 168 108
tab_subt(he, hs)
#> Eye
#> Hair Brown Blue Hazel
#> Black 0 0 0
#> Brown 0 0 0
tab_mult(he, hs)
#> Eye
#> Hair Brown Blue Hazel
#> Black 4624 400 225
#> Brown 14161 7056 2916
tab_div(he, hs)
#> Eye
#> Hair Brown Blue Hazel
#> Black 1 1 1
#> Brown 1 1 1
tab_div0(he, hs) ## Convention 0/0 = 0
#> Eye
#> Hair Brown Blue Hazel
#> Black 1 1 1
#> Brown 1 1 1
Shortcuts:
## Alternative
he %a+% hs
#> Eye
#> Hair Brown Blue Hazel
#> Black 136 40 30
#> Brown 238 168 108
he %a-% hs
#> Eye
#> Hair Brown Blue Hazel
#> Black 0 0 0
#> Brown 0 0 0
he %a*% hs
#> Eye
#> Hair Brown Blue Hazel
#> Black 4624 400 225
#> Brown 14161 7056 2916
he %a/% hs
#> Eye
#> Hair Brown Blue Hazel
#> Black 1 1 1
#> Brown 1 1 1
he %a/0% hs ## Convention 0/0 = 0
#> Eye
#> Hair Brown Blue Hazel
#> Black 1 1 1
#> Brown 1 1 1
Multiplication and addition of (a list of) multiple arrays is accomplished with tab_prod() and tab_sum() (much like prod() and sum()):
es <- tab_marg(hec, ~Eye:Sex)
tab_sum(he, hs, es)
#> , , Sex = Male
#>
#> Eye
#> Hair Brown Blue Hazel
#> Black 221 101 65
#> Brown 323 229 143
#>
#> , , Sex = Female
#>
#> Eye
#> Hair Brown Blue Hazel
#> Black 238 83 64
#> Brown 340 211 142
## tab_sum(list(he, hs, es))
Lists of arrays are processed with
tab_list_add(list(he, hs, es))
#> , , Sex = Male
#>
#> Eye
#> Hair Brown Blue Hazel
#> Black 221 101 65
#> Brown 323 229 143
#>
#> , , Sex = Female
#>
#> Eye
#> Hair Brown Blue Hazel
#> Black 238 83 64
#> Brown 340 211 142
tab_list_mult(list(he, hs, es))
#> , , Sex = Male
#>
#> Eye
#> Hair Brown Blue Hazel
#> Black 393040 24400 7875
#> Brown 1203685 430416 102060
#>
#> , , Sex = Female
#>
#> Eye
#> Hair Brown Blue Hazel
#> Black 471648 17200 7650
#> Brown 1444422 303408 99144
If an array consists of non–negative numbers then it may be regarded as an (unnormalized) discrete multivariate density. With this view, the following examples should be self explanatory:
tabDist(hec, marg=~Hair:Eye)
#> Eye
#> Hair Brown Blue Hazel
#> Black 0.189 0.0556 0.0417
#> Brown 0.331 0.2333 0.1500
tabDist(hec, cond=~Sex)
#> , , Sex = Male
#>
#> Eye
#> Hair Brown Blue Hazel
#> Black 0.177 0.0608 0.0552
#> Brown 0.293 0.2762 0.1381
#>
#> , , Sex = Female
#>
#> Eye
#> Hair Brown Blue Hazel
#> Black 0.201 0.0503 0.0279
#> Brown 0.369 0.1899 0.1620
tabDist(hec, marg=~Hair, cond=~Sex)
#> Sex
#> Hair Male Female
#> Black 0.293 0.279
#> Brown 0.707 0.721
We specify conditional probabilities p(r), p(s|r) and p(w|s, r) as follows (notice that the vertical conditioning bar (|) is indicated by the horizontal underscore:
yn <- c("y","n")
lev <- list(rain=yn, sprinkler=yn, wet=yn)
r <- tab_new(~rain, levels=lev, values=c(.2, .8))
s_r <- tab_new(~sprinkler:rain, levels = lev, values = c(.01, .99, .4, .6))
w_sr <- tab_new( ~wet:sprinkler:rain, levels=lev,
values=c(.99, .01, .8, .2, .9, .1, 0, 1))
r
#> rain
#> y n
#> 0.2 0.8
s_r |> flat()
#> sprinkler rain Freq
#> 1 y y 0.01
#> 2 n y 0.99
#> 3 y n 0.40
#> 4 n n 0.60
w_sr |> flat()
#> wet sprinkler rain Freq
#> 1 y y y 0.99
#> 2 n y y 0.01
#> 3 y n y 0.80
#> 4 n n y 0.20
The joint distribution p(r, s, w) = p(r)p(s|r)p(w|s, r) can be obtained with tab_prod():
joint <- tab_prod(r, s_r, w_sr); joint |> flat()
#> wet sprinkler rain Freq
#> 1 y y y 0.00198
#> 2 n y y 0.00002
#> 3 y n y 0.15840
#> 4 n n y 0.03960
What is the probability that it rains given that the grass is wet? We find p(r, w) = ∑sp(r, s, w) and then p(r|w) = p(r, w)/p(w). Can be done in various ways: with tabDist():
We consider the 3–way data from :
data(lizard, package="gRbase")
lizard |> flat()
#> diam height species Freq
#> 1 <=4 >4.75 anoli 32
#> 2 >4 >4.75 anoli 11
#> 3 <=4 <=4.75 anoli 86
#> 4 >4 <=4.75 anoli 35
where m(d, h) = ∑sm(d, h.s) etc. The updates are as follows: For the first term we have
$$ m(d,h,s) \leftarrow m(d,h,s) \frac{n(d,h)}{m(d,h)} % , \mbox{ where } % m(d,h) = \sum_s m(d,h,s) $$
After iterating the updates will not change and we will have equality: $ m(d,h,s) = m(d,h,s) $ and summing over s shows that the equation m(d, h) = n(d, h) is satisfied.
A rudimentary implementation of iterative proportional scaling for log–linear models is straight forward:
myips <- function(indata, glist){
fit <- indata
fit[] <- 1
## List of sufficient marginal tables
md <- lapply(glist, function(g) tab_marg(indata, g))
n_iter <- 4
n_generators <- length(glist)
for (i in 1:n_iter){
for (j in 1:n_generators){
mf <- tab_marg(fit, glist[[j]])
# adj <- tab_div( md[[ j ]], mf)
# fit <- tab_mult( fit, adj )
## or
adj <- md[[ j ]] %a/% mf
fit <- fit %a*% adj
}
}
pearson <- sum((fit - indata)^2 / fit)
list(pearson=pearson, fit=fit)
}
glist <- list(c("species", "diam"),c("species", "height"),c("diam", "height"))
fm1 <- myips(lizard, glist)
fm1$pearson
#> [1] 665
fm1$fit |> flat()
#> species diam height Freq
#> 1 anoli <=4 >4.75 32.8
#> 2 dist <=4 >4.75 60.2
#> 3 anoli >4 >4.75 10.2
#> 4 dist >4 >4.75 41.8
fm2 <- loglin(lizard, glist, fit=T)
#> 4 iterations: deviation 0.00962
fm2$pearson
#> [1] 0.151
fm2$fit |> flat()
#> diam height species Freq
#> 1 <=4 >4.75 anoli 32.8
#> 2 >4 >4.75 anoli 10.2
#> 3 <=4 <=4.75 anoli 85.2
#> 4 >4 <=4.75 anoli 35.8
For e.g. a 2 × 3 × 2 array, the entries are such that the first variable varies fastest so the ordering of the cells are (1, 1, 1), (2, 1, 1), (1, 2, 1), (2, 2, 1), (1, 3, 1) and so on. To find the value of such a cell, say, (j, k, l) in the array (which is really just a vector), the cell is mapped into an entry of a vector.
For example, cell (2, 3, 1) (Hair=Brown, Eye=Hazel, Sex=Male) must be mapped to entry 4 in
hec
#> , , Sex = Male
#>
#> Eye
#> Hair Brown Blue Hazel
#> Black 32 11 10
#> Brown 53 50 25
#>
#> , , Sex = Female
#>
#> Eye
#> Hair Brown Blue Hazel
#> Black 36 9 5
#> Brown 66 34 29
c(hec)
#> [1] 32 53 11 50 10 25 36 66 9 34 5 29
For illustration we do:
cell2name <- function(cell, dimnames) {
unlist(lapply(1:length(cell),
function(m){
dimnames[[m]][cell[m]]
}))}
cell2name(c(2,3,1), dimnames(hec))
#> [1] "Brown" "Hazel" "Male"
The map from a cell to the corresponding entry is provided by . The reverse operation, going from an entry to a cell (which is much less needed) is provided by .
Given a cell, say i = (2, 3, 1) in a 2 × 3 × 2 array we often want to find the next cell in the table following the convention that the first factor varies fastest, that is (1, 1, 2). This is provided by next_cell().
Given that we look at cells for which for which the index in dimension 2 is at level 3 (that is Eye=Hazel), i.e. cells of the form (j, 3, l). Given such a cell, what is then the next cell that also satisfies this constraint. This is provided by next_cell_slice().
next_cell_slice(c(1,3,1), slice_marg=2, dim=c( 2, 3, 2 ))
#> [1] 2 3 1
next_cell_slice(c(2,3,1), slice_marg=2, dim=c( 2, 3, 2 ))
#> [1] 1 3 2
Given that in dimension 2 we look at level 3. We want to find entries for the cells of the form (j, 3, l).
To verify that we indeed get the right cells:
r <- slice2entry(slice_cell=3, slice_marg=2, dim=c( 2, 3, 2 ))
lapply(lapply(r, entry2cell, c( 2, 3, 2 )),
cell2name, dimnames(hec))
#> [[1]]
#> [1] "Black" "Hazel" "Male"
#>
#> [[2]]
#> [1] "Brown" "Hazel" "Male"
#>
#> [[3]]
#> [1] "Black" "Hazel" "Female"
#>
#> [[4]]
#> [1] "Brown" "Hazel" "Female"
Using the operations above we can obtain the combinations of the factors as a matrix:
head( fact_grid( c(2, 3, 2) ), 6 )
#> [,1] [,2] [,3]
#> [1,] 1 1 1
#> [2,] 2 1 1
#> [3,] 1 2 1
#> [4,] 2 2 1
#> [5,] 1 3 1
#> [6,] 2 3 1
A similar dataframe can also be obtained with the standard R function (but is faster)
Slicing using standard R code can be done as follows:
hec[, 2:3, ] |> flat() ## A 2 x 2 x 2 array
#> Hair Eye Sex Freq
#> 1 Black Blue Male 11
#> 2 Brown Blue Male 50
#> 3 Black Hazel Male 10
#> 4 Brown Hazel Male 25
hec[1, , 1] ## A vector
#> Brown Blue Hazel
#> 32 11 10
hec[1, , 1, drop=FALSE] ## A 1 x 3 x 1 array
#> , , Sex = Male
#>
#> Eye
#> Hair Brown Blue Hazel
#> Black 32 11 10
Programmatically we can do the above as
do.call("[", c(list(hec), list(TRUE, 2:3, TRUE))) |> flat()
#> Hair Eye Sex Freq
#> 1 Black Blue Male 11
#> 2 Brown Blue Male 50
#> 3 Black Hazel Male 10
#> 4 Brown Hazel Male 25
do.call("[", c(list(hec), list(1, TRUE, 1)))
#> Brown Blue Hazel
#> 32 11 10
do.call("[", c(list(hec), list(1, TRUE, 1), drop=FALSE))
#> , , Sex = Male
#>
#> Eye
#> Hair Brown Blue Hazel
#> Black 32 11 10
provides two alterntives for each of these three cases above:
tab_slice_prim(hec, slice=list(TRUE, 2:3, TRUE)) |> flat()
#> Hair Eye Sex Freq
#> 1 Black Blue Male 11
#> 2 Brown Blue Male 50
#> 3 Black Hazel Male 10
#> 4 Brown Hazel Male 25
tab_slice(hec, slice=list(c(2, 3)), margin=2) |> flat()
#> Hair Eye Sex Freq
#> 1 Black Blue Male 11
#> 2 Brown Blue Male 50
#> 3 Black Hazel Male 10
#> 4 Brown Hazel Male 25
tab_slice_prim(hec, slice=list(1, TRUE, 1))
#> Brown Blue Hazel
#> 32 11 10
tab_slice(hec, slice=list(1, 1), margin=c(1, 3))
#> Brown Blue Hazel
#> 32 11 10
tab_slice_prim(hec, slice=list(1, TRUE, 1), drop=FALSE)
#> , , Sex = Male
#>
#> Eye
#> Hair Brown Blue Hazel
#> Black 32 11 10
tab_slice(hec, slice=list(1, 1), margin=c(1, 3), drop=FALSE)
#> , , Sex = Male
#>
#> Eye
#> Hair Brown Blue Hazel
#> Black 32 11 10