| Title: | Groupwise Statistics, LSmeans, Linear Estimates, Utilities |
|---|---|
| Description: | Utility package containing: Main categories: Working with grouped data: 'do' something to data when stratified 'by' some variables. General linear estimates. Data handling utilities. Functional programming, in particular restrict functions to a smaller domain. Miscellaneous functions for data handling. Model stability in connection with model selection. Miscellaneous other tools. |
| Authors: | Ulrich Halekoh [aut, cph], Søren Højsgaard [aut, cre, cph] |
| Maintainer: | Søren Højsgaard <[email protected]> |
| License: | GPL (>= 2) |
| Version: | 4.7.2 |
| Built: | 2026-05-24 08:52:21 UTC |
| Source: | https://github.com/hojsgaard/doby |
Convert right hand sided formula to a list
.rhsf2list(f).rhsf2list(f)
f |
A right hand sided formula |
Add interaction columns to data frame
add_int(.data, .formula)add_int(.data, .formula)
.data |
dataframe |
.formula |
right hand sided formula |
dataframe
Søren Højsgaard
Add predicted values of different types to dataframe
add_pred(data, model, var = "pred", type = NULL, transformation = NULL)add_pred(data, model, var = "pred", type = NULL, transformation = NULL)
data |
dataframe or tibble |
model |
model object |
var |
name of new variable in dataframe / tibble |
type |
type of predicted value |
transformation |
A possible transformation of predicted variable, e.g. reciprocal(), log() etc |
dataframe / tibble
Søren Højsgaard
data(cars) lm1 <- lm(dist ~ speed + I(speed^2), data=cars) lm1 |> response() |> head() cars <- cars |> add_pred(lm1) cars |> head() cars <- cars |> add_resid(lm1) carsdata(cars) lm1 <- lm(dist ~ speed + I(speed^2), data=cars) lm1 |> response() |> head() cars <- cars |> add_pred(lm1) cars |> head() cars <- cars |> add_resid(lm1) cars
Add residuals of different types to dataframe
add_resid(data, model, var = "resid", type)add_resid(data, model, var = "resid", type)
data |
dataframe or tibble |
model |
model object |
var |
name of new variable in dataframe / tibble |
type |
type of residual value |
dataframe / tibble
Søren Højsgaard
data(cars) lm1 <- lm(dist ~ speed + I(speed^2), data=cars) lm1 |> response() |> head() cars <- cars |> add_pred(lm1) cars |> head() cars <- cars |> add_resid(lm1) carsdata(cars) lm1 <- lm(dist ~ speed + I(speed^2), data=cars) lm1 |> response() |> head() cars <- cars |> add_pred(lm1) cars |> head() cars <- cars |> add_resid(lm1) cars
Extracts and aligns the fixed-effect estimates from a list of fitted model objects,
returning them in a single tidy data frame with consistent columns for easy comparison.
Works with a mix of model types such as lm, glm, gls, lmer, etc.
For models without p-values (e.g., lmer), the function computes approximate
Wald statistics and two-sided normal p-values.
align_coefs(models)align_coefs(models)
models |
A named list of fitted model objects. Each element should be a
model that can be passed to |
A tibble with columns:
The name of the model (from the list).
The term name (coefficient).
The estimated coefficient.
The standard error.
The Wald statistic (estimate / std.error).
Two-sided normal p-value.
# Example using the built-in CO2 dataset data(CO2) # Fit models lm_fit <- lm(uptake ~ conc + Type + Treatment, data = CO2) glm_fit <- glm(uptake ~ conc + Type + Treatment, family = Gamma(identity), data = CO2) # Combine estimates models_list <- list(lm = lm_fit, glm = glm_fit) result <- align_coefs(models_list) print(result)# Example using the built-in CO2 dataset data(CO2) # Fit models lm_fit <- lm(uptake ~ conc + Type + Treatment, data = CO2) glm_fit <- glm(uptake ~ conc + Type + Treatment, family = Gamma(identity), data = CO2) # Combine estimates models_list <- list(lm = lm_fit, glm = glm_fit) result <- align_coefs(models_list) print(result)
Yield and sugar percentage in sugar beets from a split plot experiment. Data is obtained from a split plot experiment. There are 3 blocks and in each of these the harvest time defines the "whole plot" and the sowing time defines the "split plot". Each plot was 25 square meters and the yield is recorded in kg. See 'details' for the experimental layout.
beetsbeets
The format is: chr "beets"
Experimental plan
Sowing times 1 4. april
2 12. april
3 21. april
4 29. april
5 18. may
Harvest times 1 2. october
2 21. october
Plot allocation:
Block 1 Block 2 Block 3
+-----------|-----------|-----------+
Plot | 1 1 1 1 1 | 2 2 2 2 2 | 1 1 1 1 1 | Harvest time
1-15 | 3 4 5 2 1 | 3 2 4 5 1 | 5 2 3 4 1 | Sowing time
|-----------|-----------|-----------|
Plot | 2 2 2 2 2 | 1 1 1 1 1 | 2 2 2 2 2 | Harvest time
16-30 | 2 1 5 4 3 | 4 1 3 2 5 | 1 4 3 2 5 | Sowing time
+-----------|-----------|-----------+
Ulrich Halekoh, Søren Højsgaard (2014)., A Kenward-Roger Approximation and Parametric Bootstrap Methods for Tests in Linear Mixed Models - The R Package pbkrtest., Journal of Statistical Software, 58(10), 1-30., https://www.jstatsoft.org/v59/i09/
data(beets) beets$bh <- with(beets, interaction(block, harvest)) summary(aov(yield ~ block + sow + harvest + Error(bh), beets)) summary(aov(sugpct ~ block + sow + harvest + Error(bh), beets))data(beets) beets$bh <- with(beets, interaction(block, harvest)) summary(aov(yield ~ block + sow + harvest + Error(bh), beets)) summary(aov(sugpct ~ block + sow + harvest + Error(bh), beets))
Convert binomial data to bernoulli data by expanding dataset.
binomial_to_bernoulli_data( data., y, size, type = c("rest", "total"), response_name = "response", rest_name = NULL )binomial_to_bernoulli_data( data., y, size, type = c("rest", "total"), response_name = "response", rest_name = NULL )
data. |
A dataframe |
y |
Column with 'successes' in binomial distribution |
size |
Column with 'failures', i.e. size-y or 'total', i.e. size. |
type |
Whether |
response_name |
Name of response variable in output dataset. |
rest_name |
Name of 'failures' in column |
dat <- budworm dat <- dat[dat$dose %in% c(1,2), ] dat$ntotal <- 5 dat dat.a <- dat |> binomial_to_bernoulli_data(ndead, ntotal, type="total") dat.b <- dat |> dplyr::mutate(nalive=ntotal-ndead) |> dplyr::select(-ntotal) |> binomial_to_bernoulli_data(ndead, nalive, type="rest") m0 <- glm(cbind(ndead, ntotal-ndead) ~ dose + sex, data=dat, family=binomial()) m1 <- glm(ndead / ntotal ~ dose + sex, data=dat, weight=ntotal, family=binomial()) ma <- glm(response ~ dose + sex, data=dat.a, family=binomial()) mb <- glm(response ~ dose + sex, data=dat.b, family=binomial()) dat.a$response dat.b$response ## Not same and therefore the following do not match all.equal(coef(m0), coef(ma)) all.equal(coef(m0), coef(mb)) all.equal(coef(m1), coef(ma)) all.equal(coef(m1), coef(mb))dat <- budworm dat <- dat[dat$dose %in% c(1,2), ] dat$ntotal <- 5 dat dat.a <- dat |> binomial_to_bernoulli_data(ndead, ntotal, type="total") dat.b <- dat |> dplyr::mutate(nalive=ntotal-ndead) |> dplyr::select(-ntotal) |> binomial_to_bernoulli_data(ndead, nalive, type="rest") m0 <- glm(cbind(ndead, ntotal-ndead) ~ dose + sex, data=dat, family=binomial()) m1 <- glm(ndead / ntotal ~ dose + sex, data=dat, weight=ntotal, family=binomial()) ma <- glm(response ~ dose + sex, data=dat.a, family=binomial()) mb <- glm(response ~ dose + sex, data=dat.b, family=binomial()) dat.a$response dat.b$response ## Not same and therefore the following do not match all.equal(coef(m0), coef(ma)) all.equal(coef(m0), coef(mb)) all.equal(coef(m1), coef(ma)) all.equal(coef(m1), coef(mb))
Backquote a list of functions
bquote_fun_list(fun_list)bquote_fun_list(fun_list)
fun_list |
List of functions |
base::bquote(), set_default(), section_fun()
## Evaluate a list of functions f1 <- function(x){x + 1} f2 <- function(x){x + 8} f1_ <- set_default(f1, list(x=10)) f2_ <- set_default(f2, list(x=10)) f1_(); f2_() fn_list <- list(f1_, f2_) fn_list_ <- bquote_fun_list(fn_list) eval(fn_list[[1]]) ## No sapply(fn_list, eval) ## No eval(fn_list_[[1]]) ## Yes sapply(fn_list_, eval) ## Yes## Evaluate a list of functions f1 <- function(x){x + 1} f2 <- function(x){x + 8} f1_ <- set_default(f1, list(x=10)) f2_ <- set_default(f2, list(x=10)) f1_(); f2_() fn_list <- list(f1_, f2_) fn_list_ <- bquote_fun_list(fn_list) eval(fn_list[[1]]) ## No sapply(fn_list, eval) ## No eval(fn_list_[[1]]) ## Yes sapply(fn_list_, eval) ## Yes
This function is a wrapper for calling lapply on the list resulting from first calling splitBy.
lapply_by(data, formula, FUN, ...) lapplyBy(formula, data = parent.frame(), FUN, ...) sapply_by(data, formula, FUN, ..., simplify = TRUE, USE.NAMES = TRUE) sapplyBy( formula, data = parent.frame(), FUN, ..., simplify = TRUE, USE.NAMES = TRUE )lapply_by(data, formula, FUN, ...) lapplyBy(formula, data = parent.frame(), FUN, ...) sapply_by(data, formula, FUN, ..., simplify = TRUE, USE.NAMES = TRUE) sapplyBy( formula, data = parent.frame(), FUN, ..., simplify = TRUE, USE.NAMES = TRUE )
data |
A dataframe. |
formula |
A formula describing how data should be split. |
FUN |
A function to be applied to each element in the split list, see 'Examples' below. |
... |
optional arguments to FUN. |
simplify |
Same as for |
USE.NAMES |
Same as for |
A list.
Søren Højsgaard, [email protected]
fun <- function(x) range(x$uptake) lapplyBy(~Treatment + Type, data=CO2, FUN=fun) sapplyBy(~Treatment + Type, data=CO2, FUN=fun) # Same as lapply(splitBy(~Treatment + Type, data=CO2), FUN=fun)fun <- function(x) range(x$uptake) lapplyBy(~Treatment + Type, data=CO2, FUN=fun) sapplyBy(~Treatment + Type, data=CO2, FUN=fun) # Same as lapply(splitBy(~Treatment + Type, data=CO2), FUN=fun)
The data is split into strata according to the levels of the grouping factors and individual lm fits are obtained for each stratum.
lm_by(data., formula., id = NULL, ...) lmBy(formula., data., id = NULL, ...)lm_by(data., formula., id = NULL, ...) lmBy(formula., data., id = NULL, ...)
data. |
A dataframe |
formula. |
A linear model formula object of the form |
id |
A formula describing variables from data which are to be available also in the output. |
... |
Additional arguments passed on to |
A list of lm fits.
Søren Højsgaard, [email protected]
lm_lst <- lmBy(1 / uptake ~ log(conc) | Treatment, data=CO2) coef(lm_lst) fitted(lm_lst) residuals(lm_lst) summary(lm_lst) coef(summary(lm_lst)) coef(summary(lm_lst), simplify=TRUE)lm_lst <- lmBy(1 / uptake ~ log(conc) | Treatment, data=CO2) coef(lm_lst) fitted(lm_lst) residuals(lm_lst) summary(lm_lst) coef(summary(lm_lst)) coef(summary(lm_lst), simplify=TRUE)
Ordering (sorting) rows of a data frame by the certain
variables in the data frame. This function is essentially a
wrapper for the order() function - the important
difference being that variables to order by can be given by a
model formula.
order_by(data, formula) orderBy(formula, data)order_by(data, formula) orderBy(formula, data)
data |
A dataframe |
formula |
The right hand side of a formula |
The sign of the terms in the formula determines whether sorting should be ascending or decreasing; see examples below
The ordered data frame
Søren Højsgaard, [email protected] and Kevin Wright
transformBy, transform_by, splitBy, split_by
orderBy(~ conc + Treatment, CO2) ## Sort decreasingly by conc orderBy(~ - conc + Treatment, CO2) ## Same as: order_by(CO2, c("conc", "Treatment")) order_by(CO2, c("-conc", "Treatment"))orderBy(~ conc + Treatment, CO2) ## Sort decreasingly by conc orderBy(~ - conc + Treatment, CO2) ## Same as: order_by(CO2, c("conc", "Treatment")) order_by(CO2, c("-conc", "Treatment"))
A data frame is split according to some variables in a formula, and a sample of a certain fraction of each is drawn.
sample_by(data, formula, frac = 0.1, replace = FALSE, systematic = FALSE) sampleBy( formula, frac = 0.1, replace = FALSE, data = parent.frame(), systematic = FALSE )sample_by(data, formula, frac = 0.1, replace = FALSE, systematic = FALSE) sampleBy( formula, frac = 0.1, replace = FALSE, data = parent.frame(), systematic = FALSE )
data |
A data frame. |
formula |
A formula defining the grouping of the data frame. |
frac |
The part of data to be sampled. |
replace |
Is the sampling with replacement. |
systematic |
Should sampling be systematic. |
If systematic=FALSE (default) then frac gives the fraction of data sampled. If systematic=TRUE and frac=.2 then every 1/.2 i.e. every 5th observation is taken out.
A dataframe.
orderBy, order_by,
splitBy, split_by,
summaryBy, summary_by,
transformBy, transform_by
data(dietox) sampleBy(formula = ~ Evit + Cu, frac=.1, data = dietox)data(dietox) sampleBy(formula = ~ Evit + Cu, frac=.1, data = dietox)
Split a dataframe according to the levels of variables in the dataframe. Uses vparse() to interpret flexible input.
split_by(data., ..., omit = TRUE) splitBy(formula, data, omit = TRUE) ## S3 method for class 'splitByData' head(x, n = 6L, ...) ## S3 method for class 'splitByData' tail(x, n = 6L, ...) split_by.legacy(data, formula, drop = TRUE) splitBy.legacy(formula, data = parent.frame(), drop = TRUE)split_by(data., ..., omit = TRUE) splitBy(formula, data, omit = TRUE) ## S3 method for class 'splitByData' head(x, n = 6L, ...) ## S3 method for class 'splitByData' tail(x, n = 6L, ...) split_by.legacy(data, formula, drop = TRUE) splitBy.legacy(formula, data = parent.frame(), drop = TRUE)
data. |
A data frame (or tibble) to split |
... |
Variables defining the groups |
omit |
If TRUE (default), group-defining variables are omitted in each split group |
formula |
A right hand sided formula (for the old interface) |
data |
A data frame (for the old interface) |
x |
A splitByData object. |
n |
An integer vector. |
drop |
Obsolete |
An object of class \"splitByData\" (a named list with group attributes)
Søren Højsgaard, [email protected]
orderBy, order_by,
summaryBy, summary_by,
transformBy, transform_by
split_by(CO2, ~Treatment+Type) split_by(CO2, Treatment, Type) split_by(CO2, c("Treatment", "Type")) split_by(CO2, "Treatment", "Type") x <- split_by(CO2, "Treatment", "Type") head(x, 3) tail(x, 3) ## Via wrapper: foo2 <- function(x) { x <- rlang::enquo(x) split_by(CO2, !!x) } foo2(~Treatment) ## The "Old" interface splitBy(~Treatment + Type, CO2) splitBy(~Treatment + Type, data=CO2) splitBy(c("Treatment", "Type"), data=CO2)split_by(CO2, ~Treatment+Type) split_by(CO2, Treatment, Type) split_by(CO2, c("Treatment", "Type")) split_by(CO2, "Treatment", "Type") x <- split_by(CO2, "Treatment", "Type") head(x, 3) tail(x, 3) ## Via wrapper: foo2 <- function(x) { x <- rlang::enquo(x) split_by(CO2, !!x) } foo2(~Treatment) ## The "Old" interface splitBy(~Treatment + Type, CO2) splitBy(~Treatment + Type, data=CO2) splitBy(c("Treatment", "Type"), data=CO2)
A data frame is split by a formula into groups. Then subsets are found within each group, and the result is collected into a data frame.
subset_by(data, formula, subset, select, drop = FALSE, join = TRUE, ...) subsetBy( formula, subset, data = parent.frame(), select, drop = FALSE, join = TRUE, ... )subset_by(data, formula, subset, select, drop = FALSE, join = TRUE, ...) subsetBy( formula, subset, data = parent.frame(), select, drop = FALSE, join = TRUE, ... )
data |
A data frame. |
formula |
A right hand sided formula or a character vector of variables to split by. |
subset |
logical expression indicating elements or rows to keep: missing values are taken as false. |
select |
expression, indicating columns to select from a data frame. |
drop |
passed on to |
join |
If FALSE the result is a list of data frames (as defined by 'formula'); if TRUE one data frame is returned. |
... |
further arguments to be passed to or from other methods. |
A data frame.
Søren Højsgaard, [email protected]
data(dietox) subsetBy(~Evit, Weight < mean(Weight), data=dietox)data(dietox) subsetBy(~Evit, Weight < mean(Weight), data=dietox)
Computes summary statistics by groups, similar to the summary procedure in SAS.
A more flexible alternative to base R's aggregate.
summary_by( data, formula, id = NULL, FUN = mean, keep.names = FALSE, p2d = FALSE, order = TRUE, full.dimension = FALSE, var.names = NULL, fun.names = NULL, ... ) summaryBy( formula, data = parent.frame(), id = NULL, FUN = mean, keep.names = FALSE, p2d = FALSE, order = TRUE, full.dimension = FALSE, var.names = NULL, fun.names = NULL, ... )summary_by( data, formula, id = NULL, FUN = mean, keep.names = FALSE, p2d = FALSE, order = TRUE, full.dimension = FALSE, var.names = NULL, fun.names = NULL, ... ) summaryBy( formula, data = parent.frame(), id = NULL, FUN = mean, keep.names = FALSE, p2d = FALSE, order = TRUE, full.dimension = FALSE, var.names = NULL, fun.names = NULL, ... )
data |
A data frame. |
formula |
A formula specifying response and grouping variables. |
id |
A formula indicating variables to retain (not grouped by). |
FUN |
A function or list of functions to apply to the response variables. |
keep.names |
Logical; keep original variable names if only one function is applied. |
p2d |
Replace parentheses in output names with dots? |
order |
Logical; should result be ordered by grouping variables? |
full.dimension |
Logical; if TRUE, repeat rows so output matches input size. |
var.names |
Optional custom names for response variables. |
fun.names |
Optional custom names for functions applied. |
... |
Additional arguments passed to functions in |
Extra arguments in ... are passed to all functions in FUN. If needed, wrap functions to handle these consistently (e.g., for na.rm = TRUE).
A data frame of grouped summary statistics.
Søren Højsgaard, [email protected]
aggregate, orderBy, transformBy, splitBy
data(CO2) # Simple groupwise mean summaryBy(uptake ~ Type + Treatment, data = CO2, FUN = mean) summaryBy(cbind(uptake, conc) ~ Type + Treatment, data = CO2, FUN = mean) # Compare with aggregate(cbind(uptake, conc) ~ Type + Treatment, data = CO2, FUN = mean) ## Using '.' on the right hand side of a formula means to stratify by ## all variables not used elsewhere: summaryBy(uptake ~ ., data = CO2, FUN = mean) # Multiple functions using a custom summary function myfun <- function(x, ...) c(m = mean(x, na.rm = TRUE), v = var(x, na.rm = TRUE), n = length(x)) summaryBy(uptake ~ Type + Treatment, data = CO2, FUN = myfun) # Summary on transformed variables # works: summaryBy(cbind(lu=log(uptake), conc) ~ Type, data = CO2, FUN = mean) # fails: #summaryBy(cbind(log(uptake), conc) ~ Type, data = CO2, FUN = mean)data(CO2) # Simple groupwise mean summaryBy(uptake ~ Type + Treatment, data = CO2, FUN = mean) summaryBy(cbind(uptake, conc) ~ Type + Treatment, data = CO2, FUN = mean) # Compare with aggregate(cbind(uptake, conc) ~ Type + Treatment, data = CO2, FUN = mean) ## Using '.' on the right hand side of a formula means to stratify by ## all variables not used elsewhere: summaryBy(uptake ~ ., data = CO2, FUN = mean) # Multiple functions using a custom summary function myfun <- function(x, ...) c(m = mean(x, na.rm = TRUE), v = var(x, na.rm = TRUE), n = length(x)) summaryBy(uptake ~ Type + Treatment, data = CO2, FUN = myfun) # Summary on transformed variables # works: summaryBy(cbind(lu=log(uptake), conc) ~ Type, data = CO2, FUN = mean) # fails: #summaryBy(cbind(log(uptake), conc) ~ Type, data = CO2, FUN = mean)
Function to make groupwise transformations of data by applying the transform function to subsets of data.
transform_by(data, formula, ...) transformBy(formula, data, ...)transform_by(data, formula, ...) transformBy(formula, data, ...)
data |
A data frame |
formula |
A formula with only a right hand side, see examples below |
... |
Further arguments of the form tag=value |
The ... arguments are tagged vector expressions, which are evaluated in the data frame data. The tags are matched against names(data), and for those that match, the value replace the corresponding variable in data, and the others are appended to data.
The modified value of the dataframe data.
Søren Højsgaard, [email protected]
orderBy, order_by, summaryBy, summary_by,
splitBy, split_by
data(dietox) transformBy(~Pig, data=dietox, minW=min(Weight), maxW=max(Weight), gain=diff(range(Weight)))data(dietox) transformBy(~Pig, data=dietox, minW=min(Weight), maxW=max(Weight), gain=diff(range(Weight)))
Measurement of lean meat percentage of 344 pig carcasses together with auxiliary information collected at three Danish slaughter houses
carcasscarcass
carcassall: A data frame with 344 observations on the following 17 variables.
weightWeight of carcass
lengthcLength of carcass from back toe to head (when the carcass hangs in the back legs)
lengthfLength of carcass from back toe to front leg (that is, to the shoulder)
lengthpLength of carcass from back toe to the pelvic bone
Fat02, Fat03, Fat11, Fat12, Fat13, Fat14, Fat16Thickness of fat layer at different locations on the back of the carcass (FatXX refers to thickness at (or rather next to) rib no. XX. Notice that 02 is closest to the head
Meat11, Meat12, Meat13Thickness of meat layer at different locations on the back of the carcass, see description above
LeanMeatLean meat percentage determined by dissection
slhouseSlaughter house; a factor with levels slh1 and slh2.
sexSex of the pig; a factor with levels castrate and female.
sizeSize of the carcass; a factor with levels normal and large.
Here, normal refers to carcass weight under 80 kg; large refers to carcass weights between 80 and 110 kg.
: Notice that there were slaughtered large pigs only at one slaughter house.
carcass: Contains only the variables Fat11, Fat12, Fat13, Meat11, Meat12, Meat13, LeanMeat
Busk, H., Olsen, E. V., Brøndum, J. (1999) Determination of lean meat in pig carcasses with the Autofom classification system, Meat Science, 52, 307-314
data(carcass) head(carcass)data(carcass) head(carcass)
dataframe with heights of 39 boys and 54 girls from age 1 to 18 and the ages at which they were collected.
child_growthchild_growth
Gender of child
Age at time of data recording
Identification for each child
Height of child
Notice that the ages are not equally spaced. Data are taken from the fda package (growth) but put in long format here.
Ramsay, James O., Hooker, Giles, and Graves, Spencer (2009), _Functional data analysis with R and Matlab_, Springer, New York. Ramsay, James O., and Silverman, Bernard W. (2005), _Functional Data Analysis, 2nd ed._, Springer, New York. Ramsay, James O., and Silverman, Bernard W. (2002), _Applied Functional Data Analysis_, Springer, New York. Tuddenham, R. D., and Snyder, M. M. (1954) "Physical growth of California boys and girls from birth to age 18", _University of California Publications in Child Development_, 1, 183-364.
Stomach content data for Atlantic cod (Gadus morhua) in the Gulf of St.Lawrence, Eastern Canada. Note: many prey items were of no interest for this analysis and were regrouped into the "Other" category.
codstomcodstom
A data frame with 10000 observations on the following 10 variables.
regiona factor with levels SGSL NGSL
representing the southern and northern Gulf of St. Lawrence, respectively
ship.typea factor with levels 2 3 31
34 90 99
ship.ida factor with levels 11558 11712
136148 136885
136902 137325 151225 151935 99433
tripa factor with levels 10 11
12 179 1999
2 2001 20020808 3 4 5
6 7 8
88 9 95
seta numeric vector
fish.ida numeric vector
fish.lengtha numeric vector, length in mm
prey.massa numeric vector, mass of item in stomach, in g
prey.typea factor with levels Ammodytes_sp
Argis_dent
Chion_opil Detritus Empty Eualus_fab
Eualus_mac Gadus_mor Hyas_aran
Hyas_coar
Lebbeus_gro Lebbeus_pol Leptocl_mac
Mallot_vil
Megan_norv Ophiuroidea Other Paguridae
Pandal_bor Pandal_mon Pasiph_mult
Sabin_sept
Sebastes_sp Them_abys Them_comp Them_lib
Cod are collected either by contracted commerical fishing vessels
(ship.type 90 or 99) or by research vessels. Commercial vessels are
identified by a unique ship.id.
Either one research vessel or several commercial vessels conduct a survey
(trip), during which a trawl, gillnets or hooked lines are set
several times. Most trips are random stratified surveys (depth-based
stratification).
Each trip takes place within one of the regions. The trip
label is only guaranteed to be unique within a region and the set
label is only guaranteed to be unique within a trip.
For each fish caught, the fish.length is recorded and the fish is
allocated a fish.id, but the fish.id is only guaranteed to be
unique within a set. A subset of the fish caught are selected for
stomach analysis (stratified random selection according to fish length; unit
of stratification is the set for research surveys, the combination ship.id
and stratum for surveys conducted by commercial vessels, although strata are
not shown in codstom).
The basic experimental unit in this data set is a cod stomach (one stomach
per fish). Each stomach is uniquely identified by a combination of
region, ship.type, ship.id, trip, set,
and fish.id.
For each prey item found in a stomach, the species and mass of the prey item
are recorded, so there can be multiple observations per stomach. There may
also be several prey items with the same prey.type in the one stomach
(for example many prey.types have been recoded Other, which
produced many instances of Other in the same stomach).
If a stomach is empty, a single observation is recorded with
prey.type Empty and a prey.mass of zero.
Small subset from a larger dataset (more stomachs, more variables,
more prey.types) collected by D. Chabot and M. Hanson, Fisheries &
Oceans Canada [email protected].
data(codstom) str(codstom) # removes multiple occurences of same prey.type in stomachs codstom1 <- summaryBy(prey.mass ~ region + ship.type + ship.id + trip + set + fish.id + prey.type, data = codstom, FUN = sum) # keeps a single line per stomach with the total mass of stomach content codstom2 <- summaryBy(prey.mass ~ region + ship.type + ship.id + trip + set + fish.id, data = codstom, FUN = sum) # mean prey mass per stomach for each trip codstom3 <- summaryBy(prey.mass.sum ~ region + ship.type + ship.id + trip, data = codstom2, FUN = mean) ## Not run: # wide version, one line per stomach, one column per prey type library(reshape) codstom4 <- melt(codstom, id = c(1:7, 9)) codstom5 <- cast(codstom4, region + ship.type + ship.id + trip + set + fish.id + fish.length ~ prey.type, sum) k <- length(names(codstom5)) prey_col <- 8:k out <- codstom5[,prey_col] out[is.na(out)] <- 0 codstom5[,prey_col] <- out codstom5$total.content <- rowSums(codstom5[, prey_col]) ## End(Not run)data(codstom) str(codstom) # removes multiple occurences of same prey.type in stomachs codstom1 <- summaryBy(prey.mass ~ region + ship.type + ship.id + trip + set + fish.id + prey.type, data = codstom, FUN = sum) # keeps a single line per stomach with the total mass of stomach content codstom2 <- summaryBy(prey.mass ~ region + ship.type + ship.id + trip + set + fish.id, data = codstom, FUN = sum) # mean prey mass per stomach for each trip codstom3 <- summaryBy(prey.mass.sum ~ region + ship.type + ship.id + trip, data = codstom2, FUN = mean) ## Not run: # wide version, one line per stomach, one column per prey type library(reshape) codstom4 <- melt(codstom, id = c(1:7, 9)) codstom5 <- cast(codstom4, region + ship.type + ship.id + trip + set + fish.id + fish.length ~ prey.type, sum) k <- length(names(codstom5)) prey_col <- 8:k out <- codstom5[,prey_col] out[is.na(out)] <- 0 codstom5[,prey_col] <- out codstom5$total.content <- rowSums(codstom5[, prey_col]) ## End(Not run)
Mating songs of male tree crickets.
cricketscrickets
This data frame contains:
Species, (exis, nius), see details
temperature
pulse per second
Walker (1962) studied the mating songs of male tree crickets. Each
wingstroke by a cricket produces a pulse of song, and females may
use the number of pulses per second to identify males of the
correct species. Walker (1962) wanted to know whether the chirps
of the crickets Oecanthus exclamationis (abbreviated exis) and
Oecanthus niveus (abbreviated nius) had different pulse rates. See
the biostathandbook for details. (The
abbreviations are made from the the first two and last two letters
of the species.) Walker measured the pulse rate of the crickets
(variable pps) at a variety of temperatures (temp):
data(crickets) coplot(pps ~ temp | species, data=crickets)data(crickets) coplot(pps ~ temp | species, data=crickets)
Crime rates per 100,000 inhabitants in states of the USA for different crime types in 1977.
crime_ratecrime_rate
This data frame contains:
crime of murder
residential theft
unlawful taking of personal property (pocket picking)
data(crime_rate)data(crime_rate)
Crime rates per 100,000 inhabitants in states of the USA for different crime types in 1977.
crimeRatecrimeRate
This data frame contains:
State of the USA
crime of murder
residential theft
unlawful taking of personal property (pocket picking)
data(crimeRate)data(crimeRate)
Yield from Danish agricultural production of grain and root crop.
cropyieldcropyield
A dataframe with 97 rows and 7 columns.
yearFrom 1901 to 1997.
precipMilimeter precipitation.
yieldMillion feed units (see details).
areaArea in 1000 ha for grains and root crop.
fertil1000 tons fertilizer.
avgtmp1Average temperature April-June (3 months).
avgtmp2Average temperature July-Octobre (4 months).
A feed unit is the amount of energy in a kg of barley.
Danmarks statistik (Statistics Denmark).
Cross-validation for list of glm objects
cv_glm_fitlist(data., fit_list, K = 10)cv_glm_fitlist(data., fit_list, K = 10)
data. |
A data frame |
fit_list |
A list of glm objects |
K |
Number of folds |
Perturbations of the p53 pathway are associated with more aggressive and therapeutically refractory tumours. We preprocessed the data using Robust Multichip Analysis (RMA). Dataset has been truncated to the 1000 most informative genes (as selected by Wilcoxon test statistics) to simplify computation. The genes have been standardized to have zero mean and unit variance (i.e. z-scored).
breastcancerbreastcancer
A data frame with 250 observations on 1001 variables. The
first 1000 columns are numerical variables; the last column
(named code) is a factor with levels case and
control.
The factor code defines whether there was a mutation in the p53
sequence (code=case) or not (code=control).
Chris Holmes, [email protected]
Miller et al (2005, PubMed ID:16141321)
data(breastcancer) bc <- breastcancer pairs(bc[,1:5], col=bc$code) train <- sample(1:nrow(bc), 50) table(bc$code[train]) ## Not run: library(MASS) z <- lda(code ~ ., data=bc, prior = c(1, 1) / 2, subset = train) pc <- predict(z, bc[-train, ])$class pc bc[-train, "code"] table(pc, bc[-train, "code"]) ## End(Not run)data(breastcancer) bc <- breastcancer pairs(bc[,1:5], col=bc$code) train <- sample(1:nrow(bc), 50) table(bc$code[train]) ## Not run: library(MASS) z <- lda(code ~ ., data=bc, prior = c(1, 1) / 2, subset = train) pc <- predict(z, bc[-train, ])$class pc bc[-train, "code"] table(pc, bc[-train, "code"]) ## End(Not run)
Experiment on the toxicity to the tobacco budworm Heliothis virescens of doses of the pyrethroid trans-cypermethrin to which the moths were beginning to show resistance. Batches of 20 moths of each sex were exposed for three days to the pyrethroid and the number in each batch that were dead or knocked down was recorded. Data is reported in Collett (1991, p. 75).
budwormbudworm
This data frame contains 12 rows and 4 columns:
sex of the budworm.
dose of the insecticide trans-cypermethrin (in micro grams)
.
budworms killed in a trial.
total number of budworms exposed per trial.
Collett, D. (1991) Modelling Binary Data, Chapman & Hall, London, Example 3.7
Venables, W.N; Ripley, B.D.(1999) Modern Applied Statistics with S-Plus, Heidelberg, Springer, 3rd edition, chapter 7.2
data(budworm) ## function to caclulate the empirical logits empirical.logit<- function(nevent,ntotal) { y <- log((nevent + 0.5) / (ntotal - nevent + 0.5)) y } # plot the empirical logits against log-dose log.dose <- log(budworm$dose) emp.logit <- empirical.logit(budworm$ndead, budworm$ntotal) plot(log.dose, emp.logit, type='n', xlab='log-dose',ylab='emprirical logit') title('budworm: emprirical logits of probability to die ') male <- budworm$sex=='male' female <- budworm$sex=='female' lines(log.dose[male], emp.logit[male], type='b', lty=1, col=1) lines(log.dose[female], emp.logit[female], type='b', lty=2, col=2) legend(0.5, 2, legend=c('male', 'female'), lty=c(1,2), col=c(1,2)) ## Not run: * SAS example; data budworm; infile 'budworm.txt' firstobs=2; input sex dose ndead ntotal; run; ## End(Not run)data(budworm) ## function to caclulate the empirical logits empirical.logit<- function(nevent,ntotal) { y <- log((nevent + 0.5) / (ntotal - nevent + 0.5)) y } # plot the empirical logits against log-dose log.dose <- log(budworm$dose) emp.logit <- empirical.logit(budworm$ndead, budworm$ntotal) plot(log.dose, emp.logit, type='n', xlab='log-dose',ylab='emprirical logit') title('budworm: emprirical logits of probability to die ') male <- budworm$sex=='male' female <- budworm$sex=='female' lines(log.dose[male], emp.logit[male], type='b', lty=1, col=1) lines(log.dose[female], emp.logit[female], type='b', lty=2, col=2) legend(0.5, 2, legend=c('male', 'female'), lty=c(1,2), col=c(1,2)) ## Not run: * SAS example; data budworm; infile 'budworm.txt' firstobs=2; input sex dose ndead ntotal; run; ## End(Not run)
A cross classified table with observational data from a Danish heart clinic. The response variable is CAD (coronary artery disease, some times called heart attack).
data(cad1)data(cad1)
A data frame with 236 observations on the following 14 variables.
SexSex; a factor with levels Female Male
AngPecAngina pectoris (chest pain attacks); a
factor with levels Atypical None Typical
AMIAcute myocardic infarct; a factor with
levels Definite NotCertain
QWaveA reading from an electrocardiogram; a
factor with levels No Yes; Yes means pathological and is a sign of previous myocardial infarction.
QWavecodea factor with levels Nonusable
Usable. An assesment of whether QWave is reliable.
STcodea factor with levels
Nonusable Usable. An assesment of whether STchange is reliable.
STchangeA reading from an electrocardiogram; a factor
with levels No Yes. An STchange indicates a blockage of the coronary artery.
SuffHeartFSufficient heart frequency; a factor with levels No, Yes
Hypertrophia factor with levels No, Yes. Hypertrophy refers to an
increased size of the heart muscle due to exercise.
Hyperchola factor with levels No Yes. Hypercholesterolemia, also called high cholesterol,
is the presence of high levels of cholesterol in the blood.
SmokerIs the patient a smoker; a factor with levels No, Yes.
InheritHereditary predispositions for CAD; a factor with levels No, Yes.
HeartfailPrevious heart failures; a factor with levels No Yes
CADCoronary Artery Disease; a factor with levels
No Yes
. CAD refers to a reduction of blood flow to the heart muscle (commonly known as a heart attack). The diagnosis made from biopsies.
Notice that data are collected at a heart clinic, so data do not represent the population, but are conditional on patients having ended up at the clinic.
cad1: Complete dataset, 236 cases.
cad2: Incomplete dataset, 67 cases. Information on (some of) the variables 'Hyperchol', 'Smoker' and 'Inherit' is missing.
Hansen, J. F. (1980). The clinical diagnosis of ischaemic heart disease due to coronary artery disease. Danish Medical Bulletin
Højsgaard, Søren and Thiesson, Bo (1995). BIFROST - Block recursive models Induced From Relevant knowledge, Observations and Statistical Techniques. Computational Statistics and Data Analysis, vol. 19, p. 155-175 #'
data(cad1) ## maybe str(cad1) ; plot(cad1) ...data(cad1) ## maybe str(cad1) ; plot(cad1) ...
The mathmark data frame has 88 rows and 5 columns.
data(mathmark)data(mathmark)
This data frame contains the following columns: mechanics, vectors, algebra, analysis, statistics.
Søren Højsgaard, [email protected]
David Edwards, An Introduction to Graphical Modelling, Second Edition, Springer Verlag, 2000
data(mathmark)data(mathmark)
The peronality dataframe has 240 rows and 32 columns
data(personality)data(personality)
This dataframe has recordings on the following 32 variables: distant, talkatv, carelss, hardwrk, anxious, agreebl, tense, kind, opposng, relaxed, disorgn, outgoin, approvn, shy, discipl, harsh, persevr, friendl, worryin, respnsi, contrar, sociabl, lazy, coopera, quiet, organiz, criticl, lax, laidbck, withdrw, givinup, easygon
Søren Højsgaard, [email protected]
Origin unclear
data(personality) str(personality)data(personality) str(personality)
Using chemical analysis determine the origin of wines
data(wine)data(wine)
A data frame with 178 observations on the following 14 variables.
Culta factor with levels v1 v2
v3: 3 different graph varieties
AlchAlcohol
MlcaMalic acid
AshAsh
AloaAlcalinity of ash
MgnsMagnesium
TtlpTotal phenols
FlvnFlavanoids
NnfpNonflavanoid phenols
PrntProanthocyanins
ClriColor intensity
HueHue
OodwOD280/OD315 of diluted wines
PrlnProline
Data comes from the UCI Machine Learning Repository. The grape variety
Cult is the class identifier.
Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository https://archive.ics.uci.edu/ml/. Irvine, CA: University of California, School of Information and Computer Science.
See references at https://archive.ics.uci.edu/ml/datasets/Wine/
data(wine) ## maybe str(wine) ; plot(wine) ...data(wine) ## maybe str(wine) ; plot(wine) ...
Computing simple descriptive statistics of a numeric vector - not unlike what proc means of SAS does
descStat(x, na.rm = TRUE)descStat(x, na.rm = TRUE)
x |
A numeric vector |
na.rm |
Should missing values be removed |
A vector with named elements.
Gregor Gorjanc; [email protected]
x <- c(1, 2, 3, 4, NA, NaN) descStat(x)x <- c(1, 2, 3, 4, NA, NaN) descStat(x)
The dietox data frame has 861 rows and 7 columns.
dietoxdietox
This data frame contains the following columns:
Weight in Kg
Cumulated feed intake in Kg
Time (in weeks) in the experiment
Factor; id of each pig
Factor; vitamin E dose; see 'details'.
Factor, copper dose; see 'details'
Start weight in experiment, i.e. weight at week 1.
Factor, id of litter of each pig
Data contains weight of slaughter pigs measured weekly for 12 weeks. Data also contains the start weight (i.e. the weight at week 1). The treatments are 3 different levels of Evit = vitamin E (dose: 0, 100, 200 mg dl-alpha-tocopheryl acetat /kg feed) in combination with 3 different levels of Cu=copper (dose: 0, 35, 175 mg/kg feed) in the feed. The cumulated feed intake is also recorded. The pigs are litter mates.
Lauridsen, C., Højsgaard, S.,Sørensen, M.T. C. (1999) Influence of Dietary Rapeseed Oli, Vitamin E, and Copper on Performance and Antioxidant and Oxidative Status of Pigs. J. Anim. Sci.77:906-916
data(dietox) head(dietox) coplot(Weight ~ Time | Evit * Cu, data=dietox)data(dietox) head(dietox) coplot(Weight ~ Time | Evit * Cu, data=dietox)
Computes linear functions (i.e. weighted sums) of the estimated regression parameters. Can also test the hypothesis, that such a function is equal to a specific value.
esticon(obj, L, beta0, conf.int = TRUE, level = 0.95, joint.test = FALSE, ...) ## S3 method for class 'esticon_class' coef(object, ...) ## S3 method for class 'esticon_class' summary(object, ...) ## S3 method for class 'esticon_class' confint(object, parm, level = 0.95, ...) ## S3 method for class 'esticon_class' vcov(object, ...)esticon(obj, L, beta0, conf.int = TRUE, level = 0.95, joint.test = FALSE, ...) ## S3 method for class 'esticon_class' coef(object, ...) ## S3 method for class 'esticon_class' summary(object, ...) ## S3 method for class 'esticon_class' confint(object, parm, level = 0.95, ...) ## S3 method for class 'esticon_class' vcov(object, ...)
obj |
Regression object (of type lm, glm, lme, geeglm). |
L |
Matrix (or vector) specifying linear functions of the regression parameters (one linear function per row). The number of columns must match the number of fitted regression parameters in the model. See 'details' below. |
beta0 |
A vector of numbers |
conf.int |
TRUE |
level |
The confidence level |
joint.test |
Logical value. If TRUE a 'joint' Wald test for the hypothesis L beta = beta0 is made. Default is that the 'row-wise' tests are made, i.e. (L beta)i=beta0i. If joint.test is TRUE, then no confidence interval etc. is calculated. |
... |
Additional arguments; currently not used. |
object |
An |
parm |
a specification of which parameters are to be given confidence intervals, either a vector of numbers or a vector of names. If missing, all parameters are considered. |
Let the estimated parameters of the model be
A linear function of the estimates is of the form
where
is specified by the
user.
The esticon function calculates l, its standard error and by default also a
95 pct confidence interval. It is sometimes of interest to test the
hypothesis for some value
given by the user. A test is provided for the hypothesis but other values of can be specified.
In general, one can specify r such linear functions at one time by
specifying L to be an matrix where each row consists
of p numbers . Default is
then that is a p vector of 0s but other values can be
given.
It is possible to test simultaneously that all specified linear functions
are equal to the corresponding values in .
For computing contrasts among levels of a single factor, 'contrast.lm' may be more convenient.
Returns a matrix with one row per linear function. Columns contain estimated coefficients, standard errors, t values, degrees of freedom, two-sided p-values, and the lower and upper endpoints of the 1-alpha confidence intervals.
Søren Højsgaard, [email protected]
data(iris) lm1 <- lm(Sepal.Length ~ Sepal.Width + Species + Sepal.Width : Species, data=iris) ## Note that the setosa parameters are set to zero coef(lm1) ## Estimate the intercept for versicolor lambda1 <- c(1, 0, 1, 0, 0, 0) esticon(lm1, L=lambda1) ## Estimate the difference between versicolor and virgica intercept ## and test if the difference is 1 lambda2 <- c(0, 1, -1, 0, 0, 0) esticon(lm1, L=lambda2, beta0=1) ## Do both estimates at one time esticon(lm1, L=rbind(lambda1, lambda2), beta0=c(0, 1)) ## Make a combined test for that the difference between versicolor and virgica intercept ## and difference between versicolor and virginica slope is zero: lambda3 <- c(0, 0, 0, 0, 1, -1) esticon(lm1, L=rbind(lambda2, lambda3), joint.test=TRUE) # Example using esticon on coxph objects (thanks to Alessandro A. Leidi). # Using dataset 'veteran' in the survival package # from the Veterans' Administration Lung Cancer study if (require(survival)){ data(veteran) sapply(veteran, class) levels(veteran$celltype) attach(veteran) veteran.s <- Surv(time, status) coxmod <- coxph(veteran.s ~ age + celltype + trt, method='breslow') summary(coxmod) # compare a subject 50 years old with celltype 1 # to a subject 70 years old with celltype 2 # both subjects on the same treatment AvB <- c(-20, -1, 0, 0, 0) # compare a subject 40 years old with celltype 2 on treat=0 # to a subject 35 years old with celltype 3 on treat=1 CvB <- c(5, 1, -1, 0, -1) est <- esticon(coxmod, L=rbind(AvB, CvB)) est ##exp(est[, c(2, 7, 8)]) }data(iris) lm1 <- lm(Sepal.Length ~ Sepal.Width + Species + Sepal.Width : Species, data=iris) ## Note that the setosa parameters are set to zero coef(lm1) ## Estimate the intercept for versicolor lambda1 <- c(1, 0, 1, 0, 0, 0) esticon(lm1, L=lambda1) ## Estimate the difference between versicolor and virgica intercept ## and test if the difference is 1 lambda2 <- c(0, 1, -1, 0, 0, 0) esticon(lm1, L=lambda2, beta0=1) ## Do both estimates at one time esticon(lm1, L=rbind(lambda1, lambda2), beta0=c(0, 1)) ## Make a combined test for that the difference between versicolor and virgica intercept ## and difference between versicolor and virginica slope is zero: lambda3 <- c(0, 0, 0, 0, 1, -1) esticon(lm1, L=rbind(lambda2, lambda3), joint.test=TRUE) # Example using esticon on coxph objects (thanks to Alessandro A. Leidi). # Using dataset 'veteran' in the survival package # from the Veterans' Administration Lung Cancer study if (require(survival)){ data(veteran) sapply(veteran, class) levels(veteran$celltype) attach(veteran) veteran.s <- Surv(time, status) coxmod <- coxph(veteran.s ~ age + celltype + trt, method='breslow') summary(coxmod) # compare a subject 50 years old with celltype 1 # to a subject 70 years old with celltype 2 # both subjects on the same treatment AvB <- c(-20, -1, 0, 0, 0) # compare a subject 40 years old with celltype 2 on treat=0 # to a subject 35 years old with celltype 3 on treat=1 CvB <- c(5, 1, -1, 0, -1) est <- esticon(coxmod, L=rbind(AvB, CvB)) est ##exp(est[, c(2, 7, 8)]) }
Convert expression into function object.
expr_to_fun(expr_, order = NULL, vec_arg = FALSE)expr_to_fun(expr_, order = NULL, vec_arg = FALSE)
expr_ |
R expression. |
order |
desired order of function argument. |
vec_arg |
should the function take vector valued argument. |
ee <- expression(b1 + (b0 - b1)*exp(-k*x) + b2*x) ff1 <- expr_to_fun(ee) formals(ff1) ff2 <- expr_to_fun(ee, vec_arg=TRUE) formals(ff2) formals(ff2)$length_parm formals(ff2)$names_parm |> eval() ee <- expression(matrix(c(x1+x2, x1-x2, x1^2+x2^2, x1^3+x2^3), nrow=2)) ff1 <- expr_to_fun(ee) ff2 <- expr_to_fun(ee, vec_arg=TRUE) formals(ff2) formals(ff2)$length_parm formals(ff2)$names_parm |> eval()ee <- expression(b1 + (b0 - b1)*exp(-k*x) + b2*x) ff1 <- expr_to_fun(ee) formals(ff1) ff2 <- expr_to_fun(ee, vec_arg=TRUE) formals(ff2) formals(ff2)$length_parm formals(ff2)$names_parm |> eval() ee <- expression(matrix(c(x1+x2, x1-x2, x1^2+x2^2, x1^3+x2^3), nrow=2)) ff1 <- expr_to_fun(ee) ff2 <- expr_to_fun(ee, vec_arg=TRUE) formals(ff2) formals(ff2)$length_parm formals(ff2)$names_parm |> eval()
Fish oil in pig food
fatacidfatacid
A dataframe.
A fish oil fatty acid X14 has been added in
different concentrations to the food for pigs in a
study. Interest is in studying how much of the fatty acid can
be found in the tissue. The concentrations of x14 in the
food are verb+dose+={0.0, 4.4, 6.2, 9.3}.
The pigs are fed with this food until their weight is 60 kg. From thereof and until they are slaughtered at 100kg, their food does not contain the fish oil. At 60kg (sample=1) and 100kg (sample=2) muscle biopsies are made and the concentration of x14 is determined. Measurements on the same pig are correlated, and pigs are additionally related through litters.
Data courtesy of Charlotte Lauridsen, Department of Animal Science, Aarhus University, Denmark.
Dataset to examine if respiratory function in children was influenced by smoking.
fevfev
A data frame with 654 observations on the following 5 variables.
AgeAge in years.
FEVForced expiratory volume in liters per second.
HtHeight in centimeters.
GenderGender.
SmokeSmoking status.
I. Tager and S. Weiss and B. Rosner and F. Speizer (1979). Effect of Parental Cigarette Smoking on the Pulmonary Function of Children. American Journal of Epidemiology. 110:15-26
data(fev) summary(fev)data(fev) summary(fev)
Locate the index of the first/last unique value in i) a vector or of a variable in a data frame.
lastobs(x, ...) firstobs(x, ...) ## Default S3 method: lastobs(x, ...) ## Default S3 method: firstobs(x, ...) ## S3 method for class 'formula' lastobs(formula, data = parent.frame(), ...) ## S3 method for class 'formula' firstobs(formula, data = parent.frame(), ...)lastobs(x, ...) firstobs(x, ...) ## Default S3 method: lastobs(x, ...) ## Default S3 method: firstobs(x, ...) ## S3 method for class 'formula' lastobs(formula, data = parent.frame(), ...) ## S3 method for class 'formula' firstobs(formula, data = parent.frame(), ...)
x |
A vector |
... |
Currently not used |
formula |
A formula (only the first term is used, see 'details'). |
data |
A data frame |
If writing ~a + b + c as formula, then only a is considered.
A vector.
Søren Højsgaard, [email protected]
x <- c(rep(1, 5), rep(2, 3), rep(3, 7), rep(1, 4)) firstobs(x) lastobs(x) data(dietox) firstobs(~Pig, data=dietox) lastobs(~Pig, data=dietox)x <- c(rep(1, 5), rep(2, 3), rep(3, 7), rep(1, 4)) firstobs(x) lastobs(x) data(dietox) firstobs(~Pig, data=dietox) lastobs(~Pig, data=dietox)
Formula operations and coercion as a supplement to update.formula()
formula_add_str(frm1, terms, op = "+") formula_add(frm1, frm2) formula_poly(chr1, n, noint = FALSE, y = NULL) formula_nth(frm1, n) formula_to_interaction_matrix(frm1) formula_chr_to_form(rhs, lhs = character(0)) to_str(chr1, collapse = "+") terms_labels(frm1) simplify_rhs(object) ## S3 method for class 'formula' simplify_rhs(object) ## S3 method for class 'character' simplify_rhs(object) as_rhs_frm(object) as_lhs_frm(object) as_rhs_chr(object, string = FALSE) as_lhs_chr(object, string = FALSE) unique_formula(list_of_formulas)formula_add_str(frm1, terms, op = "+") formula_add(frm1, frm2) formula_poly(chr1, n, noint = FALSE, y = NULL) formula_nth(frm1, n) formula_to_interaction_matrix(frm1) formula_chr_to_form(rhs, lhs = character(0)) to_str(chr1, collapse = "+") terms_labels(frm1) simplify_rhs(object) ## S3 method for class 'formula' simplify_rhs(object) ## S3 method for class 'character' simplify_rhs(object) as_rhs_frm(object) as_lhs_frm(object) as_rhs_chr(object, string = FALSE) as_lhs_chr(object, string = FALSE) unique_formula(list_of_formulas)
frm1, frm2
|
Formulas to be coerced to character vectors. |
terms |
Character string. |
op |
Either "+" (default) or "-". |
chr1 |
Character vector to be coerced to formulas. |
n |
Positive integer. |
noint |
Boolean. |
y |
Response |
rhs, lhs
|
right-hand-side and left-hand-side for formula (as characters) |
collapse |
Character to use as separator. |
object |
Character vector or formula. |
string |
Boolean. |
list_of_formulas |
list of formulas |
formula_poly("z", 2) formula_poly("z", 2, noint=TRUE) as_rhs_chr(c("a", "b", "z")) as_rhs_chr(c("a*b", "z")) as_rhs_chr(y~a+b+z) as_rhs_chr(y~a+b+z, string=TRUE) as_rhs_chr(y~a+b+z) as_rhs_chr(y~a*b+z) as_rhs_chr(y~a*b+z, string=TRUE) as_lhs_chr(y~a*b+z) as_lhs_chr(log(y)~a*b+z) ## Not what one might expect as_lhs_chr(cbind(y, u)~a*b+z) ## Not what one might expect formula_chr_to_form(c("a*b", "z")) formula_chr_to_form(c("a*b", "z"), "y") formula_chr_to_form(c("a*b", "z"), "log(y)") formula_add(y~a*b+z, ~-1) formula_add(y~a*b+z, ~a:b) formula_add_str(y~x1 + x2, "x3") formula_add_str(y~x1 + x2, "x1") formula_add_str(y~x1 + x2, "x1", op="-")formula_poly("z", 2) formula_poly("z", 2, noint=TRUE) as_rhs_chr(c("a", "b", "z")) as_rhs_chr(c("a*b", "z")) as_rhs_chr(y~a+b+z) as_rhs_chr(y~a+b+z, string=TRUE) as_rhs_chr(y~a+b+z) as_rhs_chr(y~a*b+z) as_rhs_chr(y~a*b+z, string=TRUE) as_lhs_chr(y~a*b+z) as_lhs_chr(log(y)~a*b+z) ## Not what one might expect as_lhs_chr(cbind(y, u)~a*b+z) ## Not what one might expect formula_chr_to_form(c("a*b", "z")) formula_chr_to_form(c("a*b", "z"), "y") formula_chr_to_form(c("a*b", "z"), "log(y)") formula_add(y~a*b+z, ~-1) formula_add(y~a*b+z, ~a:b) formula_add_str(y~x1 + x2, "x3") formula_add_str(y~x1 + x2, "x1") formula_add_str(y~x1 + x2, "x1", op="-")
Generate data list
generate_data_list(data., K, method = c("subgroups", "resample"))generate_data_list(data., K, method = c("subgroups", "resample"))
data. |
A data frame |
K |
Number of folds |
method |
Method for generating data |
Get formulas from model_stability_glm_class object
get_formulas(object, unique = TRUE, text = FALSE)get_formulas(object, unique = TRUE, text = FALSE)
object |
A model_stability_glm_class object |
unique |
If TRUE, return unique models |
text |
If TRUE, return text (rather than formula). |
Heat development in cement under hardening related to the chemical composition.
haldCementhaldCement
A data frame with 13 observations on the following 5 variables.
x1Percentage (weight) of [3Ca0][Al2O3]
x2Percentage (weight) of [3Cao][SiO2]
x3Percentage (weight) of [4Ca0][Al2O3][Fe03]
x4Percentage (weight) of [2Cao][SiO2]
yHeat development measured in calories per gram cement after 180 days
Anders Hald (1949); Statistiske Metoder; Akademisk Forlag (in Danish), page 509.
data(haldCement) if( interactive() ){ pairs( haldCement ) } m <- lm(y ~ x1 + x2 + x3 + x4, data=haldCement) summary(m) # Notice: The model explains practically all variation in data; # yet none of the explanatory variables appear to be statistically # significant.data(haldCement) if( interactive() ){ pairs( haldCement ) } m <- lm(y ~ x1 + x2 + x3 + x4, data=haldCement) summary(m) # Notice: The model explains practically all variation in data; # yet none of the explanatory variables appear to be statistically # significant.
head and tail for matrices
head2(x, n = 6, m = n) tail2(x, n = 6, m = n)head2(x, n = 6, m = n) tail2(x, n = 6, m = n)
x |
matrix |
n, m
|
number of rows and columns |
M <- matrix(1:20, nrow=4) head2(M) head2(M, 2)M <- matrix(1:20, nrow=4) head2(M) head2(M, 2)
Data on income, years of educations and ethnicity for a samle of adult Americans aged over 25. The year of sampling is not avalable in the source.
incomeincome
This data frame contains:
Income: Yearly income (thousands of dollars).
Education: Number of years of education (12=high school graduate, 16=college graduate).
Racial-Ethnic group: "b" (black), "h" (hispanic) and "w" (white).
Variable names are as in the reference.
Agresti, A. (2024) Statistical Methods for the Social Sciences, Global Edition (6th edition). ISBN-13: 9781292449197. Table 13.1
Plots the mean of the response for two-way combinations of factors, thereby illustrating possible interactions.
interaction_plot(.data, .formula, interval = "conf.int")interaction_plot(.data, .formula, interval = "conf.int")
.data |
A data frame |
.formula |
A formula of the form |
interval |
Either |
This is a recent addition to the package and is subject to change.
income$educf <- cut(income$educ, breaks=3) income |> interaction_plot(inc ~ race + educf) income |> interaction_plot(inc ~ race + educf, interval="conf.int") income |> interaction_plot(inc ~ race + educf, interval="boxplot") income |> interaction_plot(inc ~ race + educf, interval="none")income$educf <- cut(income$educ, breaks=3) income |> interaction_plot(inc ~ race + educf) income |> interaction_plot(inc ~ race + educf, interval="conf.int") income |> interaction_plot(inc ~ race + educf, interval="boxplot") income |> interaction_plot(inc ~ race + educf, interval="none")
Determines if contrasts are estimable, that is, if the contrasts can be written as a linear function of the data.
is_estimable(K, null.basis)is_estimable(K, null.basis)
K |
A matrix. |
null.basis |
A basis for a null space (can be found with
|
Consider the setting . A linear function of ,
say is estimable if and only if there exists an such
that or equivalently . Hence must be in
the column space of , i.e. in the orthogonal complement of the
null space of . Hence, with a basis for the null space,
is_estimable() checks if each row of the matrix is
perpendicular to each column basis vector in .
A logical vector.
Søren Højsgaard, [email protected]
http://web.mit.edu/18.06/www/Essays/newpaper_ver3.pdf
Helper for building lagged design matrices for time series or
longitudinal data. The function aligns a response vector y
and a single regressor x (possibly with several lags) and
returns a list with the cleaned response, the design matrix and
the row indices used.
You can either:
supply y and x directly, or
use a formula of the form y ~ x together with data.
lag_data( formula = NULL, data = NULL, y = NULL, x = NULL, lags = 0, include_intercept = TRUE, preserve_ts = FALSE )lag_data( formula = NULL, data = NULL, y = NULL, x = NULL, lags = 0, include_intercept = TRUE, preserve_ts = FALSE )
formula |
An optional model formula. The left-hand side is taken as the response and the right-hand side as the regressor(s). Only one non-intercept regressor is currently allowed when lagging is used. |
data |
Optional data frame in which to evaluate |
y |
Optional numeric or time series object with the response.
Ignored if |
x |
Optional numeric vector with the regressor to be lagged.
Ignored if |
lags |
Integer vector of lags to include. The default |
include_intercept |
Logical; if |
preserve_ts |
Logical; if |
When using the formula interface, the function creates a model frame
internally. The first column is taken as the response. All remaining
columns are considered regressors. If the right-hand side contains
only an intercept, x is set to NULL and no lagged regressors are
constructed.
If multiple regressors are present on the right-hand side, they are
combined into a matrix, but in the current implementation this is
only allowed when no lagging is requested (i.e., lags is 0).
Attempting to lag multiple regressors will result in an error.
For a given vector of lags, a lag matrix is built with one column
per lag. The column names are "lag0", "lag1", etc., corresponding
to the entries in lags. Rows which involve undefined values (due to
lagging) are dropped from both y and X. The indices of the rows
kept from the original series are returned in the rows component.
A list with components
The aligned response vector. If preserve_ts = TRUE
and y was a ts object, then this is a ts object; otherwise
a numeric vector.
The design matrix of lagged regressors, including an
intercept column if include_intercept = TRUE. If x is NULL,
then X is NULL.
Integer vector of row indices from the original data that are retained after lagging.
The maximum lag used, i.e. max(lags).
lag, embed,
and ts for related time series utilities.
## ------------------------------------------------------------ ## 1. Basic usage with y/x interface ## ------------------------------------------------------------ set.seed(123) n <- 10 y <- rnorm(n) x <- rnorm(n) ## Use current and one lag of x ld1 <- lag_data(y = y, x = x, lags = 0:1) ld1$y # aligned response ld1$X # design matrix (Intercept, lag0, lag1) ld1$rows # indices retained ## ------------------------------------------------------------ ## 2. Formula interface with a data frame ## ------------------------------------------------------------ dat <- data.frame( y = rnorm(20), x = rnorm(20) ) ## Use y ~ x with one lag ld2 <- lag_data(y ~ x, data = dat, lags = 0:2) head(ld2$X) length(ld2$y) ## ------------------------------------------------------------ ## 3. Using the result in a regression / ARX setting ## ------------------------------------------------------------ ## Here we regress y_t on an intercept and lagged x's ld3 <- lag_data(y = y, x = x, lags = 0:2) ## Remove intercept column when passing to lm() fit_lm <- lm(ld3$y ~ ld3$X[, -1]) coef(fit_lm) ## ------------------------------------------------------------ ## 4. Time series example with preserve_ts = TRUE ## ------------------------------------------------------------ y_ts <- Nile ## Regress Nile flow on its own lag-1 (simple ARX-like setup) ld4 <- lag_data(y = y_ts, x = y_ts, lags = 1, preserve_ts = TRUE) ## y is now a ts object starting at the appropriate time start(y_ts) start(ld4$y) ## Fit an AR(1) model with lagged Nile as xreg ## (intercept in X, lag in the second column) fit_arx <- stats::arima(ld4$y, order = c(1, 0, 0), xreg = ld4$X[, -1, drop = FALSE]) fit_arx## ------------------------------------------------------------ ## 1. Basic usage with y/x interface ## ------------------------------------------------------------ set.seed(123) n <- 10 y <- rnorm(n) x <- rnorm(n) ## Use current and one lag of x ld1 <- lag_data(y = y, x = x, lags = 0:1) ld1$y # aligned response ld1$X # design matrix (Intercept, lag0, lag1) ld1$rows # indices retained ## ------------------------------------------------------------ ## 2. Formula interface with a data frame ## ------------------------------------------------------------ dat <- data.frame( y = rnorm(20), x = rnorm(20) ) ## Use y ~ x with one lag ld2 <- lag_data(y ~ x, data = dat, lags = 0:2) head(ld2$X) length(ld2$y) ## ------------------------------------------------------------ ## 3. Using the result in a regression / ARX setting ## ------------------------------------------------------------ ## Here we regress y_t on an intercept and lagged x's ld3 <- lag_data(y = y, x = x, lags = 0:2) ## Remove intercept column when passing to lm() fit_lm <- lm(ld3$y ~ ld3$X[, -1]) coef(fit_lm) ## ------------------------------------------------------------ ## 4. Time series example with preserve_ts = TRUE ## ------------------------------------------------------------ y_ts <- Nile ## Regress Nile flow on its own lag-1 (simple ARX-like setup) ld4 <- lag_data(y = y_ts, x = y_ts, lags = 1, preserve_ts = TRUE) ## y is now a ts object starting at the appropriate time start(y_ts) start(ld4$y) ## Fit an AR(1) model with lagged Nile as xreg ## (intercept in X, lag in the second column) fit_arx <- stats::arima(ld4$y, order = c(1, 0, 0), xreg = ld4$X[, -1, drop = FALSE]) fit_arx
Compute linear estimates, i.e. L %*% beta for a range of models. One example of
linear estimates is population means (also known as LSMEANS).
linest(object, L = NULL, level = 0.95, ...) ## S3 method for class 'linest_class' confint(object, parm, level = 0.95, ...) ## S3 method for class 'linest_class' coef(object, ...) ## S3 method for class 'linest_class' summary(object, ...)linest(object, L = NULL, level = 0.95, ...) ## S3 method for class 'linest_class' confint(object, parm, level = 0.95, ...) ## S3 method for class 'linest_class' coef(object, ...) ## S3 method for class 'linest_class' summary(object, ...)
object |
Model object |
L |
Either |
level |
The level of the (asymptotic) confidence interval. |
... |
Additional arguments; currently not used. |
parm |
Specification of the parameters estimates for which confidence intervals are to be calculated. |
confint |
Should confidence interval appear in output. |
A dataframe with results from computing the contrasts.
Søren Højsgaard, [email protected]
## Make balanced dataset dat.bal <- expand.grid(list(AA=factor(1:2), BB=factor(1:3), CC=factor(1:3))) dat.bal$y <- rnorm(nrow(dat.bal)) ## Make unbalanced dataset # 'BB' is nested within 'CC' so BB=1 is only found when CC=1 # and BB=2,3 are found in each CC=2,3,4 dat.nst <- dat.bal dat.nst$CC <-factor(c(1,1,2,2,2,2,1,1,3,3,3,3,1,1,4,4,4,4)) mod.bal <- lm(y ~ AA + BB * CC, data=dat.bal) mod.nst <- lm(y ~ AA + BB : CC, data=dat.nst) L <- LE_matrix(mod.nst, effect=c("BB", "CC")) linest( mod.nst, L )## Make balanced dataset dat.bal <- expand.grid(list(AA=factor(1:2), BB=factor(1:3), CC=factor(1:3))) dat.bal$y <- rnorm(nrow(dat.bal)) ## Make unbalanced dataset # 'BB' is nested within 'CC' so BB=1 is only found when CC=1 # and BB=2,3 are found in each CC=2,3,4 dat.nst <- dat.bal dat.nst$CC <-factor(c(1,1,2,2,2,2,1,1,3,3,3,3,1,1,4,4,4,4)) mod.bal <- lm(y ~ AA + BB * CC, data=dat.bal) mod.nst <- lm(y ~ AA + BB : CC, data=dat.nst) L <- LE_matrix(mod.nst, effect=c("BB", "CC")) linest( mod.nst, L )
Auxillary functions for computing lsmeans, contrasts etc.
get_xlevels(obj) ## Default S3 method: get_xlevels(obj) ## S3 method for class 'mer' get_xlevels(obj) ## S3 method for class 'merMod' get_xlevels(obj) get_contrasts(obj) ## Default S3 method: get_contrasts(obj) ## S3 method for class 'merMod' get_contrasts(obj) set_xlevels(xlev, at) get_vartypes(obj) set_covariate_val(xlev, covariateVal) get_X(obj, newdata, at = NULL) ## Default S3 method: get_X(obj, newdata, at = NULL) ## S3 method for class 'merMod' get_X(obj, newdata, at = NULL)get_xlevels(obj) ## Default S3 method: get_xlevels(obj) ## S3 method for class 'mer' get_xlevels(obj) ## S3 method for class 'merMod' get_xlevels(obj) get_contrasts(obj) ## Default S3 method: get_contrasts(obj) ## S3 method for class 'merMod' get_contrasts(obj) set_xlevels(xlev, at) get_vartypes(obj) set_covariate_val(xlev, covariateVal) get_X(obj, newdata, at = NULL) ## Default S3 method: get_X(obj, newdata, at = NULL) ## S3 method for class 'merMod' get_X(obj, newdata, at = NULL)
obj |
An R object |
xlev |
FIXME: to be described |
at |
FIXME: to be described |
covariateVal |
FIXME: to be described |
newdata |
FIXME: to be described |
Generate matrix specifying linear estimate.
LE_matrix(object, effect = NULL, at = NULL) ## Default S3 method: LE_matrix(object, effect = NULL, at = NULL) aggregate_linest_list(linest_list) get_linest_list(object, effect = NULL, at = NULL)LE_matrix(object, effect = NULL, at = NULL) ## Default S3 method: LE_matrix(object, effect = NULL, at = NULL) aggregate_linest_list(linest_list) get_linest_list(object, effect = NULL, at = NULL)
object |
Model object |
effect |
A vector of variables. For each configuration of these the estimate will be calculated. |
at |
Either NULL, a list or a dataframe. 1) If a list, then the list must consist of covariates (including levels of some factors) to be used in the calculations. 2) If a dataframe, the dataframe is split rowwise and the function is invoked on each row. |
linest_list |
Linear estimate list (as generated by |
Check this
## Two way anova: data(warpbreaks) ## An additive model m0 <- lm(breaks ~ wool + tension, data=warpbreaks) ## Estimate mean for each wool type, for tension="M": K <- LE_matrix(m0, at=list(wool=c("A", "B"), tension="M")) K ## Vanilla computation: K %*% coef(m0) ## Alternative; also providing standard errors etc: linest(m0, K) esticon(m0, K) ## Estimate mean for each wool type when averaging over tension; # two ways of doing this K <- LE_matrix(m0, at=list(wool=c("A", "B"))) K K <- LE_matrix(m0, effect="wool") K linest(m0, K) ## The linear estimate is sometimes called to "least squares mean" # (LSmeans) or popupulation means. # Same as LSmeans(m0, effect="wool") ## Without mentioning 'effect' or 'at' an average across all #predictors are calculated: K <- LE_matrix(m0) K linest(m0, K) ## Because the design is balanced (9 observations per combination #of wool and tension) this is the same as computing the average. If #the design is not balanced, the two quantities are in general not #the same. mean(warpbreaks$breaks) ## Same as LSmeans(m0) ## An interaction model m1 <- lm(breaks ~ wool * tension, data=warpbreaks) K <- LE_matrix(m1, at=list(wool=c("A", "B"), tension="M")) K linest(m1, K) K <- LE_matrix(m1, at=list(wool=c("A", "B"))) K linest(m1, K) K <- LE_matrix(m1, effect="wool") K linest(m1, K) LSmeans(m1, effect="wool") K <- LE_matrix(m1) K linest(m1, K) LSmeans(m1)## Two way anova: data(warpbreaks) ## An additive model m0 <- lm(breaks ~ wool + tension, data=warpbreaks) ## Estimate mean for each wool type, for tension="M": K <- LE_matrix(m0, at=list(wool=c("A", "B"), tension="M")) K ## Vanilla computation: K %*% coef(m0) ## Alternative; also providing standard errors etc: linest(m0, K) esticon(m0, K) ## Estimate mean for each wool type when averaging over tension; # two ways of doing this K <- LE_matrix(m0, at=list(wool=c("A", "B"))) K K <- LE_matrix(m0, effect="wool") K linest(m0, K) ## The linear estimate is sometimes called to "least squares mean" # (LSmeans) or popupulation means. # Same as LSmeans(m0, effect="wool") ## Without mentioning 'effect' or 'at' an average across all #predictors are calculated: K <- LE_matrix(m0) K linest(m0, K) ## Because the design is balanced (9 observations per combination #of wool and tension) this is the same as computing the average. If #the design is not balanced, the two quantities are in general not #the same. mean(warpbreaks$breaks) ## Same as LSmeans(m0) ## An interaction model m1 <- lm(breaks ~ wool * tension, data=warpbreaks) K <- LE_matrix(m1, at=list(wool=c("A", "B"), tension="M")) K linest(m1, K) K <- LE_matrix(m1, at=list(wool=c("A", "B"))) K linest(m1, K) K <- LE_matrix(m1, effect="wool") K linest(m1, K) LSmeans(m1, effect="wool") K <- LE_matrix(m1) K linest(m1, K) LSmeans(m1)
LS-means (least squares means, also known as population means and as marginal means) for a range of model types.
LSmeans(object, effect = NULL, at = NULL, level = 0.95, ...) ## Default S3 method: LSmeans(object, effect = NULL, at = NULL, level = 0.95, ...) ## S3 method for class 'lmerMod' LSmeans(object, effect = NULL, at = NULL, level = 0.95, adjust.df = TRUE, ...) popMeans(object, effect = NULL, at = NULL, level = 0.95, ...) ## Default S3 method: popMeans(object, effect = NULL, at = NULL, level = 0.95, ...) ## S3 method for class 'lmerMod' popMeans(object, effect = NULL, at = NULL, level = 0.95, adjust.df = TRUE, ...)LSmeans(object, effect = NULL, at = NULL, level = 0.95, ...) ## Default S3 method: LSmeans(object, effect = NULL, at = NULL, level = 0.95, ...) ## S3 method for class 'lmerMod' LSmeans(object, effect = NULL, at = NULL, level = 0.95, adjust.df = TRUE, ...) popMeans(object, effect = NULL, at = NULL, level = 0.95, ...) ## Default S3 method: popMeans(object, effect = NULL, at = NULL, level = 0.95, ...) ## S3 method for class 'lmerMod' popMeans(object, effect = NULL, at = NULL, level = 0.95, adjust.df = TRUE, ...)
object |
Model object |
effect |
A vector of variables. For each configuration of these the estimate will be calculated. |
at |
A list of values of covariates (including levels of some factors) to be used in the calculations |
level |
The level of the (asymptotic) confidence interval. |
... |
Additional arguments; currently not used. |
adjust.df |
Should denominator degrees of freedom be adjusted? |
There are restrictions on the formulas allowed in the model object.
For example having y ~ log(x) will cause an error. Instead one
must define the variable logx = log(x) and do y ~ logx.
A dataframe with results from computing the contrasts.
Notice that LSmeans and LE_matrix
fails if the model formula contains an offset (as one would
have in connection with e.g. Poisson regression.
LSmeans and popMeans are synonymous.
Søren Højsgaard, [email protected]
## Two way anova: data(warpbreaks) m0 <- lm(breaks ~ wool + tension, data=warpbreaks) m1 <- lm(breaks ~ wool * tension, data=warpbreaks) LSmeans(m0) LSmeans(m1) ## same as: K <- LE_matrix(m0);K linest(m0, K) K <- LE_matrix(m1);K linest(m1, K) LE_matrix(m0, effect="wool") LSmeans(m0, effect="wool") LE_matrix(m1, effect="wool") LSmeans(m1, effect="wool") LE_matrix(m0, effect=c("wool", "tension")) LSmeans(m0, effect=c("wool", "tension")) LE_matrix(m1, effect=c("wool", "tension")) LSmeans(m1, effect=c("wool", "tension")) ## Regression; two parallel regression lines: data(Puromycin) m0 <- lm(rate ~ state + log(conc), data=Puromycin) ## Can not use LSmeans / LE_matrix here because of ## the log-transformation. Instead we must do: Puromycin$lconc <- log( Puromycin$conc ) m1 <- lm(rate ~ state + lconc, data=Puromycin) LE_matrix(m1) LSmeans(m1) LE_matrix(m1, effect="state") LSmeans(m1, effect="state") LE_matrix(m1, effect="state", at=list(lconc=3)) LSmeans(m1, effect="state", at=list(lconc=3)) ## Non estimable contrasts ## ## Make balanced dataset dat.bal <- expand.grid(list(AA=factor(1:2), BB=factor(1:3), CC=factor(1:3))) dat.bal$y <- rnorm(nrow(dat.bal)) ## ## Make unbalanced dataset # 'BB' is nested within 'CC' so BB=1 is only found when CC=1 # and BB=2,3 are found in each CC=2,3,4 dat.nst <- dat.bal dat.nst$CC <-factor(c(1, 1, 2, 2, 2, 2, 1, 1, 3, 3, 3, 3, 1, 1, 4, 4, 4, 4)) mod.bal <- lm(y ~ AA + BB * CC, data=dat.bal) mod.nst <- lm(y ~ AA + BB : CC, data=dat.nst) LSmeans(mod.bal, effect=c("BB", "CC")) LSmeans(mod.nst, effect=c("BB", "CC")) LSmeans(mod.nst, at=list(BB=1, CC=1)) LSmeans(mod.nst, at=list(BB=1, CC=2)) ## Above: NA's are correct; not an estimable function if( require( lme4 )){ warp.mm <- lmer(breaks ~ -1 + tension + (1|wool), data=warpbreaks) LSmeans(warp.mm, effect="tension") class(warp.mm) fixef(warp.mm) coef(summary(warp.mm)) vcov(warp.mm) if (require(pbkrtest)) vcovAdj(warp.mm) } LSmeans(warp.mm, effect="tension")## Two way anova: data(warpbreaks) m0 <- lm(breaks ~ wool + tension, data=warpbreaks) m1 <- lm(breaks ~ wool * tension, data=warpbreaks) LSmeans(m0) LSmeans(m1) ## same as: K <- LE_matrix(m0);K linest(m0, K) K <- LE_matrix(m1);K linest(m1, K) LE_matrix(m0, effect="wool") LSmeans(m0, effect="wool") LE_matrix(m1, effect="wool") LSmeans(m1, effect="wool") LE_matrix(m0, effect=c("wool", "tension")) LSmeans(m0, effect=c("wool", "tension")) LE_matrix(m1, effect=c("wool", "tension")) LSmeans(m1, effect=c("wool", "tension")) ## Regression; two parallel regression lines: data(Puromycin) m0 <- lm(rate ~ state + log(conc), data=Puromycin) ## Can not use LSmeans / LE_matrix here because of ## the log-transformation. Instead we must do: Puromycin$lconc <- log( Puromycin$conc ) m1 <- lm(rate ~ state + lconc, data=Puromycin) LE_matrix(m1) LSmeans(m1) LE_matrix(m1, effect="state") LSmeans(m1, effect="state") LE_matrix(m1, effect="state", at=list(lconc=3)) LSmeans(m1, effect="state", at=list(lconc=3)) ## Non estimable contrasts ## ## Make balanced dataset dat.bal <- expand.grid(list(AA=factor(1:2), BB=factor(1:3), CC=factor(1:3))) dat.bal$y <- rnorm(nrow(dat.bal)) ## ## Make unbalanced dataset # 'BB' is nested within 'CC' so BB=1 is only found when CC=1 # and BB=2,3 are found in each CC=2,3,4 dat.nst <- dat.bal dat.nst$CC <-factor(c(1, 1, 2, 2, 2, 2, 1, 1, 3, 3, 3, 3, 1, 1, 4, 4, 4, 4)) mod.bal <- lm(y ~ AA + BB * CC, data=dat.bal) mod.nst <- lm(y ~ AA + BB : CC, data=dat.nst) LSmeans(mod.bal, effect=c("BB", "CC")) LSmeans(mod.nst, effect=c("BB", "CC")) LSmeans(mod.nst, at=list(BB=1, CC=1)) LSmeans(mod.nst, at=list(BB=1, CC=2)) ## Above: NA's are correct; not an estimable function if( require( lme4 )){ warp.mm <- lmer(breaks ~ -1 + tension + (1|wool), data=warpbreaks) LSmeans(warp.mm, effect="tension") class(warp.mm) fixef(warp.mm) coef(summary(warp.mm)) vcov(warp.mm) if (require(pbkrtest)) vcovAdj(warp.mm) } LSmeans(warp.mm, effect="tension")
Height of a sample of math teachers in Danish high schools collected at a continued education day at Mariager Fjord Gymnasium in 2019.
math_teachersmath_teachers
:
Height in centimeters
Male or female
aggregate(height ~ sex, data=math_teachers, FUN=mean) aggregate(height ~ sex, data=math_teachers, FUN=function(x) {c(mean=mean(x), sd=sd(x))})aggregate(height ~ sex, data=math_teachers, FUN=mean) aggregate(height ~ sex, data=math_teachers, FUN=function(x) {c(mean=mean(x), sd=sd(x))})
Fast summary of microbenchmark object. The default summary method from the microbenchmark package is fairly slow in producing a summary (due to a call to a function from the multcomp package.)
mb_summary(object, unit, add.unit = TRUE, ...) summary_mb(object, unit, add.unit = TRUE, ...)mb_summary(object, unit, add.unit = TRUE, ...) summary_mb(object, unit, add.unit = TRUE, ...)
object |
A microbenchmark object |
unit |
The time unit to be used |
add.unit |
Should time unit be added as column to resulting dataframe. |
... |
Additional arguments; currently not used. |
Milk yield data for cows milked manually twice a day (morning and evening).
milkmanmilkman
A data frame with 161836 observations on the following 12 variables.
cownoa numeric vector; cow identification
lactnoa numeric vector; lactation number
ampma numeric vector; milking time: 1: morning; 2: evening
dfca numeric vector; days from calving
mya numeric vector; milk yield (kg)
fatpcta numeric vector; fat percentage
protpcta numeric vector; protein percentage
lactpcta numeric vector; lactose percentage
scca numeric vector; somatic cell counts
racea factor with levels RDM Holstein Jersey
ecmya numeric vector; energy corrected milk
cowlactCombination of cowno and lactno; necessary because the same cow may appear more than once in the dataset (in different lactations)
There are data for 222 cows. Some cows appear more than once in the dataset (in different lactations) and there are 288 different lactations.
Friggens, N. C.; Ridder, C. and Løvendahl, P. (2007). On the Use of Milk Composition Measures to Predict the Energy Balance of Dairy Cows. J. Dairy Sci. 90:5453–5467 doi:10.3168/jds.2006-821.
This study was part of the Biosens project used data from the “Malkekoens energibalance og mobilisering” project; both were funded by the Danish Ministry of Food, Agriculture and Fisheries and the Danish Cattle Association.
data(milkman)data(milkman)
Model stability for glm objects
model_stability_glm( data., model, n.searches = 10, method = c("subgroups", "resample"), ... )model_stability_glm( data., model, n.searches = 10, method = c("subgroups", "resample"), ... )
data. |
A data frame |
model |
A glm object |
n.searches |
Number of searches |
method |
Method for generating data |
... |
Additional arguments to be passed to |
Near infra red light (NIR) measurements are made at 152 wavelengths on 17 milk samples. While milk runs through a glass tube, infra red light is sent through the tube and the amount of light passing though the tube is measured at different wavelengths. Each milk sample was additionally analysed for fat, lactose, protein and dry matter.
nir_milk NIRmilknir_milk NIRmilk
Data comes in two formats:
nir_milk: A list with two components
x Datafrane with infra red light amount at different wavelengths (column names are the wavelengths; just remove the leading X).
y Datafrane with response variables fat, protein, lactose and dm (drymatter)
NIRmilk: This data frame contains 17 rows and 158 columns.
The first column is the sample number.
The columns named in the form Xw contains
the transmittance (fraction of electromagnetic power)
transmittance through the sample at wavelength w.
The response variables are fat, protein, lactose and dm (dry matter).
An object of class data.frame with 17 rows and 158 columns.
data(nir_milk) data(NIRmilk)data(nir_milk) data(NIRmilk)
Extract components from a formula with the form
y ~ x1 + ... + xn | g1 + ... + gm
parseGroupFormula(form)parseGroupFormula(form)
form |
A formula of the form |
If the formula is y ~ x1 + x2 | g1 + g2 the result is
model |
|
groups |
|
groupFormula |
|
Søren Højsgaard, [email protected]
gf1 <- parseGroupFormula(y ~ x1 + x2 | g1 + g2) gf1 gf2 <- parseGroupFormula( ~ x1 + x2 | g1 + g2) gf2gf1 <- parseGroupFormula(y ~ x1 + x2 | g1 + g2) gf1 gf2 <- parseGroupFormula( ~ x1 + x2 | g1 + g2) gf2
Extract (pick) elements without using brackets so that elements can be picked out as part of a pipe workflow.
pick1(x, which) pick2(x, which)pick1(x, which) pick2(x, which)
x |
A list, data frame, or vector. |
which |
The index or name of the element(s) to extract. |
These two helper functions extract elements from lists, data frames, or vectors. They are simple wrappers for the standard bracket operators in R:
pick1() uses single brackets ([) and returns a subset.
pick2() uses double brackets ([[) and returns the element itself.
These are safer and more flexible than $, especially when used with the base R pipe (|>)
or in functional programming.
pick1() returns a subset of x.
pick2() returns a single element from x.
lst <- list(a = 1:3, b = 4:6) # Without pipe pick1(lst, "a") # List with one element pick2(lst, "a") # Just the vector 1:3 # With base R pipe lst |> pick1("a") lst |> pick2("a") df <- data.frame(x = 1:5, y = letters[1:5]) df |> pick1("y") # Returns a data frame with column 'y' df |> pick2("y") # Returns column 'y' as a character vectorlst <- list(a = 1:3, b = 4:6) # Without pipe pick1(lst, "a") # List with one element pick2(lst, "a") # Just the vector 1:3 # With base R pipe lst |> pick1("a") lst |> pick2("a") df <- data.frame(x = 1:5, y = letters[1:5]) df |> pick1("y") # Returns a data frame with column 'y' df |> pick2("y") # Returns column 'y' as a character vector
A set of simple, vectorized, pipe-friendly arithmetic functions for
transforming numeric data in pipelines. These helpers make common
operations like multiplication, division, addition, subtraction,
exponentiation, and reciprocals clearer when using the native pipe |>.
reciprocal(x) pow(x, p) add(x, k) subtract(x, k) mult(x, k) divide(x, k)reciprocal(x) pow(x, p) add(x, k) subtract(x, k) mult(x, k) divide(x, k)
x |
A numeric vector or scalar. |
p |
A numeric scalar exponent (for |
k |
A numeric scalar for addition, subtraction, multiplication, or division. |
All functions are vectorized and support numeric vectors, scalars, or compatible objects. They are designed to improve the readability of transformation pipelines.
A numeric vector or scalar resulting from the transformation.
x <- c(1, 2, 3) # Multiplication and division x |> mult(10) x |> divide(2) # Addition and subtraction x |> add(5) x |> subtract(1) # Reciprocal x |> reciprocal() # Power x |> pow(2) # Combined use in pipelines x |> mult(2) |> add(3) |> reciprocal()x <- c(1, 2, 3) # Multiplication and division x |> mult(10) x |> divide(2) # Addition and subtraction x |> add(5) x |> subtract(1) # Reciprocal x |> reciprocal() # Power x |> pow(2) # Combined use in pipelines x |> mult(2) |> add(3) |> reciprocal()
Plot linear model object
plot_lm(lm_fit, format = "2x2", global_aes = NULL)plot_lm(lm_fit, format = "2x2", global_aes = NULL)
lm_fit |
An object of class 'lm' |
format |
The format of the plot (or a list of plots if format is "list") |
global_aes |
Currently no effect. |
data(income) m1 <- lm(inc ~ race + educ, data=income) plot_lm(m1) plot_lm(m1, "2x2") plot_lm(m1, "1x4") plot_lm(m1, "4x1") plot_lm(m1, "list")data(income) m1 <- lm(inc ~ race + educ, data=income) plot_lm(m1) plot_lm(m1, "2x2") plot_lm(m1, "1x4") plot_lm(m1, "4x1") plot_lm(m1, "list")
Weight and size of 20 potatoes. Weight in grams; size in millimeter. There
are two sizes: length is the longest length and width is the
shortest length across a potato.
potatoespotatoes
A data frame with 20 observations on the following 3 variables.
weighta numeric vector
lengtha numeric vector
widtha numeric vector
Søren Højsgaard, [email protected]
My own garden; autumn 2015.
data(potatoes) plot(potatoes)data(potatoes) plot(potatoes)
This is the Prostate Tumor Gene Expression dataset used in Chung and Keles (2010).
data(prostate)data(prostate)
A list with two components:
Gene expression data. A matrix with 102 rows and 6033 columns.
Class index. A vector with 102 elements.
The prostate dataset consists of 52 prostate tumor and 50 normal samples.
Normal and tumor classes are coded in 0 and 1, respectively, in y vector.
Matrix x is gene expression data and
arrays were normalized, log transformed, and standardized
to zero mean and unit variance across genes as described
in Dettling (2004) and Dettling and Beuhlmann (2002).
See Chung and Keles (2010) for more details.
Singh D, Febbo P, Ross K, Jackson D, Manola J, Ladd C, Tamayo P, Renshaw A, DAmico A, Richie J, Lander E, Loda M, Kantoff P, Golub T, and Sellers W (2002), "Gene expression correlates of clinical prostate cancer behavior", Cancer Cell, Vol. 1, pp. 203–209.
Chung D and Keles S (2010), "Sparse partial least squares classification for high dimensional data", Statistical Applications in Genetics and Molecular Biology, Vol. 9, Article 17.
Dettling M (2004), "BagBoosting for tumor classification with gene expression data", Bioinformatics, Vol. 20, pp. 3583–3593.
Dettling M and Beuhlmann P (2002), "Supervised clustering of genes", Genome Biology, Vol. 3, pp. research0069.1–0069.15.
data(prostate) prostate$x[1:5,1:5] prostate$ydata(prostate) prostate$x[1:5,1:5] prostate$y
Binds a named list of data frames (or tibbles) into a single data frame. Adds the list name as a new column (first column).
rbind_list(lst, name = "name")rbind_list(lst, name = "name")
lst |
A named list of data frames or tibbles. |
name |
A character scalar: name of the column to hold the list names (default "name"). |
A data frame or tibble, depending on the class of the input.
lst <- list(a = data.frame(x = 1:2), b = data.frame(x = 3:4)) rbind_list(lst) lst <- split(iris, iris$Species) rbind_list(lst)lst <- list(a = data.frame(x = 1:2), b = data.frame(x = 3:4)) rbind_list(lst) lst <- split(iris, iris$Species) rbind_list(lst)
Recodes a vector with values, say 1,2 to a variable with values, say 'a', 'b'
recodeVar(x, src, tgt, default = NULL, keep.na = TRUE) recode_var(x, src, tgt, default = NULL, keep.na = TRUE)recodeVar(x, src, tgt, default = NULL, keep.na = TRUE) recode_var(x, src, tgt, default = NULL, keep.na = TRUE)
x |
A vector; the variable to be recoded. |
src |
The source values: a subset of the present values of x |
tgt |
The target values: the corresponding new values of x |
default |
Default target value for those values of x not listed in
|
keep.na |
If TRUE then NA's in x will be retained in the output |
A vector
Care should be taken if x is a factor. A safe approach may be to convert x to a character vector using as.character.
Søren Højsgaard, [email protected]
x <- c("dec", "jan", "feb", "mar", "apr", "may") src1 <- list(c("dec", "jan", "feb"), c("mar", "apr", "may")) tgt1 <- list("winter", "spring") recodeVar(x, src=src1, tgt=tgt1) #[1] "winter" "winter" "winter" "spring" "spring" "spring" x <- c(rep(1:3, 3)) #[1] 1 2 3 1 2 3 1 2 3 ## Simple usage: recodeVar(x, src=c(1, 2), tgt=c("A", "B")) #[1] "A" "B" NA "A" "B" NA "A" "B" NA ## Here we need to use lists recodeVar(x, src=list(c(1, 2)), tgt=list("A")) #[1] "A" "A" NA "A" "A" NA "A" "A" NA recodeVar(x, src=list(c(1, 2)), tgt=list("A"), default="L") #[1] "A" "A" "L" "A" "A" "L" "A" "A" "L" recodeVar(x, src=list(c(1, 2), 3), tgt=list("A", "B"), default="L") #[1] "A" "A" "B" "A" "A" "B" "A" "A" "B" ## Dealing with NA's in x x<-c(NA,rep(1:3, 3),NA) #[1] NA 1 2 3 1 2 3 1 2 3 NA recodeVar(x, src=list(c(1, 2)), tgt=list("A")) #[1] NA "A" "A" NA "A" "A" NA "A" "A" NA NA recodeVar(x, src=list(c(1, 2)), tgt=list("A"), default="L") #[1] NA "A" "A" "L" "A" "A" "L" "A" "A" "L" NA recodeVar(x, src=list(c(1, 2)), tgt=list("A"), default="L", keep.na=FALSE) #[1] "L" "A" "A" "L" "A" "A" "L" "A" "A" "L" "L" x <- c("no", "yes", "not registered", "no", "yes", "no answer") recodeVar(x, src = c("no", "yes"), tgt = c("0", "1"), default = NA)x <- c("dec", "jan", "feb", "mar", "apr", "may") src1 <- list(c("dec", "jan", "feb"), c("mar", "apr", "may")) tgt1 <- list("winter", "spring") recodeVar(x, src=src1, tgt=tgt1) #[1] "winter" "winter" "winter" "spring" "spring" "spring" x <- c(rep(1:3, 3)) #[1] 1 2 3 1 2 3 1 2 3 ## Simple usage: recodeVar(x, src=c(1, 2), tgt=c("A", "B")) #[1] "A" "B" NA "A" "B" NA "A" "B" NA ## Here we need to use lists recodeVar(x, src=list(c(1, 2)), tgt=list("A")) #[1] "A" "A" NA "A" "A" NA "A" "A" NA recodeVar(x, src=list(c(1, 2)), tgt=list("A"), default="L") #[1] "A" "A" "L" "A" "A" "L" "A" "A" "L" recodeVar(x, src=list(c(1, 2), 3), tgt=list("A", "B"), default="L") #[1] "A" "A" "B" "A" "A" "B" "A" "A" "B" ## Dealing with NA's in x x<-c(NA,rep(1:3, 3),NA) #[1] NA 1 2 3 1 2 3 1 2 3 NA recodeVar(x, src=list(c(1, 2)), tgt=list("A")) #[1] NA "A" "A" NA "A" "A" NA "A" "A" NA NA recodeVar(x, src=list(c(1, 2)), tgt=list("A"), default="L") #[1] NA "A" "A" "L" "A" "A" "L" "A" "A" "L" NA recodeVar(x, src=list(c(1, 2)), tgt=list("A"), default="L", keep.na=FALSE) #[1] "L" "A" "A" "L" "A" "A" "L" "A" "A" "L" "L" x <- c("no", "yes", "not registered", "no", "yes", "no answer") recodeVar(x, src = c("no", "yes"), tgt = c("0", "1"), default = NA)
Recover data from principal component analysis based on the first (typically few) components.
recover_pca_data(object, comp = 1)recover_pca_data(object, comp = 1)
object |
An object of class |
comp |
The number of components to be used. Must be smaller than the number of variables. |
A dataframe
crime <- doBy::crimeRate rownames(crime) <- crime$state crime$state <- NULL o <- order(apply(scale(crime), 1, sum)) dat <- crime[o,] head(dat) tail(dat) matplot(scale(dat), type="l") pc1 <- prcomp(dat, scale. = TRUE) summary(pc1) rec2 <- recover_pca_data(pc1, 2) pairs(rec2) par(mfrow=c(1,2)) matplot(scale(dat), type="l") matplot(scale(rec2), type="l") j <- merge(dat, rec2, by=0) pairs(j[,-1])crime <- doBy::crimeRate rownames(crime) <- crime$state crime$state <- NULL o <- order(apply(scale(crime), 1, sum)) dat <- crime[o,] head(dat) tail(dat) matplot(scale(dat), type="l") pc1 <- prcomp(dat, scale. = TRUE) summary(pc1) rec2 <- recover_pca_data(pc1, 2) pairs(rec2) par(mfrow=c(1,2)) matplot(scale(dat), type="l") matplot(scale(rec2), type="l") j <- merge(dat, rec2, by=0) pairs(j[,-1])
Rename columns in a matrix or a dataframe.
renameCol(indata, src, tgt)renameCol(indata, src, tgt)
indata |
A dataframe or a matrix |
src |
Source: Vector of names of columns in |
tgt |
Target: Vector with corresponding new names in the output. |
A dataframe if indata is a dataframe; a matrix in
indata is a matrix.
Søren Højsgaard, [email protected]
renameCol(CO2, 1:2, c("kk", "ll")) renameCol(CO2, c("Plant", "Type"), c("kk", "ll")) # These fail - as they should: # renameCol(CO2, c("Plant", "Type", "conc"), c("kk", "ll")) # renameCol(CO2, c("Plant", "Type", "Plant"), c("kk", "ll"))renameCol(CO2, 1:2, c("kk", "ll")) renameCol(CO2, c("Plant", "Type"), c("kk", "ll")) # These fail - as they should: # renameCol(CO2, c("Plant", "Type", "conc"), c("kk", "ll")) # renameCol(CO2, c("Plant", "Type", "Plant"), c("kk", "ll"))
Get response variable from model
response(object)response(object)
object |
lm or glm object |
data(cars) lm1 <- lm(dist ~ speed + I(speed^2), data=cars) lm1 |> response() |> head() cars <- cars |> add_pred(lm1) cars |> head() cars <- cars |> add_resid(lm1) carsdata(cars) lm1 <- lm(dist ~ speed + I(speed^2), data=cars) lm1 |> response() |> head() cars <- cars |> add_pred(lm1) cars |> head() cars <- cars |> add_resid(lm1) cars
Plot the response variable against the predictor variables.
response_plot( data., formula., geoms = NULL, global_aes = NULL, plot = TRUE, nrow = NULL, ncol = NULL )response_plot( data., formula., geoms = NULL, global_aes = NULL, plot = TRUE, nrow = NULL, ncol = NULL )
data. |
A data frame containing the variables in the formula. |
formula. |
A formula of the form y ~ x1 + x2 + ... + xn, where y is the response variable and x1, x2, ..., xn are the predictor variables. A dot as right hand side is allowed. |
geoms |
A list of ggplot2 geoms to be added to the plot. |
global_aes |
A list of global aesthetics to be added to the plot. |
plot |
A logical value indicating whether the plot should be displayed. |
nrow, ncol
|
Number of rows / columns in plot. |
A list of ggplot2 plots.
library(ggplot2) response_plot(iris, Sepal.Width ~ ., geoms=geom_point()) response_plot(iris, Sepal.Width ~ ., geoms=geom_point(), global_aes=list(color="Species")) personality |> response_plot(easygon~., geoms=geom_point(), global_aes=NULL)library(ggplot2) response_plot(iris, Sepal.Width ~ ., geoms=geom_point()) response_plot(iris, Sepal.Width ~ ., geoms=geom_point(), global_aes=list(color="Species")) personality |> response_plot(easygon~., geoms=geom_point(), global_aes=NULL)
Applies base::scale() to numeric, integer, or
logical columns in a data frame. Non-numeric columns are left
unchanged.
scale_df(x, center = TRUE, scale = TRUE)scale_df(x, center = TRUE, scale = TRUE)
x |
A data frame or matrix. |
center |
Logical; if TRUE, center the variables. |
scale |
Logical; if TRUE, scale the variables. |
If x is not a data frame, base::scale() is applied directly.
An object of the same class as x.
scale_df(iris)scale_df(iris)
Splits a data frame or matrix by grouping variables and scales numeric variables within each group.
scaleBy(formula, data = parent.frame(), center = TRUE, scale = TRUE) scale_by(data, formula, center = TRUE, scale = TRUE)scaleBy(formula, data = parent.frame(), center = TRUE, scale = TRUE) scale_by(data, formula, center = TRUE, scale = TRUE)
formula |
Grouping structure: a formula, character vector, or variables as |
data |
A data frame or matrix. |
center |
Logical; if TRUE, center the variables. |
scale |
Logical; if TRUE, scale the variables. |
A list of data frames or matrices (same class as input), one per group.
Søren Højsgaard, [email protected]
summaryBy, transformBy, orderBy
scale_by(iris, ~Species) scale_by(iris, ~1) ## Combine result into one data frame: a <- scale_by(iris, ~Species) d <- do.call(rbind, a) ## Old interface scaleBy(~Species, data = iris, center = TRUE, scale = FALSE) scaleBy(~1, data = iris)scale_by(iris, ~Species) scale_by(iris, ~1) ## Combine result into one data frame: a <- scale_by(iris, ~Species) d <- do.call(rbind, a) ## Old interface scaleBy(~Species, data = iris, center = TRUE, scale = FALSE) scaleBy(~1, data = iris)
Section a functions domain by fixing certain arguments of a function call.
section_fun(fun, nms, vls = NULL, method = "args") section_fun_sub(fun, nms, vls = NULL, envir = parent.frame()) section_fun_env(fun, nms, vls = NULL) get_section(object) get_fun(object)section_fun(fun, nms, vls = NULL, method = "args") section_fun_sub(fun, nms, vls = NULL, envir = parent.frame()) section_fun_env(fun, nms, vls = NULL) get_section(object) get_fun(object)
fun |
Function to be sectioned |
nms |
Either a named list of the form name=value where each
name is the name of an argument of the function (in which case
|
vls |
A vector or list of values of the arguments |
method |
One of the following: 1) "args" (default); based on substituting fixed values into the function argument list as default values). For backward compatibility can also be "def". 2) "body" for substituting fixed values into the function body. For backward compatibility can also be "sub". 3) "env": (for environment); using an auxillary argument for storing sectioned values. |
envir |
Environment |
object |
An object from section_fun (a scaffold object). |
Let E be a subset of the cartesian product X x Y where X and Y are some sets. Consider a function f(x,y) defined on E. Then for any x in X, the section of E defined by x (denoted Ex) is the set of $y$s in Y such that (x, y) is in E. Correspondingly, the section of f(x,y) defined by x is the function $f_x$ defined on Ex given by $f_x(y)=f(x,y)$.
section_fun is a wrapper for calling set_default (default
method), section_fun_env or section_fun_sub. Notice that
creating a sectioned function with section_fun_sub can be
time consuming.
A new function: The input function fun but with certain
arguments fixed at specific values.
Søren Højsgaard, [email protected] based on code adapted from the curry package.
f <- function(x, y){x + y} f_ <- section_fun(f, list(y = 10), method="args") ## "def"" is default f_ <- section_fun(f, nms="y", vls=10, method="args") ## SAME AS ABOVE f_ f_(x=1) f_ <- section_fun(f, list(y = 10), method="body") ## f_ <- section_fun(f, nms="y", vls=10, method="body") ## SAME AS ABOVE f_ f_(x=1) f_ <- section_fun(f, list(y = 10), method="env") f_ <- section_fun(f, nms="y", vls=10, method="env") ## SAME AS ABOVE f_ f_(x=1) get_section(f_) get_fun(f_) ## With more complicated values: g <- function(A, B) { A + B } g_ <- section_fun(g, list(A = matrix(1:4, nrow=2))) g_ <- section_fun(g, "A", list(matrix(1:4, nrow=2))) g_(diag(1, 2)) g_ <- section_fun(g, list(A = matrix(1:4, nrow=2))) ## Using built in function set.seed(123) rnorm5 <- section_fun(rnorm, list(n=5)) rnorm5(0, 1) set.seed(123) rnorm(5)f <- function(x, y){x + y} f_ <- section_fun(f, list(y = 10), method="args") ## "def"" is default f_ <- section_fun(f, nms="y", vls=10, method="args") ## SAME AS ABOVE f_ f_(x=1) f_ <- section_fun(f, list(y = 10), method="body") ## f_ <- section_fun(f, nms="y", vls=10, method="body") ## SAME AS ABOVE f_ f_(x=1) f_ <- section_fun(f, list(y = 10), method="env") f_ <- section_fun(f, nms="y", vls=10, method="env") ## SAME AS ABOVE f_ f_(x=1) get_section(f_) get_fun(f_) ## With more complicated values: g <- function(A, B) { A + B } g_ <- section_fun(g, list(A = matrix(1:4, nrow=2))) g_ <- section_fun(g, "A", list(matrix(1:4, nrow=2))) g_(diag(1, 2)) g_ <- section_fun(g, list(A = matrix(1:4, nrow=2))) ## Using built in function set.seed(123) rnorm5 <- section_fun(rnorm, list(n=5)) rnorm5(0, 1) set.seed(123) rnorm(5)
set_default() takes a function and returns a new version with
updated default values for specified arguments.
set_default(fun, nms, vls = NULL)set_default(fun, nms, vls = NULL)
fun |
A function whose arguments will get new default values. |
nms |
Character vector of argument names, or something coercible to that
(e.g. a call to |
vls |
Optional vector or list of values to use as defaults.
If |
This is useful when you want to programmatically create specialized versions of a function with certain arguments preset to default values.
The specified arguments will be moved to the end of the formal argument list in the returned function.
You can supply arguments as a named list or as separate names and values.
A new function with updated default values in its formals.
## Simple example f1 <- function(x, y, z) { x + y + z } ## Add defaults for x and y set_default(f1, list(x = 1, y = 2)) # function (z, x = 1, y = 2) { x + y + z } ## Same using separate vectors of names and values set_default(f1, c("x", "y"), c(1, 2)) ## Works with v() style if supported # set_default(f1, v(x, y), c(1, 2)) ## Another example with more arguments f2 <- function(a, b, c, d) { a + b + c + d } set_default(f2, list(b = 10, d = 5))## Simple example f1 <- function(x, y, z) { x + y + z } ## Add defaults for x and y set_default(f1, list(x = 1, y = 2)) # function (z, x = 1, y = 2) { x + y + z } ## Same using separate vectors of names and values set_default(f1, c("x", "y"), c(1, 2)) ## Works with v() style if supported # set_default(f1, v(x, y), c(1, 2)) ## Another example with more arguments f2 <- function(a, b, c, d) { a + b + c + d } set_default(f2, list(b = 10, d = 5))
Matrix representatation of list of vectors and vice versa
set_list2matrix(set_list, aggregate = FALSE) matrix2set_list(set_matrix)set_list2matrix(set_list, aggregate = FALSE) matrix2set_list(set_matrix)
set_list |
list of vectors |
aggregate |
should the vectors be aggregated |
set_matrix |
matrix representatation |
l <- list(c(1,2,3), c(3,2,4), c(3,2,4)) m1 <- set_list2matrix(l) m1 matrix2set_list(m1) m2 <- set_list2matrix(l, aggregate=TRUE) m2 matrix2set_list(m2)l <- list(c(1,2,3), c(3,2,4), c(3,2,4)) m1 <- set_list2matrix(l) m1 matrix2set_list(m1) m2 <- set_list2matrix(l, aggregate=TRUE) m2 matrix2set_list(m2)
Wear of soles of shoes of materials A and B for one foot each for of ten boys.
shoes shoes_longshoes shoes_long
This data frame contains:
Wear, material A
Wear, material B
Id of boy
The foot with material A
An object of class data.frame with 10 rows and 4 columns.
An object of class tbl_df (inherits from tbl, data.frame) with 20 rows and 4 columns.
The shoes data are measurements of the amount wear of the soles of shoes worn by 10 boys. The soles were made to two different synthetic materials, a standard material A and a cheaper material B.
Box, Hunter, Hunter (2005) Statistics for Experimenters, 2nd edition Wiley, p. 81.
Split matrix or dataframe into list by columns or by rows
split_bycol(x, idx = NULL, as.list = FALSE) split_byrow(x, idx = NULL)split_bycol(x, idx = NULL, as.list = FALSE) split_byrow(x, idx = NULL)
x |
Matrix or dataframe. |
idx |
Index to split by. If NULL, split by columns or rows. |
as.list |
If TRUE, return list of dataframes. If FALSE, return list of matrices. |
x <- mtcars[1:3, 1:6] x |> split_bycol() x |> split_bycol(as.list=TRUE) x |> split_bycol(as.list=FALSE) x |> split_bycol(idx=c(1,1,1,2,2,3,3,3)) ## x |> split_bycol(idx=c(1,1,7,2,2,3,3,3)) ## Gives error x <- mtcars[1:6, 1:6] x |> split_byrow() x |> split_byrow(idx=c(1,1,2,2)) m <- as.matrix(x) u <- x |> split_byrow(idx=c(1,1,2,2)) y <- m |> split_byrow(idx=c(1,1,2,2))x <- mtcars[1:3, 1:6] x |> split_bycol() x |> split_bycol(as.list=TRUE) x |> split_bycol(as.list=FALSE) x |> split_bycol(idx=c(1,1,1,2,2,3,3,3)) ## x |> split_bycol(idx=c(1,1,7,2,2,3,3,3)) ## Gives error x <- mtcars[1:6, 1:6] x |> split_byrow() x |> split_byrow(idx=c(1,1,2,2)) m <- as.matrix(x) u <- x |> split_byrow(idx=c(1,1,2,2)) y <- m |> split_byrow(idx=c(1,1,2,2))
Find sub-sequences of identical elements in a vector.
sub_seq(x, item = NULL) subSeq(x, item = NULL)sub_seq(x, item = NULL) subSeq(x, item = NULL)
x |
An atomic vector or a factor. |
item |
Optionally a specific value to look for in |
A dataframe.
Søren Højsgaard, [email protected]
x <- c(1, 1, 1, 0, 0, 1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 3) sub_seq(x) sub_seq(x, item=1)x <- c(1, 1, 1, 0, 0, 1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 3) sub_seq(x) sub_seq(x, item=1)
Returns Taylor polynomial approximating a function fn(x)
taylor(fn, x0, ord = 1)taylor(fn, x0, ord = 1)
fn |
A function of one variable and that variable must be named 'x'. |
x0 |
The point in which to to the Taylor expansion. |
ord |
The order of the Taylor expansion. |
function.
Søren Højsgaard, [email protected]
fn <- function(x) log(x) ord <- 2 x0 <- 2 xv <- seq(.2, 5, .1) plot(xv, fn(xv), type="l") lines(xv, taylor(fn, x0=x0, ord=ord)(xv), lty=2) abline(v=x0) fn <- function(x)sin(x) ord <- 4 x0 <- 0 xv <- seq(-2*pi, 2*pi, 0.1) plot(xv, fn(xv), type="l") lines(xv, taylor(fn, x0=x0, ord=ord)(xv), lty=2) abline(v=x0)fn <- function(x) log(x) ord <- 2 x0 <- 2 xv <- seq(.2, 5, .1) plot(xv, fn(xv), type="l") lines(xv, taylor(fn, x0=x0, ord=ord)(xv), lty=2) abline(v=x0) fn <- function(x)sin(x) ord <- 4 x0 <- 0 xv <- seq(-2*pi, 2*pi, 0.1) plot(xv, fn(xv), type="l") lines(xv, taylor(fn, x0=x0, ord=ord)(xv), lty=2) abline(v=x0)
Tidy summarizes information about the components of the object.
## S3 method for class 'esticon_class' tidy(x, conf.int = FALSE, conf.level = 0.95, ...)## S3 method for class 'esticon_class' tidy(x, conf.int = FALSE, conf.level = 0.95, ...)
x |
A 'esticon_class' object (produced by |
conf.int |
Should confidence intervals be added. |
conf.level |
Desired confidence level. |
... |
Additional arguments; currently not used. |
Tidy summarizes information about the components of the object.
## S3 method for class 'linest_class' tidy(x, conf.int = FALSE, conf.level = 0.95, ...)## S3 method for class 'linest_class' tidy(x, conf.int = FALSE, conf.level = 0.95, ...)
x |
A 'linest_class' object (produced by |
conf.int |
Should confidence intervals be added. |
conf.level |
Desired confidence level. |
... |
Additional arguments; currently not used. |
Calculates the time since the nearest event in a sequence, optionally using a custom time scale.
time_since_event(yvar, tvar = seq_along(yvar)) timeSinceEvent(...)time_since_event(yvar, tvar = seq_along(yvar)) timeSinceEvent(...)
yvar |
A numeric or logical vector indicating events. |
tvar |
An optional numeric vector specifying time values. Defaults to the index. |
... |
Arguments pased on to time_since_event |
Events are coded as 1 (or TRUE). Non-events are anything else. The result includes absolute and signed distances to events.
A data frame with columns 'yvar', 'tvar', 'abs.tse' (absolute time since event), 'sign.tse' (signed time since event), and other helper columns.
Søren Højsgaard, [email protected]
## Example 1: Basic usage with default time index y <- c(0, 0, 1, 0, 0, 1, 0) tse <- time_since_event(y) print(tse) ## Example 2: Custom (non-integer) time variable y <- c(0, 0, 1, 0, 0, 0, 1, 0) t <- seq(0.5, 3.5, length.out = length(y)) tse <- time_since_event(y, t) print(tse) ## Example 3: Plotting the signed time since event plot(sign.tse ~ tvar, data = tse, type = "b", main = "Signed time since event", xlab = "Time", ylab = "Signed time since event") grid() abline(h = 0, col = "red", lty = 2)## Example 1: Basic usage with default time index y <- c(0, 0, 1, 0, 0, 1, 0) tse <- time_since_event(y) print(tse) ## Example 2: Custom (non-integer) time variable y <- c(0, 0, 1, 0, 0, 0, 1, 0) t <- seq(0.5, 3.5, length.out = length(y)) tse <- time_since_event(y, t) print(tse) ## Example 3: Plotting the signed time since event plot(sign.tse ~ tvar, data = tse, type = "b", main = "Signed time since event", xlab = "Time", ylab = "Signed time since event") grid() abline(h = 0, col = "red", lty = 2)
transform_forecast() takes a univariate time series forecast fitted on a
transformed (model) scale and produces a new forecast object on the
original data scale. This is done by simulating future paths on the
model scale, transforming each path with a user-supplied function, and
then computing pointwise means and prediction intervals on the
transformed scale.
transform_forecast( fc, trans_fun, xreg_future = NULL, nsim = 1000, level = 95, y0 = 1 )transform_forecast( fc, trans_fun, xreg_future = NULL, nsim = 1000, level = 95, y0 = 1 )
fc |
A |
trans_fun |
A function of the form |
xreg_future |
Optional matrix or vector of future values for regressors in ARMAX model. |
nsim |
Integer; number of simulated future paths to use. Larger values give smoother prediction intervals but take longer to compute. |
level |
Numeric; prediction interval coverage in percent. |
y0 |
Numeric; starting value on the output scale used to
reconstruct the historical series from |
The function assumes that fc_object is a "forecast" object as
produced by the forecast package, and that fc_object$model
supports simulate() with arguments nsim and future.
The transformation function trans_fun must have the form
trans_fun(x, y0), where x is a numeric vector representing a path on
the model scale (for example log-values or percentage changes), and
y0 is a scalar "starting value" on the output scale. The function must
return a numeric vector of the same length as x giving the
corresponding path on the output scale.
Internally, transform_forecast() first reconstructs a historical series on
the output scale by applying trans_fun() to fc_object$x with the
supplied y0. It then simulates nsim future paths from
fc_object$model on the model scale, transforms each path to the
output scale using trans_fun() with y0 equal to the last value of
the reconstructed historical series, and finally computes the pointwise
mean and prediction intervals (of nominal coverage level) across
the simulated paths. The result is returned as a new "forecast" object
with x, mean, lower, and upper on the output scale.
A "forecast" object similar to fc_object, but with the components
x, mean, lower, and upper defined on the output (data) scale.
forecast, auto.arima,
simulate, ts.
## Example 1: Log-transform of the Canadian lynx data if (requireNamespace("forecast", quietly = TRUE)) { llynx <- log(lynx) fit_log <- forecast::auto.arima(llynx) fc_log <- forecast::forecast(fit_log, h = 20) forecast::autoplot(fc_log) ## transformation: log -> original scale trans_log <- function(z, y0) { exp(z) } fc_lynx <- transform_forecast(fc_log, trans_fun = trans_log, nsim = 20) forecast::autoplot(fc_lynx) + ggplot2::theme_minimal() } ## Not run: if (requireNamespace("forecast", quietly = TRUE)) { ## Example 2 (variation): CO2 series, log-transform lco2 <- log(co2) fit_co2 <- forecast::auto.arima(lco2) fc_log_co2 <- forecast::forecast(fit_co2, h = 24) forecast::autoplot(fc_log_co2) trans_log <- function(z, y0) exp(z) fc_co2 <- transform_forecast(fc_log_co2, trans_fun = trans_log, nsim = 30) forecast::autoplot(fc_co2) + ggplot2::theme_minimal() } # ## Example 3: Percentage change in income and consumption requireNamespace("fpp2", quietly = TRUE)) { income <- uschange[, "Income"] # quarterly percentage changes (%) consumption <- uschange[, "Consumption"] # quarterly percentage changes (%) forecast::checkresiduals(income) forecast::checkresiduals(consumption) trans_pct <- function(r, y0) { y0 * cumprod(1 + r / 100) } ## ARIMA model for income fit_income_pct <- forecast::auto.arima(income) fit_income_pct fc_income_pct <- forecast::forecast(fit_income_pct, h = 48) forecast::autoplot(fc_income_pct) fc_income <- transform_forecast(fc_income_pct, trans_fun = trans_pct, nsim = 200, level = 95, y0 = 1) forecast::autoplot(fc_income) + ggplot2::theme_minimal() ## ARIMA model for consumption fit_cons_pct0 <- forecast::auto.arima(consumption, seasonal = F) fit_cons_pct0 fc_cons_pct0 <- forecast::forecast(fit_cons_pct0, h = 48) fc_cons0 <- transform_forecast(fc_cons_pct0, trans_fun = trans_pct, nsim = 200, level = 95, y0 = 1) forecast::autoplot(fc_cons0) ## ARIMAX model for consumption with income as xreg fit_cons_pct <- forecast::auto.arima(consumption, xreg = income) fit_cons_pct fc_income_pct$mean |> head() fc_cons_pct <- forecast::forecast(fit_cons_pct, h = 48, xreg = fc_income_pct$mean) forecast::autoplot(fc_cons_pct) fc_cons <- transform_forecast(fc_cons_pct, trans_fun = trans_pct, xreg_future = fc_income$mean, nsim = 200, level = 95, y0 = 1) forecast::autoplot(fc_cons) #' } ## End(Not run)## Example 1: Log-transform of the Canadian lynx data if (requireNamespace("forecast", quietly = TRUE)) { llynx <- log(lynx) fit_log <- forecast::auto.arima(llynx) fc_log <- forecast::forecast(fit_log, h = 20) forecast::autoplot(fc_log) ## transformation: log -> original scale trans_log <- function(z, y0) { exp(z) } fc_lynx <- transform_forecast(fc_log, trans_fun = trans_log, nsim = 20) forecast::autoplot(fc_lynx) + ggplot2::theme_minimal() } ## Not run: if (requireNamespace("forecast", quietly = TRUE)) { ## Example 2 (variation): CO2 series, log-transform lco2 <- log(co2) fit_co2 <- forecast::auto.arima(lco2) fc_log_co2 <- forecast::forecast(fit_co2, h = 24) forecast::autoplot(fc_log_co2) trans_log <- function(z, y0) exp(z) fc_co2 <- transform_forecast(fc_log_co2, trans_fun = trans_log, nsim = 30) forecast::autoplot(fc_co2) + ggplot2::theme_minimal() } # ## Example 3: Percentage change in income and consumption requireNamespace("fpp2", quietly = TRUE)) { income <- uschange[, "Income"] # quarterly percentage changes (%) consumption <- uschange[, "Consumption"] # quarterly percentage changes (%) forecast::checkresiduals(income) forecast::checkresiduals(consumption) trans_pct <- function(r, y0) { y0 * cumprod(1 + r / 100) } ## ARIMA model for income fit_income_pct <- forecast::auto.arima(income) fit_income_pct fc_income_pct <- forecast::forecast(fit_income_pct, h = 48) forecast::autoplot(fc_income_pct) fc_income <- transform_forecast(fc_income_pct, trans_fun = trans_pct, nsim = 200, level = 95, y0 = 1) forecast::autoplot(fc_income) + ggplot2::theme_minimal() ## ARIMA model for consumption fit_cons_pct0 <- forecast::auto.arima(consumption, seasonal = F) fit_cons_pct0 fc_cons_pct0 <- forecast::forecast(fit_cons_pct0, h = 48) fc_cons0 <- transform_forecast(fc_cons_pct0, trans_fun = trans_pct, nsim = 200, level = 95, y0 = 1) forecast::autoplot(fc_cons0) ## ARIMAX model for consumption with income as xreg fit_cons_pct <- forecast::auto.arima(consumption, xreg = income) fit_cons_pct fc_income_pct$mean |> head() fc_cons_pct <- forecast::forecast(fit_cons_pct, h = 48, xreg = fc_income_pct$mean) forecast::autoplot(fc_cons_pct) fc_cons <- transform_forecast(fc_cons_pct, trans_fun = trans_pct, xreg_future = fc_income$mean, nsim = 200, level = 95, y0 = 1) forecast::autoplot(fc_cons) #' } ## End(Not run)
Truncate values in a matrix / vector to zero if they are below a certain threshold.
truncate0(x, tol = 0.6, sparse = TRUE)truncate0(x, tol = 0.6, sparse = TRUE)
x |
matrix / vector |
tol |
threshold |
sparse |
logical; if TRUE and |
A short and convenient alias for vparse(). Accepts unquoted names, character vectors, or a formula.
v(...)v(...)
... |
Variable input in any accepted |
A character vector of variable names
Check if variables exist in a data frame
vcheck(df, ...)vcheck(df, ...)
df |
A data frame |
... |
Variables to check |
TRUE if all variables exist, otherwise error
Apply a function to each parsed variable name
vmap(.vars, .f)vmap(.vars, .f)
.vars |
Variables to parse |
.f |
Function to apply to each name |
A list of results
These functions provide flexible tools for parsing and managing variable names
from formulas, unquoted names, or character vectors. Demonstrated using CO2 dataset.
vparse(...)vparse(...)
... |
Variable input (unquoted, character vector, or formula) |
Rename columns in a data frame
vrename(df, rename_map)vrename(df, rename_map)
df |
A data frame |
rename_map |
A named character vector (old_name = new_name) |
A data frame with renamed columns
Select columns from a data frame using flexible input
vselect(df, ...)vselect(df, ...)
df |
A data frame |
... |
Variable input (unquoted, character vector, or formula) |
A data frame with selected columns
Determines the locations, i.e., indices of the n largest or n smallest elements of a numeric vector.
which.maxn(x, n = 1)which.maxn(x, n = 1)
x |
numeric vector |
n |
integer >= 1 |
A vector of length at most n with the indices of the n largest / smaller elements. NAs are discarded and that can cause the vector to be smaller than n.
Søren Højsgaard, [email protected]
x <- c(1:4, 0:5, 11, NA, NA) ii <- which.minn(x, 5) x <- c(1, rep(NA,10), 2) ii <- which.minn(x, 5)x <- c(1:4, 0:5, 11, NA, NA) ii <- which.minn(x, 5) x <- c(1, rep(NA,10), 2) ii <- which.minn(x, 5)