Imagine that one diatom assemblage has 20 species and another has 25, but the more species rich assemblage also has more diatoms counted.
A fair comparison of the richness of each assemblage can only be made when the number of individuals is constant.
In this case, we can use the function rarefy
from the vegan
package. Other analyses might not be possible to run directly, so you need re-sampled counts with a specified count sum. I’ve been doing this recently to find the how the prevalence of singletons (species only occurring once) varies with count sum.
There are several ways to resample counts. A key question is whether you want to sample with or without replacement.
Sampling without replacement asks what the assemblage you currently have might have been when the count sum was smaller. If you sample as many individuals as you actually have, your resampled assemblage is identical to your current assemblage and you cannot sample more individuals than you have.
Sampling with replacement asks what the assemblage you currently have might look like if you counted another assemblage from the same material.
In principle, you can simulate larger assemblages than you currently have, though this is probably always a bad idea. Species richness will be biased low, as rare taxa present in the community behind the veil line cannot occur in the resampled assemblages but could occur if another diatom slide was counted.
I’m going to resample the vegan::BCI
dataset of tree counts in plots on Barro Colorado Island.
data(BCI, package = "vegan")
First a tidyverse solution.
BCI %>%
# put the rownames into a column called "plot"
rownames_to_column(var = "plot") %>%
# convert the data from a wide to a long format
pivot_longer(-plot, names_to = "species", values_to = "n") %>%
# uncount does the opposite of count - giving 2 rows for a species that had a count of 2.
uncount(n) %>%
group_by(plot) %>%
# randomly sample, with replacement, 100 individuals (rows) per plot
slice_sample(n = 100, replace = TRUE) %>%
# count the number of individuals per species per plot
count(species)
## # A tibble: 2,128 × 3
## # Groups: plot [50]
## plot species n
## <chr> <chr> <int>
## 1 1 Alseis.blackiana 6
## 2 1 Annona.spraguei 1
## 3 1 Apeiba.glabra 2
## 4 1 Astronium.graveolens 3
## 5 1 Brosimum.alicastrum 1
## 6 1 Casearia.sylvestris 2
## 7 1 Cassipourea.guianensis 1
## 8 1 Cecropia.insignis 1
## 9 1 Chrysophyllum.argenteum 1
## 10 1 Cordia.bicolor 3
## # … with 2,118 more rows
The tidyverse solution might not be the fastest (see below to find out) as it is doing a lot of extra work. So let’s write a function that has the same essential steps of uncounting the counts, sampling them, and the recounting them.
resample_1 <- function(counts, size, replace = FALSE){
#counts is species count data to resample,
#size is target count sum
#replace or not
#grab species names (if present)
id <- names(counts)
if(is.null(id)){
id <- 1:length(count)
}
#uncount
count2 <- rep(id, times = counts)
# sample
samp <- sample(count2, size = size, replace = replace)
# count again
tab <- table(samp)
# fill missing taxa with zero
result <- rep(0, length(counts))
names(result) <- id
result[names(tab)] <- tab
#return result - a named vector
result
}
#first BCI plot
bci1 <- unlist(BCI[1, ])
resample_1(count = bci1, size = 100, replace = TRUE)[1:20]
## Abarema.macradenia Vachellia.melanoceras Acalypha.diversifolia
## 0 0 0
## Acalypha.macrostachya Adelia.triloba Aegiphila.panamensis
## 0 0 0
## Alchornea.costaricensis Alchornea.latifolia Alibertia.edulis
## 0 0 0
## Allophylus.psilospermus Alseis.blackiana Amaioua.corymbosa
## 0 9 0
## Anacardium.excelsum Andira.inermis Annona.spraguei
## 0 0 0
## Apeiba.glabra Apeiba.tibourbou Aspidosperma.desmanthum
## 1 0 0
## Astrocaryum.standleyanum Astronium.graveolens
## 0 1
There are also R functions that do the same jobs.
Resampling with replacement gives a multinomial distribution; for which we can also use stats::rmultinom
. This takes the argument prob
for the species data. The data are standardised to sum to one, so counts, percent or proportions could be used. n
is the number of resamples to make. The return object is a matrix with n
columns and one species per row.
rmultinom(n = 1, size = 100, prob = bci1)[1:20, ]
## Abarema.macradenia Vachellia.melanoceras Acalypha.diversifolia
## 0 0 0
## Acalypha.macrostachya Adelia.triloba Aegiphila.panamensis
## 0 0 0
## Alchornea.costaricensis Alchornea.latifolia Alibertia.edulis
## 0 0 0
## Allophylus.psilospermus Alseis.blackiana Amaioua.corymbosa
## 0 5 0
## Anacardium.excelsum Andira.inermis Annona.spraguei
## 0 0 0
## Apeiba.glabra Apeiba.tibourbou Aspidosperma.desmanthum
## 3 0 0
## Astrocaryum.standleyanum Astronium.graveolens
## 0 1
Resampling without replacement follows the multivariate hypergeometric distribution, which can be fitted with rmvhyper
from package extraDistr
. The arguments have different names, nn
is the number of resamples, k
is size
and n
is the species data to resample. The output is a matrix with nn
rows and one species per column – the transpose of rmultinom
.
library(extraDistr)
rmvhyper(nn = 1, k = 50, n = bci1)[, 1:20]
## [1] 0 0 0 0 0 0 1 0 0 0 3 0 0 0 0 3 0 0 0 0
As I often resample many assemblages many times, it is useful to know which solution is fastest.
library(microbenchmark)
size <- 100
# preprocess for tidyverse
BCI1 <- BCI %>%
rownames_to_column(var = "plot") %>%
slice(1) %>%
pivot_longer(-plot, names_to = "species", values_to = "n")
mb1 <- microbenchmark(
tidyverse = BCI1 %>%
uncount(n) %>%
group_by(plot) %>%
slice_sample(n = 100, replace = TRUE) %>%
count(species),
resample_wo = resample_1(count = bci1, size = size, replace = FALSE),
resample_w = resample_1(count = bci1, size = size, replace = TRUE),
rmultinom = rmultinom(n = 1, size = size, prob = bci1),
rmvhyper = rmvhyper(nn = 1, k = size, n = bci1)
)
autoplot(mb1)

As expected, the tidyverse solution is much slower than the other solutions. mvhyper
and rmultinom
are much faster than the home-made solution. I don’t know why mvhyper
is slower than rmultinom
.
I often need to resample an assemblage many times. mvhyper
and rmultinom
have this functionality built-in. The home-grown function can be extended to make multiple resamples.
resample_n <- function(count, size, n = 1, replace = FALSE){
id <- names(count)
if(is.null(id)){
id <- 1:length(count)
}
count2 <- rep(id, times = count)
replicate(n = n, {
samp <- sample(count2, size = size, replace = replace)
tab <- table(samp)
result <- rep(0, length(count))
names(result) <- id
result[names(tab)] <- tab
result
})
}
And again, we can test how fast the functions are.
mb_nrep <- microbenchmark(
resample_wo_1 = resample_n(count = bci1, n = 1, size = size, replace = FALSE),
resample_wo_10 = resample_n(count = bci1, n = 10, size = size, replace = FALSE),
resample_wo_100 = resample_n(count = bci1, n = 100, size = size, replace = FALSE),
rmultinom_1 = rmultinom(n = 1, size = size, prob = bci1),
rmultinom_10 = rmultinom(n = 10, size = size, prob = bci1),
rmultinom_100 = rmultinom(n = 100, size = size, prob = bci1),
resample_w_1 = resample_n(count = bci1, n = 1, size = size, replace = TRUE),
resample_w_10 = resample_n(count = bci1, n = 10, size = size, replace = TRUE),
resample_w_100 = resample_n(count = bci1, n = 100, size = size, replace = TRUE),
rmvhyper_1 = rmvhyper(nn = 1, k = size, n = bci1),
rmvhyper_10 = rmvhyper(nn = 10, k = size, n = bci1),
rmvhyper_100 = rmvhyper(nn = 100, k = size, n = bci1)
)
mb_nrep %>%
print() %>%
mutate(n = rep(c(1, 10, 100), times = 4),
expr = str_remove(expr, "_\\d{1,3}")) %>%
ggplot(aes(x = n, y = mean/n, colour = expr)) +
geom_line() +
geom_point() +
scale_x_log10() +
scale_y_log10() +
labs(x = "Number of draws", y = "Mean time per resample, microseconds", colour = "Method")

The home-made function has less good economy-of-scale than the other methods. I think I have some projects that need their code updating.
One last problem is how to resample all the diatom assemblages in a core. We need to iterate over each row in the data set (assuming it is in wide format – also possible to do this in long format). If we want a single resample, we can use apply
.
m1 <- apply(BCI, 1, resample_1, size = 100)
m2 <- apply(BCI, 1, rmultinom, size = 100, n = 1)
m3 <- apply(BCI, 1, rmvhyper, k = 100, nn = 1)
Of course, it is also possible to run multiple resamples at once, but as this returns a matrix, apply
won’t work well. Instead, we can use plyr::aaply
(don’t load plyr
as it will conflict with dplyr
). This will return an array (the rmvhyper
result is transposed).
#convert the data to a matrix or problems
mm1 <- plyr::aaply(.data = as.matrix(BCI), .margins = 1, .fun = resample_n, n = 2, size = 100 )
dim(mm1)
## [1] 50 225 2
mm2 <- plyr::aaply(.data = as.matrix(BCI), .margins = 1, .fun = rmultinom, n = 2, size = 100 )
dim(mm2)
## [1] 50 225 2
mm3 <- plyr::aaply(.data = as.matrix(BCI), .margins = 1, .fun = rmvhyper, nn = 2, k = 100 )
dim(mm3)
## [1] 50 2 225
Coming soon, some analyses using lots of resampled assemblages.