Shuffling with constraints

Purpose of the vignette

This example demonstrates that by using customized shuffling functions, it is possible to restrain the design optimization to only score sample arrangements that fit the given constraints.

Key idea is that every reshuffling produces a ‘valid’ sample permutation that is not violating those constraints, even if the suggested solution may be quite bad. During optimization, we pick the best design solution from the possible ones by appropriate scoring.

The user is free to implement custom shuffling functions and pass those to the optimizer. However, some knowledge is required regarding the internal workings of the optimization and batch container setup. Therefore the package provides a little generic constructor for customized shufflings shuffle_grouped_data in which certain types of constraints that relate to grouped samples may be specified by the passed parameters.

The design problem

Samples and treatments

We refer to a simplified version of the in vivo example which is examined deeper in a dedicated vignette.

data("invivo_study_samples")

invivo_study_samples <- dplyr::mutate(invivo_study_samples,
  Litter_combine_females = ifelse(Sex == "F", "female_all", Litter)
)
str(invivo_study_samples)
#> 'data.frame':    59 obs. of  9 variables:
#>  $ AnimalID              : chr  "F1" "F2" "F3" "F4" ...
#>  $ Strain                : chr  "Strain B" "Strain B" "Strain B" "Strain B" ...
#>  $ Sex                   : chr  "F" "F" "F" "F" ...
#>  $ BirthDate             : Date, format: "2021-05-24" "2021-03-01" ...
#>  $ Earmark               : chr  "R" "2L" "2L1R" "L" ...
#>  $ ArrivalWeight         : num  19.4 26.5 20.8 22.1 22.9 ...
#>  $ Arrival.weight.Unit   : chr  "g" "g" "g" "g" ...
#>  $ Litter                : chr  "Litter 1" "Litter 2" "Litter 2" "Litter 2" ...
#>  $ Litter_combine_females: chr  "female_all" "female_all" "female_all" "female_all" ...

invivo_study_samples |>
  dplyr::count(Strain, Sex, Litter_combine_females) |>
  gt::gt()
Strain Sex Litter_combine_females n
Strain A F female_all 7
Strain A M Litter 10 3
Strain A M Litter 11 4
Strain A M Litter 12 5
Strain A M Litter 13 4
Strain A M Litter 14 6
Strain B F female_all 7
Strain B M Litter 1 3
Strain B M Litter 3 5
Strain B M Litter 4 4
Strain B M Litter 5 4
Strain B M Litter 6 4
Strain B M Litter 7 3

We will use the litter as a factor to form cages in our design. However, in order to indicate the compatibility of female animals (see in vivo study vignette), a pseudo-litter female_all is created here to group all the females together, marking them as interchangeable for the subgroup (i.e. cage) allocation.

In the simplified setup we want to assign two treatments to those animals, balancing for strain and sex as the primary suspected confounders. The batch container is prepared as follows:

treatments <- factor(rep(c("Treatment A", "Treatment B"), c(30, 29)))
table(treatments)
#> treatments
#> Treatment A Treatment B 
#>          30          29

bc <- BatchContainer$new(locations_table = data.frame(Treatment = treatments, Position = seq_along(treatments)))

bc <- assign_in_order(bc, invivo_study_samples)

scoring_f <- osat_score_generator(batch_vars = "Treatment", feature_vars = c("Strain", "Sex"))

bc
#> Batch container with 59 locations and 59 samples (assigned).
#>   Dimensions: Treatment, Position

Doing it all in one go

The wrapper shuffle_grouped_data allows to construct a shuffling function that satisfies all constraints defined above at the same time. It can be passed to the optimizer together with other user defined options such as the scoring or acceptance functions.

bc2 <- optimize_design(
  bc,
  scoring = scoring_f,
  shuffle_proposal_func = shuffle_grouped_data(bc,
    allocate_var = "Treatment",
    keep_together_vars = c("Strain", "Sex"),
    keep_separate_vars = c("Earmark"),
    subgroup_var_name = "Cage",
    n_min = 2, n_ideal = 3, n_max = 5,
    strict = TRUE,
    report_grouping_as_attribute = TRUE
  ),
  max_iter = 600
)
#> 
#> Formed 4 homogeneous groups using 59 samples.
#> 22 subgroups needed to satisfy size constraints.
#> 
#> Finding possible ways to allocate variable of interest with 2 levels ...
#> 
#> Aborted after 1000010 recursive calls (maxCalls limit).
#> 176364 allocations found.
#> Usage of sample attributes --> executing first shuffling call.
#> Checking variances of 1-dim. score vector.
#> ... (234.938) - OK
#> Initial score: 453.202
#> Adding 4 attributes to samples.
#> Achieved score: 355.575 at iteration 2
#> No permutations fulfilling the 'keep_separate' constraints in 1000 iters!
#> Increasing number of tolerated violations to 1
#> Achieved score: 305.134 at iteration 19
#> Achieved score: 272.592 at iteration 20
#> Achieved score: 231.507 at iteration 34
#> Achieved score: 181.473 at iteration 52
#> Achieved score: 143.032 at iteration 116
#> Achieved score: 122.49 at iteration 123
#> Achieved score: 105.405 at iteration 165
#> Achieved score: 104.999 at iteration 356
#> Achieved score: 79.371 at iteration 384
#> Achieved score: 52.931 at iteration 525
#> Achieved score: 44.388 at iteration 546

design <- bc2$get_samples()

allocate_var is the batch container variable that should be primarily assigned to individual samples.

keep_together_vars is a list of variables that must be homogeneous within a subgroup (here: cage).

keep_separate_vars lists variables which should have different values within a subgroup (here: cage), if at all possible. This is a soft constraint and will be relaxed in a stepwise way until solutions can be found.

subgroup_var_name allows to give the generated subgroup variable a useful name.

n_min, n_max and n_ideal specify the minimal, maximal and ideal group sizes, respectively. It is often necessary to release the strict criterion to find any solution at all that satisfies those size criteria.

report_grouping_as_attribute allows, if TRUE, to add the updated group variable into the batch container at each iteration, so that scoring functions could make use of this variable (here: cage)!

Following the output of the optimizer, we see that a solution was identified that satisfies all constraints, with the exception of tolerating one violation of earmark-uniqueness within a cage.

The following cages (homogeneous in strain, sex and treatment) have been generated in the process:

design |>
  dplyr::count(Cage, Strain, Sex, Treatment) |>
  gt::gt()
Cage Strain Sex Treatment n
1_1 Strain A F Treatment A 3
1_2 Strain A F Treatment A 2
1_3 Strain A F Treatment A 2
2_1 Strain A M Treatment A 3
2_2 Strain A M Treatment A 3
2_3 Strain A M Treatment A 3
2_4 Strain A M Treatment A 3
2_5 Strain A M Treatment B 3
2_6 Strain A M Treatment B 3
2_7 Strain A M Treatment B 2
2_8 Strain A M Treatment B 2
3_1 Strain B F Treatment B 3
3_2 Strain B F Treatment A 2
3_3 Strain B F Treatment B 2
4_1 Strain B M Treatment A 3
4_2 Strain B M Treatment A 3
4_3 Strain B M Treatment A 3
4_4 Strain B M Treatment B 3
4_5 Strain B M Treatment B 3
4_6 Strain B M Treatment B 3
4_7 Strain B M Treatment B 3
4_8 Strain B M Treatment B 2

Multiple step approach

shuffle_grouped_data is a wrapper that consecutively calls other helper function. As an addendum, let us break the whole procedure down into parts that show what is happening internally at each step.

Form homogeneous subgroups - pools of animals that could go into one cage

We have to divide our animal cohort into subgroups with same strain and sex, meeting size constraints as stated above. Since 2-5 animals should go into one cage, we specify n_minand n_maxaccordingly. n_ideal would be selected by default as the mean of those two, but we specify it explicitly here, too.

The homogeneity of subgroups regarding strain and sex is achieved by listing those two parameters as keep_together_vars.

Assignment of treatments should be performed as well at some point. We thus specify Treatment as the allocation variable.

Note that the Treatment variable is technically a batch container location and not a part of the sample list. This distinction does not matter at this point. However, all required variables must exist in the batch container object.

The following call to form_homogeneous_subgroups() produces an object that holds all relevant information about the samples, the allocation variable and the sizes of the subgroups that have to be formed. It is NOT decided, however, which animal will end up in which subgroup. This will be a matter of optimization later on.

subg <- form_homogeneous_subgroups(
  batch_container = bc, allocate_var = "Treatment",
  keep_together_vars = c("Strain", "Sex", "Litter_combine_females"),
  subgroup_var_name = "Cage",
  n_min = 2, n_ideal = 3, n_max = 5
)
#> 
#> Formed 13 homogeneous groups using 59 samples.
#> 18 subgroups needed to satisfy size constraints.

In this example, 18 subgroups have to be formed to meet all constraints.

It is possible to obtain more information from the returned list object. Inspection of element Subgroup_Sizes tells us that 13 ‘animal pools’ have to be formed which are homogeneous in the relevant parameters (here: strain and sex). Each of those groups happens to be split in subgroups with a size between 2 and 4 animals , which will later constitute the individual cages.

subg$Subgroup_Sizes
#> $`Strain A/F/female_all`
#> [1] 3 4
#> 
#> $`Strain A/M/Litter 10`
#> [1] 3
#> 
#> $`Strain A/M/Litter 11`
#> [1] 4
#> 
#> $`Strain A/M/Litter 12`
#> [1] 3 2
#> 
#> $`Strain A/M/Litter 13`
#> [1] 4
#> 
#> $`Strain A/M/Litter 14`
#> [1] 3 3
#> 
#> $`Strain B/F/female_all`
#> [1] 3 4
#> 
#> $`Strain B/M/Litter 1`
#> [1] 3
#> 
#> $`Strain B/M/Litter 3`
#> [1] 3 2
#> 
#> $`Strain B/M/Litter 4`
#> [1] 4
#> 
#> $`Strain B/M/Litter 5`
#> [1] 4
#> 
#> $`Strain B/M/Litter 6`
#> [1] 4
#> 
#> $`Strain B/M/Litter 7`
#> [1] 3

Find all valid ways to allocate treatments to the subgroups

Each subgroup of animals receives one specific treatment. Or more generally: subgroups have to be homogeneous regarding the allocation variable.

This introduces another type of constraint, since numbers have to add up to 10 ‘Control’ and 10 ‘Compound’ cases, as given by the treatments variable. As a next step, we have to find all possible combinations of subgroups which produce valid options for the treatment allocation. That’s done with the next call.

This will find a large number of different ways to assign treatments to subgroups that lead to the correct overall number of treated animals.

possible <- compile_possible_subgroup_allocation(subg)
#> 
#> Finding possible ways to allocate variable of interest with 2 levels ...
#> 
#> Finished with 108296 recursive calls.
#> 14660 allocations found.

Generate shuffling function for potential study designs

So far we only know the sizes of subgroups (i.e. cages). Thus, in a last step we have to assign specific animals to the various subgroups. Ideally each group of ‘equivalent animals’ (in terms of strain and sex) is split up into more than one subgroup, so there’s many potential ways to assign animals to those.

To allow optimization as usual, we want to generate a shuffling function that produces only valid solutions in terms of our constraints, so that optimization can iterate efficiently over this solution space. The function can be generated by calling shuffle_with_subgroup_formation() with the previously created subgrouping object and the list of possible treatment allocations.

Every call to this shuffling function will return a permutation index (of the original samples) that constitutes a valid solution to be scored.

The permutation function actually also constructs a ‘Cage’ variable (see parameter subgroup_var_name in the call to form_homogeneous_subgroups()). To make this parameter available and join it to the samples in the batch container, use flag report_grouping_as_attribute in the construction of the permutation function.

shuffle_proposal <- shuffle_with_subgroup_formation(subg, possible, report_grouping_as_attribute = TRUE)

shuffle_proposal()
#> $location_assignment
#>  [1]  9 10 13  8 11 12 14 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
#> [26] 59  2  3  4  5 56 57 58  1  6  7 35 36 37 19 20 21 26 27 15 16 17 18 22 23
#> [51] 24 25 28 29 30 31 32 33 34
#> 
#> $samples_attr
#> # A tibble: 59 × 4
#>    alloc_var_level group subgroup Cage 
#>    <chr>           <int>    <int> <chr>
#>  1 Treatment B         7        1 7_1  
#>  2 Treatment A         7        2 7_2  
#>  3 Treatment A         7        2 7_2  
#>  4 Treatment A         7        2 7_2  
#>  5 Treatment A         7        2 7_2  
#>  6 Treatment B         7        1 7_1  
#>  7 Treatment B         7        1 7_1  
#>  8 Treatment A         1        2 1_2  
#>  9 Treatment A         1        1 1_1  
#> 10 Treatment A         1        1 1_1  
#> # ℹ 49 more rows

Calling the shuffle proposal function repeatedly produces a valid (constraint-aware) sample arrangement each time, with the grouping variable (here: Cage) reported alongside. (The optimizer will merge the ‘Cage’ variable into the batch container after each iteration, so that it can be used for scoring as if it would have been in the container from the beginning!)

Use shuffling function for optimizing design

We can finally use the customized shuffling function in the optimization.

bc3 <- optimize_design(
  bc,
  scoring = scoring_f,
  shuffle_proposal_func = shuffle_proposal,
  max_iter = 300
)
#> Usage of sample attributes --> executing first shuffling call.
#> Checking variances of 1-dim. score vector.
#> ... (378.225) - OK
#> Initial score: 289.541
#> Adding 4 attributes to samples.
#> Achieved score: 189.066 at iteration 9
#> Achieved score: 139.439 at iteration 15
#> Achieved score: 105.405 at iteration 34
#> Achieved score: 52.931 at iteration 55
#> Achieved score: 51.304 at iteration 70
#> Achieved score: 32.863 at iteration 206

design <- bc3$get_samples()

# Obeying all constraints does not lead to a very balanced sample allocation:
dplyr::count(design, Treatment, Strain) |> gt::gt()
Treatment Strain n
Treatment A Strain A 17
Treatment A Strain B 13
Treatment B Strain A 12
Treatment B Strain B 17

dplyr::count(design, Treatment, Sex) |> gt::gt()
Treatment Sex n
Treatment A F 10
Treatment A M 20
Treatment B F 4
Treatment B M 25