Standalone use of “Stata” for analysis of cluster randomized controlled trials (cluster RCTs)

Well friends, I am going to demystify the various statistical bottlenecks of cluster (or group) RCTs with the help of “Stata” software.

To start with, as you are all aware, the RCT is the benchmark for research, particularly in the medical field, where individuals or study subjects are simply randomized into intervention and control groups. In contrast, a cluster RCT entails randomizing clusters of individuals into different groups. These clusters may be schools, hospitals, villages, factories, farms and so on. Though the intervention is targeted at individuals in both types of trials, randomization and implementation of the intervention take place at the cluster level in a cluster RCT. Cluster RCTs have increased rapidly in the last 2-3 decades and are a frontline strategy in public health interventions.


These three pioneers laid down the analytical framework for cluster RCTs and have authored the best known books on the subject. Indeed, it is because of their efforts that research in this field has become really challenging and also more precise for public health experts. But thanks to Stata, almost all these challenges can be met easily. With this premise, I will share with you my own experience of conducting and analyzing a cluster RCT published in Asian Pacific J of Public Health.

This was a community based cluster RCT

Just a brief overview of our trial schematic and flow chart


We took 32 different villages and randomized them equally into intervention and control groups. Each group had 16 villages enrolling anemic women (i.e. women with Hb below 12 gm%). Our basic aim was to treat these anemic women with iron tablets for a consecutive period of 90 days. However, we followed different strategies of administering the iron tablets. In intervention clusters, anemic women were given home based, directly supervised iron tablets daily, whereas in control clusters, iron tablets were distributed through clinics once a month to be taken unsupervised by the anemic women. We had two key outcomes. The first was a qualitative outcome of anemia, where we aimed to decrease the prevalence of anemia in intervention clusters as compared to control clusters. The second was a quantitative outcome of Hb values, where we compared the mean Hb rise between the two study groups. After completion of the trial, intervention clusters reported more than 50% relative reduction in anemia prevalence, and mean Hb was 1 gm% higher as compared to control group subjects.

In the backdrop of our trial, I am going to decode 4 important analytical issues lurking underneath our cluster RCT.

As in any research, first and foremost is sample size calculation. Fortunately, Stata users have a user-written ado-program called “clustersampsi” which can easily provide the sample size for a cluster RCT comparing two means, two proportions or two event rates. Interestingly, this Stata program comes with a dialog box, so menu driven commands can be used as well.

The second and key bottleneck is the ICC, which stands for intracluster correlation coefficient. In simple terms, the ICC denotes the variability of outcome response between different clusters relative to the total variability. In my view, estimating the ICC for quantitative as well as qualitative outcomes using the loneway command (see commands below) is a key strength of Stata, particularly in comparison to SPSS.

The next part is analyzing the primary outcome at the level of clusters, where all individual values in each cluster are summarized into a single summary measure and then compared using garden variety parametric and nonparametric tests. Interestingly, “Stata” offers a very useful resampling technique over and above these parametric and nonparametric tests, called the permutation test, which adds further validity to cluster level analysis because it does not rely on distributional assumptions.

Finally, “Stata” has a rich armamentarium of regression modeling techniques which can carry out individual level analysis while adjusting easily for clustering of study subjects.

Let us first unmask this arcane entity of ICC. Statistically, ICC is a metric which captures the variability of primary outcome response between various clusters versus within the clusters and is estimated by the following formula.
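In its usual textbook form, which matches the numerator and denominator discussed in this post, the ICC is the share of total outcome variance that lies between clusters. The symbols are my own notation: \sigma_b^2 is the between-cluster variance and \sigma_w^2 is the within-cluster variance.

```latex
\rho = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2}
```

With no between-cluster variance the ratio is 0, and with no within-cluster variance it is 1.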

For the ease of understanding, let us look at this picture below.


Suppose these eggs are the study subjects, distributed in multiple clusters. If one looks carefully, the subjects in any single cluster are similar to the subjects in all other clusters; that is, the primary outcome response is similarly distributed across all clusters. Hence, there will be no between-cluster variability, making the numerator in the above formula zero, which in turn yields an ICC of zero. On the contrary, in this picture, study subjects differ markedly between the clusters but their primary outcome response is uniformly the same within each cluster. This reduces the within-cluster variability to zero which, when substituted in the denominator of the above formula, leads to the maximum possible value of ICC (one). Why is the ICC important in cluster RCTs? Ignorance of the ICC or failure to account for it is statistically fatal for two reasons:

First, it will lead to too small a sample size and therefore an underpowered cluster RCT (higher beta error).

Second, it will underestimate standard errors of all estimates leading to spuriously low P-values and higher alpha error.

This slide displays the dialog box as well as the output of ICC for a quantitative variable. The command for ICC in Stata is very simple: just write loneway, then enter your outcome variable followed by the cluster variable and press Enter.

ICC for quantitative outcome: loneway Hb village

ICC for qualitative outcome: loneway anemia village
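As a small sketch of how these two commands might be used in a do-file, the stored results of loneway can be displayed after each run. The variable names Hb, anemia and village are those used above; I am assuming anemia is coded 0/1.

```stata
* ICC for the quantitative outcome, with its 95% confidence limits
loneway Hb village
display "ICC for Hb = " %5.3f r(rho) " (95% CI " %5.3f r(lb) " to " %5.3f r(ub) ")"

* ICC for the binary outcome (assumed coded 0 = not anemic, 1 = anemic)
loneway anemia village
display "ICC for anemia = " %5.3f r(rho)
```

Here r(rho), r(lb) and r(ub) are the results that loneway leaves behind; typing return list after the command shows everything it stores.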

Armed with the concept of ICC, let us quickly see the sample size calculation for our cluster RCT. As far as a simple RCT is concerned, we need the minimal desirable outcome difference between the two randomized groups (this may be a difference between two means, two proportions or two event rates) along with their standard deviations. Then we decide a priori the cut off limits for alpha and beta errors. For a simple RCT, this is almost enough: feed these to sample size estimation in Stata using the ‘ssi’ command and it will yield a sample size for a simple RCT. However, in a cluster RCT, we need two more parameters, that is, the ICC and the average cluster size. Once these parameters are decided, write the following command in Stata and your sample size is ready. You can easily appreciate that for the same difference in two proportions, the sample size in each trial arm of a cluster RCT is nearly 4 times that of a simple RCT.
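The inflation from a simple RCT to a cluster RCT comes from the design effect, 1 + (m - 1) × ICC, where m is the average cluster size. The sketch below uses purely hypothetical planning values (they are not the values from our trial), chosen only so that the design effect works out to 4. The clustersampsi options shown follow my reading of its help file, so please check help clustersampsi before relying on them.

```stata
* Hypothetical planning values, for illustration only
scalar rho = 0.10
scalar m = 31
* Design effect = 1 + (m - 1)*rho = 1 + 30*0.10 = 4
scalar deff = 1 + (m - 1)*rho
display "Design effect = " deff

* clustersampsi is user-written; install once with: ssc install clustersampsi
* p1() and p2() are placeholder proportions, not the trial's actual figures
clustersampsi, binomial samplesize p1(0.50) p2(0.30) rho(0.10) m(31)
```

With these assumed inputs, the sample size per arm from a simple RCT calculation would need to be multiplied by 4.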

The next issue is analysis of the cluster RCT. There are two basic approaches.

The first approach is analysis at the level of clusters. Let me clarify this by a pictorial demonstration. We have intervention clusters as well as control clusters showing values of the primary outcome for the various individual study subjects in them. Here, we first summarize the outcome values of all individuals within each cluster into one single summary value, say a mean or median. This step can be easily accomplished in Stata by the collapse command, where you merge all individual values of the outcome (Hb in our example) into a single summary measure of mean for each cluster separately.
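A minimal sketch of this collapsing step is given below. It assumes individual level data with the variables Hb and village used earlier, plus a hypothetical trial arm variable called group (0 = control, 1 = intervention).

```stata
* Keep a copy of the individual level data in memory
preserve

* One row per village: mean Hb, carrying the trial arm along
collapse (mean) Hb, by(village group)

* The number of rows should now equal the number of villages
count
list village group Hb in 1/5

* Return to the individual level data
restore
```

Replacing (mean) with (median) would give cluster medians instead.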


After collapsing, the number of observations becomes exactly equal to the number of clusters. Since there is only one value for each cluster, the issue of dependency or clustering of outcome data becomes redundant and hence such summary level data can be analyzed by simple parametric or nonparametric tests. To further strengthen these parametric or nonparametric tests, statisticians often recommend resampling techniques such as the permutation test for cluster level analysis. Here, basically, the intervention and control labels are reshuffled (permuted) across the clusters thousands of times, and the given outcome is re-analyzed in each of these permuted study groups to see how often a difference as large as the observed one arises by chance.

For example, let us quickly understand the Stata command in this output. We have directed Stata to test the statistical significance of the mean difference in Hb outcome between the two study groups by t-test using 5000 permuted samples. The final result shows that the actual difference in mean Hb values was 0.69 gm%, but when the study samples were permuted 5000 times, only one of them showed a difference as large as 0.69 gm%. In other words, the probability of obtaining such a difference in mean outcomes of the two study groups by chance alone was only 1 in 5000, which makes the corresponding P-value highly significant (P-value 0.0002).
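A cluster level analysis of this kind might look like the sketch below. It is not the exact command from our output. It assumes the data have already been collapsed to one row per village, with the cluster mean in Hb and a hypothetical arm variable group (0 = control, 1 = intervention). The seed value is arbitrary.

```stata
* Parametric and nonparametric tests on the cluster means
ttest Hb, by(group)
ranksum Hb, by(group)

* Permutation test: reshuffle the arm labels 5000 times and
* recompute the difference in mean Hb between the arms each time
permute group diff=(r(mu_2)-r(mu_1)), reps(5000) seed(12345) nodots: ttest Hb, by(group)
```

The permute output reports how many of the permuted differences were at least as extreme as the observed one, and the P-value is based on that count.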

So far things were hopefully fairly straightforward; now let us jump on to more advanced statistical analysis targeted not at clusters but directly at individuals, with adjustment for their clustering.

First of all, we will consider individual level analysis for a qualitative or binary outcome in a cluster RCT. In this slide, on one hand we have a simple RCT design where a binary outcome of good/bad is compared between directly randomized individuals and, on the other hand, a cluster RCT where a binary outcome of yes/no is tested between groups of individuals cluster randomized at the level of households. Such outcome analysis between two study groups (whether simple or cluster RCT) is basically a 2 by 2 or crosstab analysis, where the binary outcome is in columns and the study groups in rows, or vice versa. In a simple RCT, such 2 by 2 tables are statistically analyzed by the simple Pearson’s chi-square test, which is given by the following Stata command. But when the RCT is cluster or group randomized, the analysis of the 2 by 2 table in Stata is modified a little to a clustered chi-square test, as shown here. In contrast, carrying out such a clustered chi-square test by manual Excel calculations may take days to weeks, depending upon one’s mathematical rigor.
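The two analyses can be sketched as follows. The variable group (trial arm) is a hypothetical name, and I am assuming that the clustered test is the user-written clchi2 command from the cltest package, which has to be installed separately (findit cltest). Please confirm its syntax with help clchi2.

```stata
* Simple RCT: ordinary Pearson chi-square on the 2 by 2 table
tabulate group anemia, chi2 row

* Cluster RCT: chi-square adjusted for clustering within villages
* (user-written command; syntax as I recall it from the cltest package)
clchi2 anemia, cluster(village) by(group)
```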


Now if we have to compare such binary outcomes between study groups in the presence of other covariates, the standard statistical procedure is binary logistic regression. However, in the presence of clustering, logistic regression is further customized in one of two ways: the cluster specific approach, where random variability between clusters is modeled as an additional parameter, or the population average approach, where outcome odds are averaged across all clusters after adjusting for within cluster correlation.

Let us simplify it with a graphical demonstration. Here we have plotted the log odds of the outcome (say anemia) against study groups in a simple RCT. As one changes from control to intervention group on the X-axis, standard logistic regression in Stata (shown above) will capture the log odds of anemia as a function of intercept alpha (i.e. log odds of the outcome in the control group) plus slope beta (i.e. additional effect of the intervention), keeping other covariates at a fixed or constant value.


However, when there is clustering, the log odds of anemia will need one additional parameter called sigma u, which captures the between cluster variability in the anemia outcome. For this, we need to customize the standard logistic regression command in Stata with the addition of re (random effect) for village clusters. However, these two commands will give us odds ratios. If we want to calculate the relative risk of anemia in intervention versus control group adjusting for various covariates, we have another regression technique called GEE. Unlike random effect logistic regression, GEE does not need to estimate additional model parameters like sigma u and, when fitted with a log link instead of the logit link, it provides us directly with RR rather than OR.
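The three models discussed here could be written along the following lines. This is only a sketch: group (trial arm) and age (an example covariate) are hypothetical variable names, and anemia is assumed to be coded 0/1.

```stata
* Simple RCT: standard logistic regression, reports odds ratios
logistic anemia group age

* Declare village as the cluster (panel) variable
xtset village

* Cluster specific approach: random intercept for village,
* the output includes sigma_u; or requests odds ratios
xtlogit anemia group age, re or

* Population average approach: GEE with a log link so that
* the exponentiated coefficients (eform) are relative risks
xtgee anemia group age, family(binomial) link(log) corr(exchangeable) vce(robust) eform
```

A log-binomial GEE sometimes fails to converge. A commonly used fallback is family(poisson) with link(log) and robust standard errors, which also gives relative risks.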

Finally, let us see how we can analyze a quantitative outcome at the individual level. In a simple RCT without any clustering, the univariate test for comparing a quantitative outcome between two randomized groups is the t-test, given in Stata by the following command. However, in the case of a cluster RCT, comparative analysis of a quantitative variable by t-test needs to be adjusted for clustering. Fortunately, a clustered t-test is again available in Stata as the following command. But when we have to undertake multivariable analysis adjusting for the effects of other confounding variables, we go ahead with standard linear regression in a simple RCT. In a cluster RCT, standard linear regression needs to be fortified with one of these three modifications: use of cluster robust SE, a mixed effect linear model or, again, GEE.
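These analyses of a quantitative outcome can be sketched as below. Again, group and age are hypothetical variable names. The clustered t-test is assumed to be the user-written clttest command from the cltest package (findit cltest), so its syntax should be checked with help clttest.

```stata
* Simple RCT: ordinary t-test and linear regression
ttest Hb, by(group)
regress Hb group age

* Cluster RCT: t-test adjusted for clustering (user-written command)
clttest Hb, cluster(village) by(group)

* Modification 1: linear regression with cluster robust SE
regress Hb group age, vce(cluster village)

* Modification 2: mixed effect linear model, random intercept for village
mixed Hb group age || village:

* Modification 3: GEE with exchangeable working correlation
xtset village
xtgee Hb group age, family(gaussian) link(identity) corr(exchangeable) vce(robust)
```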