Running regression in loops in Stata
In a world of ever-growing data, the ability to quickly and efficiently process large amounts of information is crucial. Loop functions are invaluable tools that provide an efficient means for data analysis, allowing users to automate tasks and improve analysis workflows. Not only can they save time compared to manually performing repetitive calculations, but they also enable more sophisticated processes that would otherwise be too intensive or complicated.
In this post, I’d like to demonstrate the use of loops for running regression analyses in Stata 16 (I will do the same with R soon). We will use one of Stata’s example datasets (nhanes2), loaded here from the Stata Press website.
use http://www.stata-press.com/data/r15/nhanes2.dta, clear
Once the dataset is loaded, the complete list of the variables can be printed using the ds command:
Let’s also use the describe command to get a better view of the variables (in Stata, scroll down to see the entire list of variables):
We will look at two types of loops: forvalues and foreach.
forvalues
For the first example, I am going to need three variables: one outcome (heartatk), one independent variable (iron, serum iron in mcg/dL), and a third one whose values will be used for looping (iterating).
The variable region has four categories and will serve the purpose of the third variable well.
Here is a more detailed picture of the variables of interest:
As you may have guessed, our interest is in the association between serum iron (mcg/dL) and heart attack (heartatk), which can be modelled in the following way:
logit heartatk iron, or // the "or" option reports odds ratios
Let’s say we are also interested in knowing whether the association differs across the four regions. One option is to run the regression separately for each region:
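The output is not reproduced here. As a sketch, assuming the dataset is loaded as above, commands along these lines give such a picture (the choice of commands is my assumption):

```stata
* Storage type and labels for the three variables
describe heartatk iron region

* Distribution of the continuous exposure
summarize iron

* Frequencies of the looping variable and the outcome
tabulate region
tabulate heartatk
```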
logit heartatk iron if region == 1, or
logit heartatk iron if region == 2, or
logit heartatk iron if region == 3, or
logit heartatk iron if region == 4, or
This is perfectly fine for our analytical purposes. However, it can be done in a quicker and more fool-proof way by using loops (less typing = fewer typos).
Let’s have a look at the syntax:
forvalues i = 1/4 {
    logit heartatk iron if region == `i', or
}
Saves some typing, and gives a sleeker look! I am not pasting the output here as screen capture can’t capture the entire output.
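Since the loop prints four sets of results back to back, it can help to label each one. Here is a minimal sketch, assuming the nhanes2 data is loaded as above; the display line and the local macro lbl are my additions for illustration:

```stata
forvalues i = 1/4 {
    * Look up the value label attached to region for the current value
    local lbl : label (region) `i'
    * Print a header line before each regression
    display _newline "Region `i': `lbl'"
    logit heartatk iron if region == `i', or
}
```

Note that the macro reference opens with a backtick and closes with a plain apostrophe. Curly quotes pasted from a web page will make Stata complain.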
Now, there are a few things to remember when using the forvalues loop:
- forvalues iterates over numbers, so the variable used in the if condition (region here) must be numeric.
- The range given for i (1/4 here) should match the values that variable actually takes.
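When the values are not a neat consecutive range, or you would rather not hard-code them, one alternative (not used in the examples in this post) is to let Stata collect the values with levelsof and loop over them with foreach. A sketch, assuming the same dataset; the macro name levels is arbitrary:

```stata
* Store the distinct values of region in the local macro "levels"
levelsof region, local(levels)

* Loop over whatever values were found, instead of typing 1/4
foreach i of local levels {
    logit heartatk iron if region == `i', or
}
```

The trade-off is one extra line in exchange for not having to check the coding of the variable by hand.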
foreach (multiple outcomes, same exposures)
The next example is about running a series of regressions with the same set of regressors for different outcome variables.
Let’s say we want to measure how serum iron is associated with heart attack and diabetes, both of which are categorized in the same (0/1) manner. The regular way of doing this would be:
logit heartatk iron, or
logit diabetes iron, or
Now, in lieu of running the regression twice, we can use the loop function:
foreach outcome of varlist heartatk diabetes {
    logit `outcome' iron, or
}
The example may not seem that useful, but trust me, it certainly will when we look at, say, 5+ outcomes.
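With a longer list, it can be tidier to keep the outcome names in a local macro and loop over that. A sketch using only the two outcomes from this post; the macro name outcomes is my own choice:

```stata
* Define the list once, then reuse it wherever it is needed
local outcomes heartatk diabetes

foreach outcome of local outcomes {
    logit `outcome' iron, or
}
```

Local macros only live for the duration of the run, so when working from a do-file, select and run the local line together with the loop. Unlike "of varlist", "of local" does not check that the names are existing variables, so a typo only surfaces when logit runs.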
foreach (same outcome, multiple exposures)
Our third example is the opposite of the second one: looping over multiple exposure variables (iron, tcresult, and tgresult) against the same outcome:
foreach exposure of varlist iron tcresult tgresult {
    logit diabetes `exposure', or
}
foreach (multiple outcomes, multiple exposures)
Yes, we can nest the loops to cover both outcomes and exposures:
foreach outcome of varlist heartatk diabetes {
    foreach exposure of varlist iron tcresult tgresult {
        logit `outcome' `exposure', or
    }
}
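Six regressions produce a lot of output. If all you want is a quick overview of the odds ratios, one option is to run each model quietly and print a single line per model. This is a sketch rather than a replacement for reading the full output; it relies on the stored coefficient _b[] that logit leaves behind and exponentiates it to get the odds ratio:

```stata
foreach outcome of varlist heartatk diabetes {
    foreach exposure of varlist iron tcresult tgresult {
        * Fit the model without printing the full table
        quietly logit `outcome' `exposure'
        * exp(coefficient) is the odds ratio for the exposure
        display "`outcome' on `exposure': OR = " %6.3f exp(_b[`exposure'])
    }
}
```

This drops the confidence intervals and p-values, so treat it as a first glance only.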
Happy looping!