5 Network inference: Running regressions with network data
5.1 Why run a regression?
Regressions are powerful tools that help you figure out which variables are correlated with each other. There are two main reasons to use regressions:
- you can find patterns of association between variables that you cannot spot with the naked eye (mainly because data is always messy and there is never a perfect correlation between two variables that you could detect just by looking at the data)
- you can include control variables, i.e., variables that you know from theory or past research are associated with your dependent variable and should therefore be controlled for
5.2 Which model should I use?
Since we are working with network data, the regression models we can use are a little more complicated than ordinary least squares regressions or generalized linear models. Broadly, there are two groups of models, each serving a different goal:
- Group 1: Which factors explain the variance in a specific variable (while controlling for network dependencies)?
- Group 2: What explains the data-generating process of a network? How does a network come to be, and is it significantly different from a random network with the same number of nodes and edges?
We’ll focus on the first group, where we use standard regression tools to find factors that correlate with a dependent variable (e.g., student performance) while controlling for network effects. For the second group, check out this wonderful summary of advances in inferential network analysis: Cranmer et al. (2017).
5.3 Basic idea of the Temporal Network Autocorrelation Model (TNAM)
The network autocorrelation model can be used to run regressions on data with an inherent dependency among observations. If you do not control for these dependencies, the standard errors are calculated incorrectly and your results will not hold!
First, let’s prepare the data:
# read edge list (friendship nominations) and student attributes
el <- read.table('data_literatur_varia/Edgelist_highSchoolfriends.csv',
sep = ";", header = TRUE)
dt <- read.table('data_literatur_varia/Students_highSchoolAttributes.csv',
sep = ";", header = TRUE)
The data comes from a Swiss high school that I surveyed in 2014. About 600 students were asked to complete a questionnaire about their friends, their grades, their values, beliefs, deviant behavior, etc.
We’ll create two networks: one for all sorts of friends (best friends, drinking buddies etc.) and one for school friends only:
# create adjacency matrix for all sorts of friends
# (best friends, drinking buddies, school friends etc.)
students <- unique(c(el$studentID, el$friend.ID.code))
mat <- matrix(0, nrow = length(students), ncol = length(students))
colnames(mat) <- as.character(students)
rownames(mat) <- as.character(students)
mat[cbind(as.character(el$studentID),
          as.character(el$friend.ID.code))] <- 1

# create adjacency matrix for school friends only
matsf <- matrix(0, nrow = length(students), ncol = length(students))
colnames(matsf) <- as.character(students)
rownames(matsf) <- as.character(students)
matsf[cbind(as.character(el$studentID[el$school.friend == 'Yes']),
            as.character(el$friend.ID.code[el$school.friend == 'Yes']))] <- 1
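A quick sanity check on the two matrices never hurts (a minimal sketch; output omitted):
# basic sanity checks
dim(mat)                 # should be n x n with n = length(students)
sum(mat)                 # number of directed friendship ties
sum(matsf) <= sum(mat)   # school friends are a subset of all friends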
Next we need to create an attribute file and add variables:
#create attributes file that matches the adjacency matrix
att <- data.frame('studentID' = rownames(mat))
#add attributes
# Grades in Math and French-Class
att$grade.math.french <- dt$marks.french.math.num[match(att$studentID, dt$studentID)]
# Name of school class
att$class <- dt$class[match(att$studentID, dt$studentID)]
# Gender
att$gender <- dt$gender[match(att$studentID, dt$studentID)]
# Age
att$age <- dt$age[match(att$studentID, dt$studentID)]
# Migration background
att$migrationbg <- dt$migration.bg[match(att$studentID, dt$studentID)]
# Liking school
att$liking.school <- dt$likes.school.num[match(att$studentID, dt$studentID)]
# Leisure time
att$sport.hours <- dt$sport.hours[match(att$studentID, dt$studentID)]
att$tv.hours <- dt$hr.spent.TV.num[match(att$studentID, dt$studentID)]
att$friends.hours <- dt$hr.spent.friends.num[match(att$studentID, dt$studentID)]
att$reading.hours <- dt$hr.spent.reading.num[match(att$studentID, dt$studentID)]
# Values: Being successful & Earning a lot
att$agree.besuccessfull <- dt$agree.be.successful.num[match(att$studentID, dt$studentID)]
att$agree.earnalot <- dt$aggree.earn.alot.num[match(att$studentID, dt$studentID)]
# Values: Respecting elders & Parental monitoring
att$true.parentsinterested <- dt$true.parents.interested.num[match(att$studentID, dt$studentID)]
att$true.respectelderly <- dt$agree.respect.elderly.num[match(att$studentID, dt$studentID)]
And then we’re done preparing.
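Before modelling, a short plausibility check on the merged data is worthwhile (a minimal sketch):
# every node in the network should have exactly one attribute row, in order
stopifnot(nrow(att) == nrow(mat),
          all(att$studentID == rownames(mat)))
# grade distribution, including missing values
summary(att$grade.math.french)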
5.4 TNAM-terms
To capture network dependencies in the data, several different tnam terms can be used to calculate new variables that may affect the dependent variable (or, more neutrally: that are correlated with the dependent variable).
- Spatial lags: a standard variable is multiplied with an adjacency matrix to test whether network autocorrelation exists, i.e., whether the behavior of my neighbors/friends is correlated with my behavior, no matter which way the correlation points (whether I choose similar friends, my friends influence me, or I influence my friends). See the hand-computed sketch after this list.
- Attribute similarity: using a spatial lag term, you compare your friends’ values on the dependent variable to your own value to see whether homophily (influence, selection, or both) is at work.
- Network effects: you can test whether network position is associated with the dependent variable, e.g., whether popularity or centrality scores correlate with it.
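To make the spatial-lag idea concrete, here is a hand-computed version (a minimal sketch; the tnam package handles missing values and normalization more carefully):
# multiply the adjacency matrix with the attribute vector: entry i of the
# result is the sum of the grades of i's friends (NAs crudely zeroed out)
y <- att$grade.math.french
y[is.na(y)] <- 0
lag.sum <- as.vector(mat %*% y)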
Find out more about the tnam terms by typing:
library(tnam)
?'tnam-terms'
5.5 Running TNAMs
5.5.1 Purely exogenous factors - neglecting important network dependencies!
fit1 <- lm(grade.math.french ~
age
+ gender,
data = att)
summary(fit1)
##
## Call:
## lm(formula = grade.math.french ~ age + gender, data = att)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5166 -0.5166 0.0142 0.6588 1.6588
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.64240 0.70978 7.949 2.83e-14 ***
## age -0.07229 0.04224 -1.712 0.0879 .
## gendermale 0.17541 0.10055 1.745 0.0820 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9051 on 338 degrees of freedom
## (27 observations deleted due to missingness)
## Multiple R-squared: 0.01636, Adjusted R-squared: 0.01054
## F-statistic: 2.811 on 2 and 338 DF, p-value: 0.06158
Now these results are not reliable, since we do not control for network dependencies among the observations (= students). The standard errors are underestimated and the p-values therefore too low, and the estimates themselves are also biased if we do not control for the fact that these observations potentially influence each other.
5.5.2 Checking for network autocorrelation
Let’s add a netlag term for the network dependency. This term measures whether your own performance (measured by the grades achieved in math and French class) correlates with the performance of your immediate friends. With the netlag() function we can easily estimate the (average) performance of an ego’s friends.
nldt <- netlag(att$grade.math.french, mat)
The netlag() function returns a new data set with the following columns:
1. netlag.pathdist1 = performance of direct friends
2. time = a constant here, since we only have one time point
3. node = node ID
4. response = the dependent variable, here performance
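You can inspect this structure yourself (output omitted):
head(nldt)   # columns: netlag.pathdist1, time, node, response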
We will add the first column to our attributes data set and include it in the regression:
att$grade.friends <- netlag(att$grade.math.french, mat)[,1]
fit2 <- lm(grade.math.french ~
age
+ gender
+ grade.friends,
data = att)
summary(fit2)
##
## Call:
## lm(formula = grade.math.french ~ age + gender + grade.friends,
## data = att)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5402 -0.4870 0.0133 0.6676 1.7154
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.573817 0.712216 7.826 6.57e-14 ***
## age -0.074396 0.042264 -1.760 0.0793 .
## gendermale 0.185680 0.100940 1.840 0.0667 .
## grade.friends 0.006659 0.005993 1.111 0.2673
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9048 on 337 degrees of freedom
## (27 observations deleted due to missingness)
## Multiple R-squared: 0.01995, Adjusted R-squared: 0.01122
## F-statistic: 2.287 on 3 and 337 DF, p-value: 0.07852
By default, the netlag() function applies no normalization and simply sums the values of all friends. I prefer an average effect, where the performance is averaged over the number of friends ego has.
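Conceptually, this row normalization is easy to mimic by hand (a minimal sketch of what normalization = 'row' does):
# divide each row of the adjacency matrix by ego's outdegree, so that
# W.row %*% y yields an average of friends' values instead of a sum
rs <- rowSums(mat)
W.row <- mat / ifelse(rs == 0, 1, rs)   # avoid division by zero for isolates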
# row normalization: average over ego's outgoing ties ('o' = outgoing friends)
att$avg.grade.ofriends <- netlag(att$grade.math.french, mat, normalization = 'row')[,1]
# column normalization: average over incoming ties ('i' = incoming friends)
att$avg.grade.ifriends <- netlag(att$grade.math.french, mat, normalization = 'column')[,1]
fit3 <- lm(grade.math.french ~
age
+ gender
+ avg.grade.ofriends,
data = att)
summary(fit3)
##
## Call:
## lm(formula = grade.math.french ~ age + gender + avg.grade.ofriends,
## data = att)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.54831 -0.50981 -0.00981 0.71351 1.70279
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.30974 0.72837 7.290 2.23e-12 ***
## age -0.07413 0.04208 -1.761 0.0791 .
## gendermale 0.18675 0.10034 1.861 0.0636 .
## avg.grade.ofriends 0.08580 0.04512 1.902 0.0581 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9016 on 337 degrees of freedom
## (27 observations deleted due to missingness)
## Multiple R-squared: 0.0268, Adjusted R-squared: 0.01814
## F-statistic: 3.094 on 3 and 337 DF, p-value: 0.02713
The netlag() term also allows for more distant network effects via its pathdist argument. Below we add the average grade of ego’s friends-of-friends (path distance 2).
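As intuition for path distance 2, consider this illustration (a sketch only; tnam’s internal computation may differ in details such as tie direction):
# squaring the adjacency matrix counts two-step walks; masking out direct
# friends and ego leaves the nodes reachable in exactly two steps
reach2 <- (mat %*% mat > 0) & (mat == 0)
diag(reach2) <- FALSE   # ego is not their own friend-of-friend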
att$avg.grades.ofriendoffriends <- netlag(att$grade.math.french, mat, normalization = 'row', pathdist = 2)[,1]
fit4 <- lm(grade.math.french ~
age
+ gender
+ avg.grade.ofriends
+ avg.grades.ofriendoffriends,
data = att)
summary(fit4)
##
## Call:
## lm(formula = grade.math.french ~ age + gender + avg.grade.ofriends +
## avg.grades.ofriendoffriends, data = att)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.53735 -0.51553 -0.00669 0.69880 1.76122
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.31759 0.72893 7.295 2.16e-12 ***
## age -0.07093 0.04233 -1.676 0.0948 .
## gendermale 0.17945 0.10089 1.779 0.0762 .
## avg.grade.ofriends 0.10551 0.05238 2.014 0.0448 *
## avg.grades.ofriendoffriends -0.06924 0.09330 -0.742 0.4585
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9022 on 336 degrees of freedom
## (27 observations deleted due to missingness)
## Multiple R-squared: 0.02839, Adjusted R-squared: 0.01683
## F-statistic: 2.455 on 4 and 336 DF, p-value: 0.04565
5.5.3 Checking for clique effects
Instead of checking for correlations between ego’s behavior and the average behavior of ego’s direct friends, you can also check for clique effects: here, the average performance is calculated not only over direct friends but also over indirect friends who are part of ego’s clique.
att$clique.grades <- cliquelag(att$grade.math.french, mat)[,1]
# note: this overwrites the friends-of-friends model fit4 from above
fit4 <- lm(grade.math.french ~
age
+ gender
+ clique.grades,
data = att)
summary(fit4)
##
## Call:
## lm(formula = grade.math.french ~ age + gender + clique.grades,
## data = att)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.56359 -0.50301 0.01606 0.66415 1.68883
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.587911 0.715100 7.814 7.12e-14 ***
## age -0.071803 0.042277 -1.698 0.0904 .
## gendermale 0.187352 0.102230 1.833 0.0677 .
## clique.grades 0.004487 0.006761 0.664 0.5074
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9059 on 337 degrees of freedom
## (27 observations deleted due to missingness)
## Multiple R-squared: 0.01764, Adjusted R-squared: 0.008897
## F-statistic: 2.017 on 3 and 337 DF, p-value: 0.1113
5.5.4 Checking for centrality effects
Next, we can check whether a student’s network position affects their performance. We’ll test indegree, outdegree, and betweenness centrality.
# degree() and betweenness() come from the sna package
library(sna)

att$indegree <- degree(mat, cmode = 'indegree')
fit5a <- lm(grade.math.french ~
age
+ gender
+ avg.grade.ofriends
+ indegree,
data = att)
summary(fit5a)
##
## Call:
## lm(formula = grade.math.french ~ age + gender + avg.grade.ofriends +
## indegree, data = att)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.48733 -0.52220 -0.00039 0.69348 1.68315
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.30653 0.72887 7.280 2.38e-12 ***
## age -0.07130 0.04229 -1.686 0.0927 .
## gendermale 0.18111 0.10070 1.799 0.0730 .
## avg.grade.ofriends 0.08938 0.04540 1.969 0.0498 *
## indegree -0.01702 0.02298 -0.741 0.4595
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9022 on 336 degrees of freedom
## (27 observations deleted due to missingness)
## Multiple R-squared: 0.02839, Adjusted R-squared: 0.01682
## F-statistic: 2.454 on 4 and 336 DF, p-value: 0.0457
att$outdegree <- degree(mat, cmode = 'outdegree')
fit5b <- lm(grade.math.french ~
age
+ gender
+ avg.grade.ofriends
+ outdegree,
data = att)
summary(fit5b)
##
## Call:
## lm(formula = grade.math.french ~ age + gender + avg.grade.ofriends +
## outdegree, data = att)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.53743 -0.52950 -0.00305 0.69813 1.71214
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.32109 0.72908 7.298 2.12e-12 ***
## age -0.07280 0.04216 -1.727 0.0851 .
## gendermale 0.17998 0.10086 1.784 0.0753 .
## avg.grade.ofriends 0.09566 0.04722 2.026 0.0436 *
## outdegree -0.02092 0.02937 -0.712 0.4768
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9023 on 336 degrees of freedom
## (27 observations deleted due to missingness)
## Multiple R-squared: 0.02827, Adjusted R-squared: 0.0167
## F-statistic: 2.444 on 4 and 336 DF, p-value: 0.04649
att$betweenness <- betweenness(mat)
fit5c <- lm(grade.math.french ~
age
+ gender
+ avg.grade.ofriends
+ betweenness,
data = att)
summary(fit5c)
##
## Call:
## lm(formula = grade.math.french ~ age + gender + avg.grade.ofriends +
## betweenness, data = att)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5533 -0.5063 -0.0064 0.7179 1.7079
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.300e+00 7.303e-01 7.257 2.76e-12 ***
## age -7.367e-02 4.217e-02 -1.747 0.0816 .
## gendermale 1.871e-01 1.005e-01 1.862 0.0635 .
## avg.grade.ofriends 8.401e-02 4.565e-02 1.840 0.0666 .
## betweenness 5.164e-06 1.884e-05 0.274 0.7842
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9029 on 336 degrees of freedom
## (27 observations deleted due to missingness)
## Multiple R-squared: 0.02702, Adjusted R-squared: 0.01544
## F-statistic: 2.333 on 4 and 336 DF, p-value: 0.05558
None of them seem to correlate with students’ performance.
5.5.5 Attribute similarity
You can also check whether students who are similar on some attribute also have similar performance scores. We could test this with the sport.hours variable, which measures the hours a student spends doing sports/exercise.
att$attrsim.sporthr <- attribsim(att$grade.math.french, att$sport.hours, match = FALSE)[,1]
fit6 <- lm(grade.math.french ~
age
+ gender
+ avg.grade.ofriends
+ attrsim.sporthr,
data = att)
Sadly, the tnam package currently has a bug in this function. It will be fixed soon (I have already reported it).
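In the meantime, you can compute a rough stand-in by hand (a sketch only, not the package’s exact definition of the term): for each ego, the mean absolute difference in sport hours between ego and their friends.
# hand-rolled similarity stand-in: mean absolute difference in sport hours
sp <- att$sport.hours
att$sportdiff.friends <- sapply(seq_len(nrow(mat)), function(i) {
  fr <- which(mat[i, ] == 1)          # ego i's outgoing friends
  if (length(fr) == 0) return(NA)     # isolates get NA
  mean(abs(sp[i] - sp[fr]), na.rm = TRUE)
})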
5.5.6 A fuller model
Controlling only for gender and age is probably an underspecification, which leads to biased estimates (i.e., coefficients and standard errors). For the sake of teaching, a sparse model was chosen so far, but:
always make sure you include all (potential) control variables! Otherwise your results will be biased.
You can check your regression by:
- comparing BIC scores between models (lower BIC = better model)
- comparing R^2 between models
- checking the prediction performance of your model (an advanced topic)
If you neglect this, you will end up interpreting a model that does not explain your data well. Your results will be biased, meaning that the coefficients may be wrong in size or your standard errors too small (or too large). If you are unsure whether to include another variable, do this:
Run two models, one with the variable and one without. Then check whether the fit improves (lower BIC score) and whether your effects (i.e., coefficients) change drastically; see the sketch below. Also check out this lovely YouTube video on variable selection in multiple regression models: https://www.youtube.com/watch?v=HP3RhjLhRjY.
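A minimal sketch of this two-model check, using fits from above (fit3 adds avg.grade.ofriends to fit1; both report 341 observations, so the BIC comparison is fair):
# compare model fit: lower BIC = better trade-off between fit and complexity
BIC(fit1, fit3)
# and eyeball whether the age/gender coefficients shift drastically
coef(fit1)
coef(fit3)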
fit7 <- lm(grade.math.french ~ age
           + gender
           + liking.school          # yes/no dummy
           + tv.hours               # hours of TV per week
           + reading.hours          # hours of reading per week
           + agree.earnalot         # wants to earn a lot? scale: no = 1, yes = 4
           + true.parentsinterested # are your parents interested in your life? scale: no = 1, yes = 4
           + avg.grade.ofriends,
           data = att)
summary(fit7)
##
## Call:
## lm(formula = grade.math.french ~ age + gender + liking.school +
## tv.hours + reading.hours + agree.earnalot + true.parentsinterested +
## avg.grade.ofriends, data = att)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.50201 -0.56628 -0.00226 0.73366 1.96678
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.5987676 0.8856799 5.192 3.68e-07 ***
## age -0.0612375 0.0429582 -1.426 0.15497
## gendermale 0.2154665 0.1014545 2.124 0.03445 *
## liking.schoolyes 0.3836715 0.1479345 2.594 0.00993 **
## tv.hours 0.0229388 0.0190269 1.206 0.22885
## reading.hours -0.0006643 0.0244909 -0.027 0.97838
## agree.earnalot -0.1303891 0.0654644 -1.992 0.04724 *
## true.parentsinterested 0.1351617 0.0815818 1.657 0.09854 .
## avg.grade.ofriends 0.0733366 0.0450928 1.626 0.10485
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.889 on 324 degrees of freedom
## (35 observations deleted due to missingness)
## Multiple R-squared: 0.06961, Adjusted R-squared: 0.04664
## F-statistic: 3.03 on 8 and 324 DF, p-value: 0.002682
If you want to compare models, use the texreg package:
library(texreg)
htmlreg(list(fit1, fit4, fit7),
        stars = c(.1, 0.05, 0.01, 0.001))
|                        | Model 1 | Model 2 | Model 3 |
|------------------------|---------|---------|---------|
| (Intercept)            | 5.64*** | 5.59*** | 4.60*** |
|                        | (0.71)  | (0.72)  | (0.89)  |
| age                    | -0.07·  | -0.07·  | -0.06   |
|                        | (0.04)  | (0.04)  | (0.04)  |
| gendermale             | 0.18·   | 0.19·   | 0.22*   |
|                        | (0.10)  | (0.10)  | (0.10)  |
| clique.grades          |         | 0.00    |         |
|                        |         | (0.01)  |         |
| liking.schoolyes       |         |         | 0.38**  |
|                        |         |         | (0.15)  |
| tv.hours               |         |         | 0.02    |
|                        |         |         | (0.02)  |
| reading.hours          |         |         | -0.00   |
|                        |         |         | (0.02)  |
| agree.earnalot         |         |         | -0.13*  |
|                        |         |         | (0.07)  |
| true.parentsinterested |         |         | 0.14·   |
|                        |         |         | (0.08)  |
| avg.grade.ofriends     |         |         | 0.07    |
|                        |         |         | (0.05)  |
| R2                     | 0.02    | 0.02    | 0.07    |
| Adj. R2                | 0.01    | 0.01    | 0.05    |
| Num. obs.              | 341     | 341     | 333     |
| RMSE                   | 0.91    | 0.91    | 0.89    |

***p < 0.001, **p < 0.01, *p < 0.05, ·p < 0.1
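htmlreg() produces the HTML table above; for a quick look in the console, texreg also provides a plain-text variant:
# same comparison, printed as plain text in the console
screenreg(list(fit1, fit4, fit7),
          stars = c(.1, 0.05, 0.01, 0.001))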
Cranmer, Skyler J., Philip Leifeld, Scott D. McClurg, and Meredith Rolfe. 2017. “Navigating the Range of Statistical Tools for Inferential Network Analysis.” American Journal of Political Science 61 (1): 237–51.