Team:Waterloo/NetworkModelling

Networks Modelling

Quantification

The quantitative analysis was performed to verify qualitative findings from the policy and practices teams regarding social theories such as homophily, heterophily, social capital, and structural holes. It was conducted on data from the iGEM network to see the extent to which that network shows an association with these theories.

Hypotheses Tested

Collaboration Bias

Although iGEM as an organization is known to encourage collaboration, this test was done to verify that assumption quantitatively. The findings matched the expectation that teams are biased towards collaborating in order to meet the silver medal criteria.

H0: For each geographical group with a statistically significant sample size, the number of teams that collaborated out of the total sample does NOT show a bias towards collaboration.

H1: For each geographical group with a statistically significant sample size, the number of teams that collaborated out of the total sample shows a significant bias towards collaboration.

α = 0.01

Team Size

The results of this null hypothesis test outline the effect of team size on medal wins and special awards. The effect of team size was tested for each statistically significant geographical network cluster.

H0: For a statistically significant sample size from each geographical region, if team size is the independent variable and the type of medal won is the dependent variable, then the independent variable has NO effect on the responding variable.

H1: For a statistically significant sample size from each geographical region, if team size is the independent variable and the type of medal won is the dependent variable, then the independent variable has a significant effect on the responding variable.

α = 0.01

Geographic Association

To understand the effect of collaboration on the flow of ideas within each geographical region, all teams were first categorized by continent. This geographical classification was based on the assumption that geographic position is a clear factor affecting how teams interact. The assumption was then tested with a null hypothesis test.

H0: With the factor of collaboration held constant for each statistically significant regional sample, there is NO difference between the mean medal values won in each population. That is, if random samples of teams are selected from North America, Latin America, Europe, and Asia, with mean medal values μ1, μ2, μ3, and μ4 respectively, then μ1 = μ2 = μ3 = μ4.

H1: With the factor of collaboration held constant for each statistically significant regional sample, there is a clear difference between the mean medal values of the populations. That is, for the same samples, at least one of μ1, μ2, μ3, and μ4 differs from the others.

α = 0.01

A limitation of the data set was that it only covered 2015, which limited any analysis of trends over time.

Methods

Defining Success

The qualitative data mined from the 2015 iGEM website included the type of medal won (if any), special awards won (if any), and evidence of collaboration (based on parameters laid out in the Codebook). This was then converted into quantitative data for analysis. Each medal was assigned a success value: gold received 1, silver 0.5, bronze 0.25, and the absence of a medal 0. Winning a special award added 1 to a team's "success" score; teams that did not win special awards received no addition. These parameters were then used to calculate the total awards won by each team. For example, a team that won a gold medal and a special award was given a total award score of 2. Scoring in this manner allowed total award results to be quantified. We fully acknowledge the arbitrary nature of this kind of value assignment. To mitigate this, we tested the data set using several different value assignments (e.g. gold = 7, silver = 5, bronze = 2, special = 9), and our conclusions did not change.

Figure 1: Quantitative parameters assigned for testing. *These parameters were assigned arbitrarily; they can be changed to penalize and reward points differently for the purposes of analysis.
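
The scoring scheme described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the script used for the analysis: the function name `total_award_score`, the dictionary names, and the example teams are all hypothetical, while the numeric weights are the ones quoted in the text.

```python
# Medal and special-award values as described in the text.
MEDAL_VALUES = {"gold": 1.0, "silver": 0.5, "bronze": 0.25, "none": 0.0}
SPECIAL_AWARD_VALUE = 1.0

# Alternative weighting quoted in the text as a robustness check.
ALT_MEDAL_VALUES = {"gold": 7, "silver": 5, "bronze": 2, "none": 0}
ALT_SPECIAL_VALUE = 9


def total_award_score(medal, won_special_award,
                      medal_values=MEDAL_VALUES,
                      special_value=SPECIAL_AWARD_VALUE):
    """Return the medal value plus the special-award bonus, if any."""
    score = medal_values[medal]
    if won_special_award:
        score += special_value
    return score


# Hypothetical teams (name, medal, won a special award), for illustration only.
teams = [
    ("Team A", "gold", True),
    ("Team B", "silver", False),
    ("Team C", "none", False),
]

for name, medal, special in teams:
    default_score = total_award_score(medal, special)
    alt_score = total_award_score(medal, special,
                                  ALT_MEDAL_VALUES, ALT_SPECIAL_VALUE)
    print(name, default_score, alt_score)
```

Passing the weights as parameters makes it easy to rerun the same analysis under a different value assignment, which is the kind of sensitivity check the text describes.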

Binomial Test

To test the assumption that teams from a particular geographical region may be biased towards collaboration (ignoring the degree of this bias), a binomial null hypothesis test was conducted. Collaboration was coded as a binary value for each team: 1 if there was evidence of collaboration and 0 if there was not. The null hypothesis was that teams were not biased towards collaboration. Every geographical region with a statistically significant sample size rejected the null hypothesis and showed a bias towards collaboration. One limitation was the lack of previous years' data, so we could not compare the degree of bias over time, that is, whether teams collaborated more or less over the years. This test, along with all the other tests conducted for our network analysis, rejected the null hypothesis when the p-value was below 0.01. We used a threshold of 0.01 instead of 0.05 because we were testing several geographic regions, which raises the likelihood that at least one of them falsely tests positive. The lower threshold of 0.01 gives greater confidence in our results given this context.
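
As a rough illustration of the test described above, the following Python sketch computes an exact one-sided binomial p-value and compares it with the 0.01 threshold. Two things are assumptions rather than details from the analysis: that "no bias" corresponds to a collaboration probability of 0.5, and that the test is one-sided. The region names and counts are hypothetical.

```python
from math import comb

ALPHA = 0.01  # significance threshold stated in the text


def binomial_upper_tail(successes, n, p0=0.5):
    """Exact P(X >= successes) for X ~ Binomial(n, p0).

    p0 = 0.5 is an assumed "no bias" probability, not a value from the source.
    """
    return sum(comb(n, k) * p0 ** k * (1 - p0) ** (n - k)
               for k in range(successes, n + 1))


# Hypothetical (collaborating teams, total teams) per region.
regions = {"Region X": (10, 10), "Region Y": (6, 10)}

for name, (collaborated, total) in regions.items():
    p_value = binomial_upper_tail(collaborated, total)
    decision = "reject H0" if p_value < ALPHA else "fail to reject H0"
    print(name, round(p_value, 4), decision)
```

A small p-value means that so many collaborating teams would be unlikely if teams collaborated or not with equal probability.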

Linear Regression

A linear regression was performed to test the null hypothesis: if team size is the independent variable and the medal won is the dependent variable, then the independent variable has no effect on the responding variable. The same test was applied with the total award value (medal + special award) as the dependent variable. At a 99% confidence level, Latin America was the only network cluster in the iGEM network data for which the null hypothesis could not be rejected. This means that the team size of Latin American teams had no statistically significant effect on the total awards won. Possible interpretations of these results were drawn from the social theories discussed in the literature review.
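
The regression itself is ordinary least squares with a single predictor. The sketch below, in plain Python, shows how the slope, intercept, R-squared, and adjusted R-squared are obtained; it does not compute the p-value for the slope, which would additionally require a t-distribution with n - 2 degrees of freedom. The function name and the data are hypothetical.

```python
def simple_linear_regression(x, y):
    """Least-squares fit of y = intercept + slope * x for one predictor.

    Returns (slope, intercept, r_squared, adjusted_r_squared).
    Requires more than two observations and non-constant x and y.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    # Adjusted R-squared for one predictor: penalises the fitted slope.
    adjusted = 1 - (1 - r_squared) * (n - 1) / (n - 2)
    return slope, intercept, r_squared, adjusted


# Hypothetical team sizes and total award scores, for illustration only.
team_sizes = [5, 8, 12, 15, 20]
award_scores = [0.25, 0.5, 0.5, 1.0, 1.5]

slope, intercept, r2, adj_r2 = simple_linear_regression(team_sizes, award_scores)
print(round(slope, 4), round(intercept, 4), round(r2, 4), round(adj_r2, 4))
```

R-squared is the share of the variation in the dependent variable that the fitted line explains, so a statistically significant slope can still come with a very small R-squared.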

ANOVA

An ANOVA test was also conducted to test the variation in medals between and within the different geographical regions. The single-factor ANOVA test was conducted with a random sample of 47 teams from North America, Latin America, Europe, and Asia. The null hypothesis was that, if the factor of collaboration is held constant, the mean medal value won stays constant from region to region.
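
A single-factor ANOVA compares the variation between group means with the variation inside each group. The Python sketch below computes the F statistic and its degrees of freedom under that definition; turning F into a p-value requires the F distribution and is left out. The function name and the regional medal values are hypothetical.

```python
def one_way_anova_f(groups):
    """Return (F statistic, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean.
    ss_within = sum((value - m) ** 2
                    for g, m in zip(groups, group_means)
                    for value in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical medal values per region, for illustration only.
medal_values_by_region = [
    [1.0, 0.5, 0.5, 0.25],   # region 1
    [0.25, 0.0, 0.25, 0.5],  # region 2
    [1.0, 1.0, 0.5, 0.5],    # region 3
    [0.5, 0.25, 1.0, 0.5],   # region 4
]
f_stat, df_b, df_w = one_way_anova_f(medal_values_by_region)
print(round(f_stat, 3), df_b, df_w)
```

A large F statistic relative to the critical value for the chosen α indicates that at least one regional mean differs from the others.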

Betweenness Centrality

Betweenness centrality is an indicator of how central a node is within its network. Formally, the betweenness centrality of a team node is the number of shortest paths between pairs of other vertices that pass through that node; when a pair is joined by several shortest paths, the node is credited with the fraction of them that pass through it. The quantities needed to express it are therefore the node itself, the total number of shortest paths between each pair of other nodes, and the number of those paths that pass through the node. This concept helps identify the teams with the greatest influence in a network. Identifying these main nodes of influence in turn helps describe other social theories, including structural holes and how social capital is used to transfer information within the network.
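
To make the definition concrete, here is a minimal Python sketch of betweenness centrality for a small unweighted, undirected graph, following the accumulation scheme of Brandes (2001). It is an illustration only; the analysis itself relied on the igraph function shown later. The adjacency dictionary and team names are hypothetical.

```python
from collections import deque


def betweenness(adj):
    """Betweenness centrality for an unweighted, undirected graph.

    adj maps each node to the set of its neighbours.
    """
    centrality = {v: 0.0 for v in adj}
    for source in adj:
        # Breadth-first search counting shortest paths from source.
        order = []
        predecessors = {v: [] for v in adj}
        n_paths = {v: 0 for v in adj}
        distance = {v: -1 for v in adj}
        n_paths[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    n_paths[w] += n_paths[v]
                    predecessors[w].append(v)
        # Walk back from the farthest nodes, accumulating path fractions.
        dependency = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in predecessors[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                centrality[w] += dependency[w]
    # Each undirected pair was counted from both endpoints.
    return {v: c / 2 for v, c in centrality.items()}


# Hypothetical collaboration network: Team B links Team A and Team C.
network = {"Team A": {"Team B"}, "Team B": {"Team A", "Team C"}, "Team C": {"Team B"}}
print(betweenness(network))
```

In this toy network only Team B lies on a shortest path between two other teams, so it is the only node with a non-zero score.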

Burt’s Constraint

Burt’s constraint is built from the weight of the tie between two nodes, which captures how strong their mutual relation is. A node's constraint is higher when its contacts are strongly tied to it, tied to one another, and therefore redundant. Network constraint measures the degree to which a network is directly or indirectly concentrated on a specific node in that network. The constraint value helps in understanding a network's size, density, and hierarchy.

Size: In large networks, the constraint score is usually expected to be lower. The proportion of one’s energy invested in any one contact is expected to be lower when there are many contacts to invest in than in smaller networks, where choices are limited (Burt, 2002).

Density: This factor of the constraint measure focuses on the average strength of connection between nodes in a network. In binary network data such as the collaboration data used here, where the focus was on whether teams connected or not, density reduces to the proportion of contact pairs that are connected (Burt, 2002).

Hierarchy: This is another measure within the constraint analysis that analyzes the network closure and the positioning of nodes (Burt, 2002). This measure focuses on whether other network clusters are formed around a specific network. This same analysis can be used within a network cluster to analyse whether a specific node has hierarchical power with other nodes concentrated around it.
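
The size and density effects described above can be seen in a small calculation. The Python sketch below computes constraint for an unweighted, undirected graph using Burt's usual formulation, in which node i's constraint is the sum over its contacts j of (p_ij + Σ_q p_iq p_qj)², with p_ij equal to 1 divided by the number of i's contacts. Summing only over direct contacts is an assumption of this sketch, and library implementations such as igraph's may differ in detail. The function name and example graphs are hypothetical.

```python
def burt_constraint(adj):
    """Burt's constraint per node for an unweighted, undirected graph.

    adj maps each node to the set of its contacts. Isolated nodes have
    no defined constraint and are reported as NaN.
    """
    result = {}
    for i, contacts in adj.items():
        if not contacts:
            result[i] = float("nan")
            continue
        p_i = 1.0 / len(contacts)  # share of i's ties invested in each contact
        total = 0.0
        for j in contacts:
            # Indirect investment in j through shared contacts q.
            indirect = sum(p_i * (1.0 / len(adj[q]))
                           for q in contacts
                           if q != j and j in adj[q])
            total += (p_i + indirect) ** 2
        result[i] = total
    return result


# Hypothetical examples: a closed triangle and an open star.
triangle = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B"}}
star = {"X": {"1", "2", "3"}, "1": {"X"}, "2": {"X"}, "3": {"X"}}
print(burt_constraint(triangle))
print(burt_constraint(star))
```

The centre of the star has many unconnected contacts and therefore a low constraint, while each node of the triangle has redundant contacts and a higher one.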

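The following R script was used to generate the betweenness and constraint scores:

library(igraph)  # provides V(), constraint() and betweenness(); iGEM is the igraph graph object of the network

V(iGEM)$constraint <- constraint(iGEM)

V(iGEM)$betweenness <- igraph::betweenness(iGEM)

# cbind() returns a character matrix here because the team names are strings

results <- cbind(V(iGEM)$name, V(iGEM)$constraint, V(iGEM)$betweenness)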

The constraint and betweenness value relationship is modeled graphically. A, B, and C are outlier teams.

Results

Linear Regression

A linear regression was performed to test the null hypothesis: if team size is the independent variable and the medal won is the dependent variable, then the independent variable has no effect on the responding variable. A 99% confidence level (α = 0.01) was used for this test as well. The null hypothesis was also extended to the total award value (medal + special award) as the dependent variable. The regression rejected the null hypothesis for all regions with a statistically significant sample size except Latin America. The deviation from the trend seen in the Latin American region is discussed in detail in our paper. The global data (which included all teams) also rejected the null hypothesis. However, it is important to note the adjusted R-squared value for this regression. For the global data with total awards (medal + special award) as the dependent variable, it was 0.028, meaning that team size explains only 2.81% of the variation in the dependent variable. This figure fell to 1.67% for medal winnings alone.

ANOVA

The null hypothesis that geographical differences do NOT lead to variation in population success was rejected at the 99% confidence level. There was clear variation in the average medal won in each geographical network cluster. Africa was excluded from this calculation because its sample consisted of only 3 teams, which was deemed statistically insufficient for inclusion in the ANOVA test. Therefore, when we hold the factor of collaboration constant from region to region, there is a statistically significant difference between the mean medal values won in each population.

Figure 2: Results of the single-factor ANOVA test.

Network Modelling

This visualization of the network analysis shows connections (edges) between teams (nodes), highlighting the team's geography (nodes colored according to the team's continent of origin).

References

Brandes, U. (2001). A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25, 163–177.

Burt, R. S. (2000, May). Structural holes versus network closure as social capital. Social Capital: Theory and Research. Retrieved October 16, 2016, from http://snap.stanford.edu/class/cs224w-readings/burt00capital.pdf

Burt, R. S. (2004). Structural holes and good ideas. American Journal of Sociology, 110(2), 349–399.

Freeman, L. (1977). A set of measures of centrality based on betweenness. Sociometry, 40, 35–41.