An Interactive Survey Application for Validating Social Network Analysis Techniques

Social network analysis is extremely well supported by the R community and is routinely used for studying the relationships between people engaged in collaborative activities. While there has been rapid development of new approaches and metrics in this field, the challenging question of validity (how well insights derived from social networks agree with reality) is often difficult to address. We propose the use of several R packages to generate interactive surveys that are specifically well suited for validating social network analyses. Using our web-based survey application, we were able to validate the results of applying community-detection algorithms to infer the organizational structure of software developers contributing to open-source projects.

Mitchell Joblin (Siemens AG), Wolfgang Mauerer (Siemens AG)

1 Introduction

Social network analysis (SNA) is an increasingly popular approach to study the relationships between individuals engaged in collaborative activities (Ahn et al. 2007; Mislove et al. 2007; Kumar et al. 2010), and numerous high-quality R packages support the thriving SNA community (e.g., igraph, sna, graph, twitteR, and Rfacebook; Csardi and Nepusz 2006; Butts 2014; Barbera and Piccirilli 2015; Gentry 2015; Gentleman et al.). What is often not clear is the validity of SNA approaches that propose new metrics or apply existing metrics to a new source of data. In the literature, researchers have questioned and criticized studies using SNA because it is unclear whether the results reflect reality (Donath and Boyd 2004; Wilson et al. 2009). We developed a web-based survey application for conducting interactive surveys that addresses the specific needs of the SNA community. We successfully deployed the application to study the collaborative relationships between software developers in open-source projects and to validate the use of unsupervised machine learning algorithms to infer the developers’ organizational structure (Joblin et al. 2015).

In social network analysis, the relationships between individuals are formalized as a graph where nodes represent people and the edges between nodes represent a connection of particular interest. For example, Twitter data can be used to construct a retweet network where an edge between two individuals exists if one has retweeted the other’s tweet. The particular heuristic used to establish an edge between individuals is chosen based on the concept to be studied. For example, a retweet may indicate endorsement of the message being tweeted. From this, one could conclude that users whose content is retweeted often are regarded as influential within that local group of people. One of the primary challenges with this style of analysis is validating whether the assumptions about the relationship heuristic are correct. It may, for instance, not be clear whether a retweet always indicates a positive sentiment. Retweets could also stem from controversial topics and may not be universally regarded as supportive of the original tweet’s message.
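The retweet heuristic can be made concrete with a few lines of igraph code. The following is a minimal sketch, assuming a made-up data frame of retweet records; the user names and column names are purely illustrative and do not come from any real data set.

```r
library(igraph)

# Hypothetical retweet records: each row says `retweeter` retweeted `author`.
retweets <- data.frame(
  retweeter = c("bob", "carol", "dave", "alice", "dave"),
  author    = c("alice", "alice", "alice", "bob", "carol"),
  stringsAsFactors = FALSE
)

# Edge heuristic: one directed edge retweeter -> author per retweet.
g <- graph_from_data_frame(retweets, directed = TRUE)

# Retweets received per user (in-degree), a naive proxy for influence.
retweets.received <- degree(g, mode = "in")
print(sort(retweets.received, decreasing = TRUE))
```

Whether a high in-degree really corresponds to influence is exactly the kind of assumption that a survey of the affected users can confirm or refute.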

While SNA is primarily concerned with analyzing the properties of social networks rather than constructing them, the network construction heuristic influences the validity of the subsequent analysis. In general, the goal of SNA is to identify interesting features of a social network that capture an abstract quality of social relationships. For example, finding important or highly influential actors in a network is one of the most well-researched areas of SNA, where one considers the local or global network topology to identify individuals that are exceptionally well connected to other actors. The notion of centrality, like many other network metrics, has a dual nature: one definition is a mathematical formalization based on the network topology, and the other is an abstract social concept such as influence. The challenge we wish to address with our survey application is to validate the claim that the mathematical formalizations provided by the field of SNA are congruent with the abstract social concept we wish to identify.
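To illustrate the mathematical side of this duality, the sketch below computes two common centrality measures with igraph on a small toy network. The network and actor names are assumptions chosen for illustration only.

```r
library(igraph)

# Toy undirected network: actor "a" is connected to b, c and d; d bridges to e.
edges <- data.frame(
  from = c("a", "a", "a", "d"),
  to   = c("b", "c", "d", "e"),
  stringsAsFactors = FALSE
)
g <- graph_from_data_frame(edges, directed = FALSE)

# Two topology-based formalizations of "importance".
centrality <- data.frame(
  degree      = degree(g),       # number of direct neighbours
  betweenness = betweenness(g)   # number of shortest paths passing through a node
)
print(centrality)
```

The numbers say that "a" is the most central actor; whether "a" is also perceived as the most influential person is the social claim that still requires validation.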

Our survey application is designed to address the following three concerns that are fundamental to the validity of SNA:

All source code that implements our survey application is available at a supplementary site: .

2 Challenges

Developing and deploying a survey for SNA purposes involves a set of unique challenges and requirements that are not currently satisfied by existing survey templates and tools. We now introduce the set of requirements we identified and specifically addressed with our survey application.

Requirement 1: Ease of large scale deployment and collection of responses

Modern social networks can range in size from hundreds to millions of nodes, and the survey delivery mechanism should be designed to handle deployment under large-scale conditions. A web solution enables the survey participant to log in to the web interface and submit their responses without the need to download or install any software. Any difficulty experienced while participating in a survey creates a barrier to completion and likely contributes to lower return rates and lower-quality responses. In addition to scalability and ease of use, a web solution also allows for aggregation of responses into a common database for later analysis.

Requirement 2: Interactivity

Performing a survey for SNA purposes will often involve the need to display a labeled graph to the survey participants. Research in graph layout and visualization is continually advancing; however, the optimal visualization and layout parameters depend on network properties in a non-trivial way. The network size, edge density, and graph type (e.g., directed, undirected, weighted, unweighted, one-mode, and two-mode) all influence the optimal visualization. Readability of the graph is necessary for quality responses. To ensure that graph details are not obscured by problems such as overlapping nodes or edges, the survey participant should be able to adjust a set of visual parameters so that all necessary details of the graph are observable. The adjustable parameters also allow the visually impaired to participate more effectively.

Requirement 3: Dynamic survey content generation

We determined that certain elements of the survey needed to be generated dynamically so that each survey participant would be shown information relevant to their particular position in the network. We identified each survey participant through a login process and then computed relevant data such as the community subgraph they were found in and the set of people whom we expected to be influential to them. We found this to be a particularly powerful and interesting aspect of our survey because the responses often provided insights about the network that would not have been obvious if we had only asked general questions without showing the relevant network data.
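As a minimal sketch of this kind of per-participant content, the following code extracts the community subgraph of one identified participant. The toy network, the helper name community.subgraph, and the choice of igraph's fast greedy modularity algorithm are all assumptions made for illustration; they are not necessarily the algorithm or data used in our study.

```r
library(igraph)

# Toy network: two tightly knit groups joined by a single edge (c -- d).
edges <- data.frame(
  from = c("a", "b", "a", "d", "e", "d", "c"),
  to   = c("b", "c", "c", "e", "f", "f", "d"),
  stringsAsFactors = FALSE
)
g <- graph_from_data_frame(edges, directed = FALSE)

# Hypothetical helper: return the subgraph of the community that
# contains the given participant.
community.subgraph <- function(g, participant) {
  comm   <- cluster_fast_greedy(g)
  member <- setNames(as.vector(membership(comm)), V(g)$name)
  peers  <- names(member)[member == member[[participant]]]
  induced_subgraph(g, peers)
}

# Content shown to the participant who logged in as "a".
sub <- community.subgraph(g, "a")
print(V(sub)$name)
```

In a survey, the participant identifier would come from the login step rather than being hard-coded.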

Requirement 4: Integration with existing R infrastructure

One of our primary concerns with developing the survey was the effort required to prepare our existing SNA pipeline for use in the survey instrument. A substantial amount of support for SNA already exists in R (e.g., igraph, sna, graph, etc.); it is therefore highly desirable to integrate existing R infrastructure into the survey application with minimal effort. By taking advantage of the Shiny R web application framework, we avoided a substantial amount of effort that would otherwise have been needed to adapt existing R infrastructure to another language or platform for the survey deployment.

Requirement 5: Visually appealing and professional aesthetic

In a preliminary analysis of options for survey platforms, we realized that many of the existing tools did not support a visually appealing or professional aesthetic. We felt that an unprofessional appearance would compromise the seriousness and credibility of the organization hosting the survey and deter survey participation. Potential participants might perceive the survey poorly and suspect that the organization would mishandle the collected data, either through unethical use or through poor execution, so that the results would be useless and answering the questions futile.

3 Alternative survey tools

A number of survey tools are available online, such as SurveyMonkey and LimeSurvey, but we found that these tools were not capable of satisfying the requirements for validating SNA techniques. In a canonical survey, a number of predetermined questions are presented to the survey participants, and the responses are predetermined categories or free text fields. In the case of predominantly static and predetermined survey content, the features offered by the above tools are more than adequate and customizable. The inadequacies of these tools stem from the lack of features for interfacing with R infrastructure and supporting interactive survey content. Both SurveyMonkey and LimeSurvey have convenient import features that provide a mechanism to display precomputed survey content. Prior to developing our own application, we considered precomputing the survey content for all possible survey participants and then using the import feature. The problem with this approach was that we did not have a priori knowledge of the required content, and computing all possible variations of the survey content would be extremely wasteful. When conducting a survey, one typically expects roughly a 10% response rate, so roughly 90% of the data computed for all potential participants would go unused. This consideration is especially important for researchers working with big data, where there may be millions of potential survey participants. Furthermore, this approach would not allow the survey participant to configure any visualization parameters. We found the reactive programming model provided by shiny (Chang et al. 2015) to be far more powerful for creating interactive surveys than the mechanisms provided by the alternative survey tools. An added benefit of using shiny is that any visualization generated by an R script can be converted into a dynamic survey element with just a few lines of code. In contrast, managing a set of precomputed visualizations potentially requires vast quantities of storage space and a schema for uniquely identifying the images so that they are displayed correctly in the survey.
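The waste argument is simple arithmetic, sketched below in base R. The function name precompute.waste and the population size of one million are illustrative assumptions; only the rough 10% response rate comes from the text above.

```r
# Hypothetical helper: how much precomputed content is never viewed?
precompute.waste <- function(n.participants, response.rate) {
  used <- n.participants * response.rate
  c(precomputed     = n.participants,
    used            = used,
    wasted.fraction = 1 - used / n.participants)
}

# One million potential participants, roughly 10% of whom respond.
w <- precompute.waste(1e6, 0.10)
print(w)
```

Generating content on demand avoids this cost because work is only performed for participants who actually log in.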

4 Shiny web application framework

Shiny is a web application framework for R that allows one to transform existing R code into an interactive web application. By using Shiny, we were able to quickly implement a web-based survey instrument without the need to significantly alter our existing R infrastructure for social network analysis. To build a Shiny web application, two main components need to be implemented: a server.R file, which constructs the R objects to be displayed by the application, and a ui.R file, which controls layout and appearance. Alternatively, one can choose to implement the interface using HTML and CSS to achieve greater flexibility and customization. An example survey question taken from our application is provided in Figure 1. In the following subsections, we introduce the basic elements for implementing the example question, including the creation of interactive visualizations using the shiny and igraph packages.

Figure 1: Example survey question.

Example server R script

In our particular implementation we stored the user data in a relational database, as will likely be common in many applications. Below we illustrate the implementation to retrieve a specific person’s ego network in the form of an edge list data frame from a MySQL database. The igraph package for network analysis is then used to construct the graph object from the edge list data frame and then finally plot it. Using this basic template, one can insert their own network analysis algorithms for a specific purpose. The reactive mechanism for retrieving the interactive user input from the UI is also illustrated for the vertex size and vertex label visualization parameters. In the next section, we will see how the UI is implemented for these particular visualization parameters.

Shiny uses a very powerful reactive programming paradigm to couple the client and server elements to support interactivity. Using this model, reactive values represent values that can change over time, and reactive expressions represent operations that depend on the use of reactive values. The reactive expressions track the state of reactive values so whenever an update occurs, the dependent reactive expressions are re-executed. The reactive programming concept also supports the important separation of computationally intensive processes from the interactive elements to prevent lag in the user interface. We provide an example server script in Figure 2 to illustrate a basic example for generating interactive survey content using the reactive programming model.
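The caching behavior of reactive expressions can be observed in isolation at the R console. The following is a minimal sketch, not taken from our application: the names vertex.size, scaled.size, and n.evaluations are illustrative, and reactiveVal() stands in for a value that would normally arrive from the UI as input$vertex.size.

```r
library(shiny)

# Counter to observe how often the reactive expression is really executed.
n.evaluations <- 0

# Reactive value: can change over time.
vertex.size <- reactiveVal(5)

# Reactive expression: depends on the reactive value and caches its result.
scaled.size <- reactive({
  n.evaluations <<- n.evaluations + 1
  vertex.size() * 2
})

# isolate() permits reading reactives outside of a running Shiny session.
first  <- isolate(scaled.size())   # executed for the first time
second <- isolate(scaled.size())   # served from the cache, not re-executed
vertex.size(8)                     # update invalidates the dependent expression
third  <- isolate(scaled.size())   # re-executed with the new value

print(c(first = first, second = second, third = third,
        evaluations = n.evaluations))
```

The expression is evaluated only when one of its inputs has changed, which is the property that keeps expensive graph computations from being repeated on every visual adjustment.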

After first establishing a connection to the MySQL database, a reactive expression is defined to encapsulate the computationally intensive graph processing algorithms. The reactive expression first retrieves an ID for a specific edge list stored in the database. In the example question shown in Figure 1, the ID corresponds to a specific user’s community network; in the description of the login phase, we discuss how to identify survey participants using a login process. Next, the edge list corresponding to the ID is retrieved from the database, and then any computationally intensive processing is performed. Alternatively, if a database is unavailable, an archive file or any other storage format can be loaded into R within this reactive expression.

In the next expression, a reactive endpoint is created. Inside the expression, the reactive values that represent the vertex size (input$vertex.size) and label size (input$label.size) cause the reactive endpoint output$graph to be re-evaluated every time one of the visualization parameters is updated. A critical aspect of the implementation is the separation of the computationally intensive SNA algorithms from the visualization parameters. Without the use of a reactive conductor, the computationally intensive code would be re-evaluated for every update to the visualization parameters, resulting in severe lag in the user interface. In the next section, we demonstrate how the binding between the elements in the server script and the user interface is established. More information about the reactive programming model used by shiny server is available in the Shiny Package Documentation.


library(shiny)
library(DBI)
library(RMySQL)
library(igraph)

shinyServer(function(input, output, session, clientData) {
  # Create MySQL connection object
  con <- dbConnect(MySQL(),
                   user = 'USERNAME',
                   password = 'PASSWORD',
                   host = 'HOST',
                   dbname = 'DBNAME')

  # Query database for an edge list and perform SNA
  # and return a reactive conductor
  # (graph.data, graph.id and get.graph.id are placeholder names;
  #  the helper functions must be supplied by the application)
  graph.data <- reactive({
    # Get unique graph ID from database
    graph.id <- get.graph.id(con)

    # Query MySQL database for a specific network
    edge.list <- query.graph.edges(con, graph.id)

    # Insert any computationally intensive code for processing
    # the graph here to avoid being recomputed
    # for every visualization update
    edge.list
  })

  # Generate reactive endpoint
  output$graph <- renderPlot({
    # Read the cached edge list from the reactive conductor
    edge.df <- graph.data()

    # Get input from UI for graph label and node size
    # from reactive sources
    vertex.size <- input$vertex.size
    label.size  <- input$label.size

    # Create igraph graph object and plot
    g <- graph_from_data_frame(edge.df)
    plot(g, vertex.size = vertex.size,
         vertex.label.cex = label.size)
  })
})
Figure 2: Server R script for example survey question.

Example HTML UI

The user interface component of the shiny web application controls the appearance and layout of the survey, including all survey questions and response fields. We chose to implement the UI in HTML and CSS using the open-source framework Bootstrap; alternatively, one could implement the UI in R using shiny. The advantage of developing the UI in HTML is greater customizability of the look and feel, but the features provided by shiny are sufficient for most purposes and do not require HTML knowledge. With Bootstrap, we were able to achieve a professional aesthetic and maximum flexibility using only the basic features offered by the framework. An example survey question is shown in Figure 3, demonstrating the UI implementation in HTML to display the survey question text, an igraph network object with user-configurable visualization parameters, and a multi-category response section. This example illustrates all the basic elements of a survey question, and all the questions in our example application follow a similar format. Beginning at the top of Figure 3, the survey question is specified inside a Bootstrap “well” element. Next, the visualization parameters for the network are specified as sliders, as shown in Figure 1. Moving further downward, the binding between the graph object provided by the server.R script and the UI is made. Lastly, the response fields “agree” and “disagree” are specified as HTML form inputs. This basic example question can be extended to suit the needs of a wide variety of survey questions. For example, one can rewrite the question, change the visualization parameters, or introduce different response categories (e.g., a five-level Likert item). From a technical standpoint, Figure 3 demonstrates how the input elements for dynamically altering the graph visualization are implemented and how the graph plot generated by the server.R script is integrated into the UI.

<h3>Question 1</h3>
<!--Survey question text-->
<div class="well">
  <h4>Does the following network accurately represent collaborative relationships?</h4>
</div>

<!--Display network visualization-->
<div class="row-fluid">
  <div class="span2">
    <!--Insert all user configurable graph visualization parameters here-->
    <h5>Visual Adjustments</h5>
    <div class="well">
      <!-- Slider to change vertex size parameter, input id
           matches variable name in server.R-->
      <label class="control-label" for="vertex.size">Vertex Size:</label>
      <input class="jslider"
      data-format="#,##0.#####" data-from="1" data-locale="us" data-round="false"
      data-skin="plastic" data-smooth="false" data-step="1" data-to="10" id="vertex.size"
      name="vertex.size" type="slider" value="5">
    </div>
  </div>

  <div class="span10">
    <div class="well">
      <!--Insert graph plot, id matches output variable identifier in server.R-->
      <div class="shiny-plot-output" id="graph" style=
      "width: 100%; height: 800px; margin-left:0px; margin-right:0px;
      margin-bottom:0px; margin-top:0px"></div>
    </div>
  </div>
</div>

<!--Specify response fields-->
<div class="well">
  <label>Select one:</label>
  <table class="table table-condensed table-bordered">
    <tr>
      <td><label class="radio"><input name="q1a" type="radio" value="agree">Agree</label></td>
      <td><label class="radio"><input name="q1a" type="radio" value="disagree">Disagree</label></td>
    </tr>
  </table>
</div>
Figure 3: HTML UI for example survey question.

The binding between the elements of the server.R script and the UI is achieved through the variable identifiers. One can see that the HTML id attributes match the corresponding variable identifiers in the server.R script. In this case, the vertex.size visualization parameter is displayed as a slider. For categorical or other discrete parameters, drop-down menus can be used instead of a slider. The appearance of the example question rendered in a browser is shown in Figure 1 and includes the graph, the basic visualization adjustments, the categorical response input, and an additional text input for comments.
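For readers who prefer to stay in R, the same identifier-based binding can be sketched with shiny's own UI functions. This is an approximate counterpart of the HTML question, not the UI we deployed; the layout drop-down and its choices are hypothetical and only illustrate how a categorical parameter could be offered.

```r
library(shiny)

ui <- fluidPage(
  h3("Question 1"),
  wellPanel(
    h4("Does the following network accurately represent collaborative relationships?")
  ),
  fluidRow(
    column(2,
      h5("Visual Adjustments"),
      wellPanel(
        # inputId matches input$vertex.size in server.R
        sliderInput("vertex.size", "Vertex Size:",
                    min = 1, max = 10, value = 5, step = 1),
        # Hypothetical categorical parameter shown as a drop-down menu
        selectInput("layout", "Layout:", choices = c("circle", "grid"))
      )
    ),
    # outputId matches output$graph in server.R
    column(10, wellPanel(plotOutput("graph", height = "800px")))
  ),
  # Response field; the value is available on the server as input$q1a
  wellPanel(radioButtons("q1a", "Select one:",
                         choices = c("agree", "disagree")))
)

# The generated HTML carries the same identifiers as the hand-written version.
ui.html <- as.character(ui)
```

Inspecting ui.html shows that shiny emits ordinary HTML elements whose id and name attributes carry the identifiers passed to the R functions, which is why both styles of UI bind to the server script in the same way.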

The above serves as a starting point for constructing more elaborate applications and integrating specific SNA results. For example, one could extend the above code to show only the vertex-induced subgraph of nodes with a particularly high centrality, or to show only the relationships that exist between actors during specific time periods, using sliders to select the time range of an evolving network.
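Both extensions can be sketched with a few igraph calls. The helper names, the toy edge list, and the year attribute are assumptions for illustration; in an application, the threshold and the time range would come from slider inputs.

```r
library(igraph)

# Hypothetical collaboration edges with the year in which each occurred.
edges <- data.frame(
  from = c("a", "a", "a", "b", "d", "e"),
  to   = c("b", "c", "d", "c", "e", "f"),
  year = c(2012, 2013, 2013, 2014, 2014, 2015),
  stringsAsFactors = FALSE
)
g <- graph_from_data_frame(edges, directed = FALSE)

# Vertex-induced subgraph of all nodes whose degree is at least min.degree.
high.centrality.subgraph <- function(g, min.degree) {
  induced_subgraph(g, which(degree(g) >= min.degree))
}

# Keep only edges inside year.range (c(first, last)), then drop isolated nodes.
time.window.subgraph <- function(g, year.range) {
  outside <- which(E(g)$year < year.range[1] | E(g)$year > year.range[2])
  g.window <- delete_edges(g, outside)
  delete_vertices(g.window, which(degree(g.window) == 0))
}

core   <- high.centrality.subgraph(g, min.degree = 2)
window <- time.window.subgraph(g, year.range = c(2013, 2014))
print(V(core)$name)
print(E(window)$year)
```

Placing these helpers inside a reactive expression that reads the slider values would let participants explore the network without re-running the underlying database query.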

5 Survey execution process

We now introduce the execution of the survey application, the data flow, and the general architectural elements used to accomplish the goals of each phase. The survey is broken up into three main phases discussed in detail below, namely the login phase, the survey completion phase, and the response collection phase. The login phase is used to identify the survey participant so that the appropriate visual elements can be generated for this specific user. Next, the shiny application server queries the database storing user data to generate the appropriate survey content for the specific user. Finally, the survey responses are collected and subsequently stored in a relational database so that further analysis can be easily performed. Figure 4 illustrates the sequence of events and the data exchanged between the main architectural elements. An example survey can be found at the following site: