Revisiting Historical Bar Graphics on Epidemics in the Era of R ggplot2

This study is motivated by an article published in a local history magazine on “Pandemics in the History”. That article, in turn, drew on a government report containing several statistical graphics that were drawn by hand in 1938 to summarize official statistics on epidemics that occurred between the years 1923 and 1937. Motivated by the aesthetic information design of these historical graphs, we investigate how graphical elements such as titles, axis lines, axis tick marks, tick mark labels, colors, and data values are presented on them, and how to reproduce them with the well-known data visualization package ggplot2.

Sami Aldag (Department of Mathematics Engineering, Istanbul Technical University), Dogukan Topcuoglu (Department of Mathematics Engineering, Istanbul Technical University), Gul Inan (Department of Mathematics, Istanbul Technical University)
2022-06-21

1 Introduction

In August 2018, a local history journal named “Social History” published an issue on “Pandemics in the History”, covering pandemics that left a deep mark on the world, created new public policies, and, in turn, reshaped state-society relations across the globe (Toplumsal Tarih 2018). The issue includes several articles on specific diseases such as plague, malaria, cholera, diphtheria, trachoma, syphilis, and tuberculosis, in which the content was accompanied by rich historical photographs and visualizations.

The article entitled “Fight against syphilis that forgot to embrace in the era of early Republic” by Malkoc (2018) in this issue particularly caught our attention, since it includes several aesthetically attractive statistical column bar graphics, presumably drawn by hand with the help of a ruler, with a citation to a government report published in \(1938\). A close examination of this 620-page government report, which is also available online at https://acikerisim.tbmm.gov.tr/handle/11543/553, revealed that it has a section where the Ministry of Health reported official statistics on all health policy actions taken and health services provided to improve public health between the years \(1923\) and \(1937\). While most of the official statistics were summarized in tabular form, around forty different statistical bar graphics were also used to visually summarize the official statistics on various epidemic diseases such as smallpox, trachoma, malaria, and syphilis, which occurred in the country between the years \(1923\) and \(1937\). It is evident from these graphics that the government officials of the time placed special emphasis on information design. Further discussions with several academics studying the history of graphic design also revealed that the use of aesthetically designed statistical graphics was already common in the country in the late 1800s, in parallel with the rest of the world (Durmaz 2017). Here we note that well-known early examples of the visualization of official statistics around the world include the Statistical Atlas of the United States in the late 1800s, the Album de Statistique Graphique, and the Graphical Statistical Atlas of Switzerland 1897–2017, selected illustrative examples of which are available in Friendly (2008).

Furthermore, statistical graphics were also used by government officials as an effective communication tool to inform a society that had low literacy skills during that period (Sengul 2017). This remains true during the Covid-19 pandemic. With the help of technological advances in data visualization software in our era, government officials, authorities, and media make intensive use of (mostly interactive) data visualization tools to release pandemic-related statistics to the public in a very short time and keep society informed (e.g., see the GitHub account of the Civil Protection Department of the Italian government (Consiglio dei Ministri 2020) and the GIS-based interactive dashboard of the Coronavirus Resource Center (Johns Hopkins University 2020)). Hence, as in the past, during the Covid-19 pandemic data visualization continues to be the most effective way of sharing information and informing society across the globe (McCoy 2020).

On the other hand, while statisticians and graphic designers may have different priorities on what makes a good graphic (Gelman and Unwin 2013; Quito and Kopf 2020), reading graphics, understanding the information design behind them, and interpreting them require data literacy on the part of society (e.g., the use of semi-logarithmic graphs for visualizing the rate of change of Covid-19 infections has been debated at length (Garthwaite 2020)). In this sense, we are motivated by i) VanderPlas et al. (2019), who revisited, reinterpreted, and reproduced some novel charts from the 1870 Statistical Atlas with modern technology, ii) the exhibition entitled "Speak to the Eyes", curated by Durmaz (2017), which revisited some historical graphics on justice statistics from the 1920s and turned them into motion graphics, and iii) Matthew (2019), who revisited and reproduced W.E.B. Du Bois's visualizations on the social and economic life of African-Americans in the 1900s via R. In this study, we likewise revisit and reproduce, via R, the historical column bar graphics used to visualize official statistics on epidemics that occurred in our country between the years \(1923\) and \(1937\). Accordingly, the aim of this study is to investigate i) how graphical elements of the historical column bar graphs, such as titles, axis lines, axis tick marks, tick mark labels, bar colors, and data values, are presented on these graphics and ii) how to reproduce these hand-drawn graphics from 1938 with the well-known data visualization software ggplot2 (Wickham et al. 2020).

The subsequent sections of the paper are organized as follows: We first give general information on the graphical elements of column bar graphics and then discuss the column bar graphics used in this study. We also give redesigned versions of some of the selected historical graphics. Finally, we finish with some concluding remarks.

2 An overview on graphical elements of a column bar

The bar chart was invented by William Playfair to visualize the imports and exports of Scotland to and from seventeen countries in the year 1781 and was first published in his book entitled "Commercial and Political Atlas" in 1786 (see Figure C in Beniger and Robyn (1978)). In a general sense, column bar graphics are a statistical visualization technique used to present quantitative information through a series of vertical rectangles. They are mostly used to display and compare data values of multiple groups over time (Harris 2000). Column bars mostly have a quantitative linear scale on the vertical axis. The height of each column in a bar graph is proportional to the numerical value it represents, so that the viewer can make a visual comparison between the columns. When the vertical axis is not available in the graph, the actual data value that each column represents can be placed either inside the column or at the top of the column. The data value can be aligned horizontally or vertically, depending on the space available on the graph.

The scale on the horizontal axis is generally categorical or sequential (e.g., time series), and tick marks may or may not be used on the horizontal axis. The width of the columns and the spacing between them are generally kept uniform within a graph. The data series belonging to different groups are generally differentiated from each other by assigning different colors or patterns to the groups. The differentiation in colors and/or patterns is also reflected in the legend keys to help the viewer identify the quantitative information displayed in the graph. Furthermore, the information on the legend keys is ordered as it appears on the graph. The legends can be placed anywhere on the graph, but the closer they are to the information they represent, the easier it is for the viewer to decode the information on the graph. Grid lines in the background are generally not preferred, since rectangular bars are very dominant visual objects. The background color may contrast with the color of the columns to improve communication between the graph and the viewer. We illustrate these graphical elements in Figure 1.

Figure 1: An anatomy of a column bar graph.

3 Column bar graphics used in this study

Due to World War I (1914-1918) and then the Independence War (1919-1922), the country, which was founded in 1923, had to deal simultaneously with many infectious diseases such as smallpox, malaria, plague, syphilis, trachoma, tuberculosis, leprosy, and typhus. Due to the increasing number of infectious diseases and infected people, the government had to develop new public health policies and offer health care services by opening new hospitals, training health care workers (including medical doctors, nurses, and so on), and producing disease diagnostic kits, drugs, serum, and vaccines. Despite many hardships, the government achieved great success in the prevention of infectious diseases during the period 1923-1937. In 1938, all these efforts, especially those concerning the workload of hospitals and vaccine administration in the country, were summarized officially, and these official statistics were visualized through statistical column bar graphics along with the tabular raw data in the government report. We should note that the government report does not provide any additional information or explanation related to these graphics.

In this study, we investigated and reproduced nine of these historical column bar graphics. We provide the original graphics alongside the reproduced ones. We categorize them into five main parts with respect to the number of data series available as well as the grouping structure of the bars (e.g., overlapped, side-by-side, and paired bar graphs). We also kindly invite readers to consult the R code available as supplementary material while examining the graphics.

Bar graph with one data series

In bar graphs with one data series, bars are used to compare a single numerical variable per item or category. Figure 2 gives the amount of smallpox vaccine administered in various regions of the country over the period \(1925\) to \(1937\). Smallpox is a deadly infectious disease accompanied by lesions filled with thick liquid appearing on the face, mouth, nose, and body of a person. It had been shown that vaccination within the early days of exposure can prevent the disease or lessen its severity. For that reason, vaccination was mandatory for newborns, at schools, and at some workplaces. Note that these vaccines were distributed for free to prevent the disease.

In Figure 2, the background color of the figure is white. There is no vertical axis, and hence none of the related elements (e.g., axis line, axis title, axis tick marks, and tick mark labels). Instead, the value of each column bar can be read from the data value placed inside the column, and the column heights are directly proportional to the data values they represent. Since the columns are tall and take up much of the plotting area, the data values are placed vertically inside the columns. The horizontal axis represents time in equal increments and has no axis title. Because the background is white, the column bars are filled in black, whereas the data values are colored white for contrast. Due to the large number of column bars and the lack of space, the bars and the spacing between them are kept narrow, and the horizontal axis tick mark labels are displayed vertically.

Since the heights of the columns of the graph are directly proportional to the data values they represent, a geom_col() layer right after the main ggplot() call in ggplot2 is used to produce Figure 3. The data values are placed onto the graphic via an annotate() layer. The white background is obtained via a theme_classic() layer. The structure of the graph is mostly obtained by modifying components of the theme() layer such as axis.line, axis.title, axis.ticks, and axis.text, in addition to the geom_col() layer. The main figure title consists of four lines. However, the first three lines and the last line of the title have different font types, sizes, and faces (i.e., italic and non-italic text). For that reason, several annotate() layers are used to build the full title rather than a ggtitle() or labs() layer, which assumes a uniform text structure over the multiple lines. Finally, we can see that the number of smallpox vaccines administered increased over the years.
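The layering described above can be sketched as follows. This is only a minimal illustration, not the code from the supplementary material: the data frame `vaccine`, its values, the title wording, and all annotation positions are hypothetical placeholders rather than the figures from the 1938 report.

```r
library(ggplot2)

# Hypothetical illustrative values; NOT the figures from the 1938 report.
vaccine <- data.frame(
  year  = 1925:1937,
  doses = seq(200000, 1400000, by = 100000)
)

p_single <- ggplot(vaccine, aes(x = year, y = doses)) +
  # Narrow black bars whose heights are proportional to the data values
  geom_col(fill = "black", width = 0.6) +
  # Data values written vertically, in white, inside each bar
  annotate("text", x = vaccine$year, y = vaccine$doses / 2,
           label = formatC(vaccine$doses, format = "d", big.mark = ","),
           colour = "white", angle = 90, size = 3) +
  # Title lines added one by one so that each can have its own size and face
  # (placeholder wording and positions)
  annotate("text", x = 1928, y = 1350000,
           label = "Smallpox vaccine administered",
           fontface = "italic", size = 5) +
  annotate("text", x = 1928, y = 1250000,
           label = "1925-1937", fontface = "plain", size = 4) +
  scale_x_continuous(breaks = 1925:1937) +
  # White background, then removal of all vertical-axis elements
  theme_classic() +
  theme(
    axis.line.y  = element_blank(),
    axis.title   = element_blank(),
    axis.ticks.y = element_blank(),
    axis.text.y  = element_blank(),
    axis.text.x  = element_text(angle = 90, vjust = 0.5, hjust = 1)
  )
```

Printing `p_single` renders the plot. Using one annotate() call per title line is what allows mixed font faces, at the cost of positioning each line by hand in data coordinates.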

Figure 2: Historical figure on "Smallpox vaccine administered in various regions of the country, between the period 1925-1937" retrieved from https://acikerisim.tbmm.gov.tr/handle/11543/553.
Figure 3: Reproduced figure on "Smallpox vaccine administered in various regions of the country, between the period 1925-1937".

Overlapped bar graphs with two data series

Figure 4: Historical figure on "The service of hospitals and dispensaries within the Department of Control of Trachoma, 1925-1937" retrieved from https://acikerisim.tbmm.gov.tr/handle/11543/553 (■ The number of inpatient treatments, □ The number of surgeries performed).
Figure 5: Reproduced figure on "The service of hospitals and dispensaries within the Department of Control of Trachoma, 1925-1937" (■ The number of inpatient treatments, □ The number of surgeries performed).

Figure 4 presents the service of hospitals and dispensaries within the Department of Control of Trachoma between the years \(1925\) and \(1937\) with respect to the number of inpatient treatments performed (in black) and the number of surgeries performed for treating trachoma (in white).

Trachoma is an infectious eye disease caused by a bacterium and is transmitted among humans through the shared use of items used for cleaning the face. If it is not treated at an early stage, it may damage the cornea or even lead to blindness. In the early stages of trachoma, antibiotics may be effective in eliminating the infection, whereas surgery may be required at later stages.

As in Figure 2, the background color of Figure 4 is white, and there is no vertical axis or any information related to it (e.g., axis line, axis title, axis tick marks, and tick mark labels). The columns of both groups are overlapped \(100\%\), and the column bar heights are directly proportional to the data values they represent. The columns for the number of inpatient treatments are always shorter than the columns for the number of surgeries performed for treating trachoma between the years \(1925\) and \(1937\). Thus, the columns for inpatient treatments are positioned in front of the columns for the number of surgeries performed. However, the disparity between the heights of the two groups is counterbalanced by assigning a strong color, black, to the number of inpatient treatments, and a recessive color, white, to the number of surgeries performed, which is a common strategy in graphic design (White 1984). This emphasis is also reflected in the legend keys, which are ordered according to color, not alphabetically.

The data values for the surgery group between the years \(1925\) and \(1931\) are placed vertically at the top of the columns. Those between the years \(1932\) and \(1937\) are placed vertically inside the columns due to lack of space in the plotting region, whereas the data values for the number of inpatient treatments are always placed vertically inside the column. All the data values for each group are in black since the background color is white. Because the number of column bars in the plotting area is moderate, the width of the column bars and the spacing between them are larger than in Figure 2.

Unlike Figure 2, there are no labels for the horizontal axis tick marks here. However, the third line of the main graph title gives the clue that the horizontal axis starts at \(1925\) and ends at \(1937\). As the reviewer pointed out, we think that the horizontal tick mark labels were unintentionally forgotten, since this is the only bar graph with missing horizontal tick mark labels in the government report. Their absence may demand cognitive effort from a viewer who would like to know the exact number of inpatient treatments and/or surgeries performed in a given year, especially for the years between 1925 and 1937. In that case, the viewer has to count the bars to get precise information. On the other hand, if the interest is in the overall trend of the number of inpatient treatments and/or surgeries performed over the years, then comparing the heights of the bars visually avoids this problem. If there is not enough space to place all the tick mark labels on the horizontal axis, then two design choices can be followed: 1) rotating all the labels, for example by 90 or 45 degrees, or 2) starting the labeling at the year 1925 and labeling every other year.
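The two labeling choices can be illustrated with a short ggplot2 sketch. The data frame `trachoma` and its values are hypothetical placeholders, not the figures from the government report, and the object names are chosen only for this example.

```r
library(ggplot2)

# Hypothetical illustrative values; NOT the figures from the 1938 report.
trachoma <- data.frame(year = 1925:1937, count = seq(500, 6500, by = 500))

base_plot <- ggplot(trachoma, aes(x = year, y = count)) +
  geom_col(fill = "black", width = 0.6) +
  theme_classic()

# Choice 1: label every year and rotate the labels by 90 degrees
# (use angle = 45 for a diagonal alignment instead)
p_rotated <- base_plot +
  scale_x_continuous(breaks = 1925:1937) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

# Choice 2: start at 1925 and label every other year
every_other_year <- seq(1925, 1937, by = 2)
p_sparse <- base_plot +
  scale_x_continuous(breaks = every_other_year)
```

The first choice keeps every year readable at the cost of vertical space below the axis; the second keeps the labels horizontal but leaves the viewer to infer the unlabeled years.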

In Figure 5, the \(100\%\) overlapping structure of the column bars is obtained by setting the argument position = "identity" in the geom_col() layer. Note that the look of Figure 5 requires the levels of the grouping factor in the data to be ordered as inpatient treatments first and surgeries performed second. This grouping variable is also mapped to the fill and alpha aesthetics in the main ggplot() call, since the fill colors and the transparency levels of the bars should be matched with the levels of this grouping variable. The final look of the figure is obtained by modifying several components of the theme() layer, assigning the color “black” to the inpatient treatments and “white” to the surgeries performed via a scale_fill_manual() layer, and then assigning an alpha value of “1” (fully opaque) to the black bars of inpatient treatments and an alpha value of “0” (fully transparent) to the white bars of surgeries performed via a scale_alpha_manual() layer. Hence, the order of the elements of the color vector in the scale_fill_manual() layer and the order of the elements of the transparency vector in the scale_alpha_manual() layer are matched with the order of the levels of the grouping variable. Lastly, if transparency were not added to the plot, only the color of the second level of the grouping variable would be displayed, due to the overlapping structure of the column bars.
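A minimal sketch of this overlapping construction is given below. The data frame `service`, its values, and the level names are hypothetical placeholders, not the report's figures; the black bar outline (colour = "black") is also an assumption added here so that the fully transparent bars remain visible on a white background.

```r
library(ggplot2)

# Hypothetical illustrative values; NOT the figures from the 1938 report.
service <- data.frame(
  year  = rep(1925:1937, times = 2),
  # Level order matters: inpatient first, surgery second
  group = factor(rep(c("inpatient", "surgery"), each = 13),
                 levels = c("inpatient", "surgery")),
  count = c(seq(100, 1300, by = 100),   # shorter bars (inpatient)
            seq(300, 3900, by = 300))   # taller bars (surgery)
)

p_overlap <- ggplot(service,
                    aes(x = year, y = count, fill = group, alpha = group)) +
  # position = "identity" draws both series at the same x position,
  # i.e. a 100% overlap instead of stacking
  geom_col(position = "identity", colour = "black", width = 0.6) +
  # Colors and alpha values are matched to the factor levels by name
  scale_fill_manual(values = c(inpatient = "black", surgery = "white")) +
  scale_alpha_manual(values = c(inpatient = 1, surgery = 0)) +
  theme_classic() +
  theme(legend.position = "bottom")
```

Because the surgery bars are drawn after the inpatient bars, an opaque white fill would hide the shorter black bars; setting their alpha to 0 lets the black bars show through while the outline still marks the taller bars.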

The white line segment in the first column of Figure 5 is added via an annotate() layer with the "rect" geom. The three-line main graph title, on the other hand, is built with labs() and annotate() layers due to the italic font of the middle line compared to the non-italic font of the first and last lines. Lastly, we can say that both the number of inpatient treatments and the number of surgeries performed increased considerably over the years.
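A possible way to combine these layers is sketched below. The data, the coordinates of the white rectangle, and the way the title is split into lines are all assumptions made for illustration; the supplementary R code may arrange them differently.

```r
library(ggplot2)

# Hypothetical illustrative values; NOT the figures from the 1938 report.
bars <- data.frame(year = 1925:1929, count = c(400, 600, 900, 1300, 1800))

p_title <- ggplot(bars, aes(x = year, y = count)) +
  geom_col(fill = "black", width = 0.6) +
  # Thin white rectangle over the first bar, acting as a white line segment
  # (placeholder coordinates)
  annotate("rect", xmin = 1924.7, xmax = 1925.3, ymin = 195, ymax = 205,
           fill = "white") +
  # First title line via labs(); remaining lines via annotate() so that
  # the middle line can be italic while the last one is plain
  labs(title = "The service of hospitals and dispensaries") +
  annotate("text", x = 1927, y = 2200,
           label = "within the Department of Control of Trachoma",
           fontface = "italic", size = 4) +
  annotate("text", x = 1927, y = 2050,
           label = "1925-1937", fontface = "plain", size = 4) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))
```

The rectangle is positioned in data coordinates, so its x limits are chosen to match the bar width of 0.6 centered on 1925.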

Figure 6 shows the service of the Zonguldak Government Hospital between the years \(1924\) and \(1937\) with respect to the number of inpatient treatments (in black) and the number of outpatient treatments (in white). The data value for the number of outpatient treatments is not available for \(1924\) and is coded as NA in the data. While the columns for the number of inpatient treatments are taller than those for outpatient treatments over the period \(1925\) to \(1932\), the columns for the number of outpatient treatments are taller than those for inpatient treatments over the period \(1933\) to \(1937\). To increase the dominance of the number of inpatient treatments over the number of outpatient treatments, the former group is colored in black.
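As a small illustration of how a missing value coded as NA behaves in a bar layer, consider the sketch below. The data frame `hospital` and its values are hypothetical placeholders, not the report's figures, and the use of na.rm = TRUE is one possible choice rather than necessarily the one made in the supplementary code.

```r
library(ggplot2)

# Hypothetical illustrative values; NOT the figures from the 1938 report.
hospital <- data.frame(
  year  = rep(1924:1926, times = 2),
  group = factor(rep(c("inpatient", "outpatient"), each = 3),
                 levels = c("inpatient", "outpatient")),
  count = c(300, 450, 600,   # inpatient
            NA, 200, 350)    # outpatient; the 1924 value is not available
)

p_na <- ggplot(hospital, aes(x = year, y = count, fill = group)) +
  # The bar with a missing height is simply not drawn; na.rm = TRUE
  # suppresses the warning ggplot2 would otherwise issue when rendering
  geom_col(position = "identity", colour = "black", width = 0.6,
           na.rm = TRUE) +
  scale_fill_manual(values = c(inpatient = "black", outpatient = "white")) +
  theme_classic()
```

Keeping the NA row in the data, rather than deleting it, preserves the 1924 position on the horizontal axis for the inpatient bar.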

Figure 6: Historical figure on "The service of the Zonguldak Government Hospital, 1924-1937" retrieved from https://acikerisim.tbmm.gov.tr/handle/11543/553 (■ The number of inpatient treatments, □ The number of outpatient treatments).