Knowledge classification is a widely used and practical approach in domain knowledge management. Automatically extracting and organizing knowledge from unstructured textual data is desirable in many circumstances. In this paper, we introduce a tidy framework for automatic knowledge classification supported by the akc package. Building on the R ecosystem, the akc framework can handle multiple procedures in the data science workflow, including text cleaning, keyword extraction, synonym consolidation and data presentation. While focusing on bibliometric analysis, the akc package is extensible to other contexts. This paper introduces the framework and its features in detail, with specific examples to guide potential users and developers toward open science in text mining.
Co-word analysis has long been used for knowledge discovery, especially in library and information science (Callon et al. 1986). Based on co-occurrence relationships between words or phrases, this method provides quantitative evidence of information linkages, mapping the association and evolution of knowledge over time. In conjunction with social network analysis (SNA), co-word analysis can yield richer results, such as topic popularity (Huang and Zhao 2019) and knowledge grouping (Khasseh et al. 2017). Meanwhile, in network science, many community detection algorithms have been proposed to unveil the topological structure of networks (Fortunato 2010; Javed et al. 2018). These methods have since been incorporated into co-word analysis to group components of the co-word network. Co-word analysis based on community detection is now flourishing across various fields, including information science, social science and medical science (Hu et al. 2013; Hu and Zhang 2015; Leung et al. 2017; Baziyad et al. 2019).
For implementation, interactive software applications such as CiteSpace (Chen 2006) and VOSviewer (Van Eck and Waltman 2010) have provided freely available toolkits for automatic co-word analysis, making this technique even more popular. Interactive applications are generally friendlier to users, but they may not be flexible enough to cover the whole data science workflow. In addition, manual adjustments can vary between runs, posing additional risks to research reproducibility. In this paper, we design a flexible framework for automatic knowledge classification and present an open-source package, akc, built on the R ecosystem for its implementation. Based on community detection in co-occurrence networks, the package conducts unsupervised classification of the knowledge represented by extracted keywords. Moreover, the framework handles tasks such as data cleaning and keyword merging upstream in the data science workflow, while downstream it provides both summarized tables and visualizations of the knowledge grouping. While the package was first designed for academic knowledge classification in bibliometric analysis, the framework is general enough to benefit a broader audience interested in text mining, network science and knowledge discovery.
Classification can be understood as a meaningful clustering of experience, turning information into structured knowledge (Kwasnik 1999). In bibliometric research, this method has frequently been used to group domain knowledge represented by author keywords, usually as part of co-word analysis, keyword analysis or knowledge mapping (He 1999; Hu et al. 2013; Leung et al. 2017; Li et al. 2017; Wang and Chai 2018). While all are named (unsupervised) classification or clustering, the underlying algorithms can vary widely. For instance, some studies have utilized hierarchical clustering to group keywords into different themes (Hu and Zhang 2015; Khasseh et al. 2017), whereas studies applying VOSviewer have adopted a weighted variant of modularity-based clustering with a resolution parameter to identify smaller clusters (Van Eck and Waltman 2010). In the framework of akc, we use the modularity-based clustering methods known as community detection in network science (Newman 2004; Murata 2010). These functions are supported by the igraph package (Csardi et al. 2006). The main detection algorithms implemented in akc include Edge betweenness (Girvan and Newman 2002), Fastgreedy (Clauset et al. 2004), Infomap (Rosvall and Bergstrom 2007; Rosvall et al. 2009), Label propagation (Raghavan et al. 2007), Leading eigenvector (Newman 2006), Multilevel (Blondel et al. 2008), Spinglass (Reichardt and Bornholdt 2006) and Walktrap (Pons and Latapy 2005). The details of these algorithms and their comparison have been discussed in previous studies (Sousa and Zhao 2014; Yang et al. 2016; Garg and Rani 2017; Amrahov and Tugrul 2018).
In practical application, the classification result is susceptible to data variation. The upstream procedures, such as information retrieval, data cleaning and word sense disambiguation, play vital roles in automatic knowledge classification. For bibliometric analysis, the author keyword field provides a valuable source of scientific knowledge. It is a good representation of domain knowledge and can be used directly for analysis. In addition, such collections of keywords from papers published in specific fields can provide a professional dictionary for information retrieval, such as keyword extraction from the raw text of titles, abstracts and full texts. Besides automatic knowledge classification based on community detection in keyword co-occurrence networks, the akc framework also provides utilities for keyword-based knowledge retrieval, text cleaning, synonym merging and data visualization in the data science workflow. These tasks may have different requirements in specific settings. Currently, akc concentrates on keyword-based bibliometric analysis of scientific literature. Nonetheless, the R ecosystem is versatile, and the popular tidy data framework is flexible enough to extend to data science tasks from other fields (Wickham et al. 2014; Wickham and Grolemund 2016; Silge and Robinson 2017), which benefits both end-users and software developers. When users have more specific needs, they can easily draw on other facilities from the R community. For instance, akc provides functions to extract keywords using an n-gram model (utilizing facilities provided by tidytext), but skip-gram modelling is not currently supported; this functionality can instead be provided by the tokenizers (Mullen et al. 2018) or quanteda (Benoit et al. 2018) packages. A broader picture of natural language processing (NLP) in R can be found in the CRAN Task View: Natural Language Processing.
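For illustration, skip-grams as mentioned above can be obtained directly from the tokenizers package; the following one-liner is a hedged sketch (not part of akc) with a made-up input sentence.

# Skip-gram tokenization via the tokenizers package (an alternative outside akc);
# n and n_min control the gram length, k the number of words that may be skipped.
tokenizers::tokenize_skip_ngrams("automatic knowledge classification in R",
                                 n = 2, n_min = 2, k = 1)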
An overview of the framework is given in Figure 1. Note that the name akc refers to the overall framework for automatic keyword classification as well as the released R package in this paper. The whole workflow can be divided into four procedures: (1) Keyword extraction (optional); (2) Keyword preprocessing; (3) Network construction and clustering; (4) Results presentation.
In bibliometric metadata, the title, abstract and keywords are usually provided for each paper. If the author keywords are used directly, no information retrieval is needed and this procedure can be skipped, starting instead from keyword preprocessing. However, when the keyword field is missing, the keywords have to be extracted from the raw text of the title, abstract or full text using an external dictionary. In other cases, one might want to obtain more keywords and their co-occurrence relationships from each entry; the keyword field can then serve as an internal dictionary for information retrieval from the provided raw text.
Figure 2 displays an example of the keyword extraction procedure. First, the raw text is split into sub-sentences (clauses), which suppresses the generation of cross-clause n-grams. Then the sub-sentences are tokenized into n-grams. The value of n can be specified by the user; inspecting the average number of words in keyword phrases can help decide the maximum n. Finally, a filter is applied: only tokens that appear in the user-defined dictionary are retained for further analysis. The whole keyword extraction procedure can be carried out automatically with the keyword_extract function in akc.
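To make the procedure concrete, the following minimal sketch reproduces the idea with tidytext and dplyr (the toy sentence and dictionary are made up for illustration; this is not the internal code of keyword_extract):

library(dplyr)
library(tidytext)

doc = tibble(id = 1,
             text = "Community detection is widely used; network science provides many algorithms.")
dict = tibble(keyword = c("community detection", "network science"))

doc %>%
  tidyr::separate_rows(text, sep = "[;,.]") %>%                        # split into clauses
  unnest_tokens(keyword, text, token = "ngrams", n = 4, n_min = 1) %>% # 1- to 4-grams
  filter(!is.na(keyword)) %>%
  semi_join(dict, by = "keyword")                                      # keep dictionary terms only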
In practice, textual content is seldom clean enough to analyse directly, so upstream data cleaning is inevitable. In the keyword preprocessing procedure of the akc framework, the cleaning part takes care of details such as converting letters to lower case and removing parentheses and their contents (optional). In the merging part, akc merges synonymous phrases according to their lemmas or stems. Because using lemmatized or stemmed forms directly can produce abnormal knowledge tokens, akc applies a conversion rule: keywords are first lemmatized or stemmed, then grouped by their lemma or stem, and the most frequent original keyword in each group is used to represent the others. This step is realized by the keyword_merge function in the akc package; an example is given in Table 1. After keyword merging, there may still be too many keywords for the subsequent procedures to handle efficiently. A filter should therefore be applied, excluding infrequent terms, extracting top TF-IDF terms, or using any other criterion that meets the need. Finally, a manual validation should be carried out to ensure data quality.
ID | Original form | Lemmatized form | Merged form |
---|---|---|---|
1 | higher education | high education | higher education |
2 | higher education | high education | higher education |
3 | high educations | high education | higher education |
4 | higher educations | high education | higher education |
5 | high education | high education | higher education |
6 | higher education | high education | higher education |
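The rule behind Table 1 can be sketched with dplyr and textstem as follows (an assumed simplification, not the internal code of keyword_merge):

library(dplyr)
library(textstem)

keywords = tibble(keyword = c("higher education", "higher education", "high educations",
                              "higher educations", "high education", "higher education"))

keywords %>%
  mutate(lemma = lemmatize_strings(keyword)) %>%   # lemmatized form of each keyword
  add_count(keyword, name = "freq") %>%            # frequency of each original form
  group_by(lemma) %>%
  mutate(merged = keyword[which.max(freq)]) %>%    # most frequent form represents the group
  ungroup()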
Based on the keyword co-occurrence relationships, the keyword pairs form an edge list for the construction of an undirected network. The facilities provided by the igraph package then automatically group the nodes (representing the keywords). This procedure is carried out by the keyword_group function in akc.
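The underlying idea can be sketched with dplyr and igraph as follows (a simplified illustration on toy data, not the internal code of keyword_group):

library(dplyr)

# Toy keyword table: each row is one keyword of one paper
keyword_df = tibble(id = c(1, 1, 1, 2, 2),
                    keyword = c("r package", "regression", "model", "r package", "model"))

# Pair keywords within the same paper to build a weighted undirected edge list
edges = keyword_df %>%
  inner_join(., ., by = "id") %>%
  filter(keyword.x < keyword.y) %>%
  count(from = keyword.x, to = keyword.y, name = "weight")

# Build the network and group the nodes with the Fastgreedy algorithm
g = igraph::graph_from_data_frame(edges, directed = FALSE)
igraph::membership(igraph::cluster_fast_greedy(g))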
Currently, akc presents two kinds of output. One is a summarized result, namely a table with the group number and the keyword collection of each group (with frequencies attached). The other is network visualization, which has two modes: the local mode draws the keyword co-occurrence network by group (using facets in ggplot2), whereas the global mode displays the whole network structure. Note that a network may contain a huge number of keywords, but for presentation the user can choose how many keywords from each group to display. More details are given in the following sections.
The akc framework could never have been built without the powerful support of the R community. The akc package was developed in the R environment, and the main packages imported into the akc framework include data.table (Dowle and Srinivasan 2021) for high-performance computing, dplyr (Wickham et al. 2022) for tidy data manipulation, ggplot2 (Wickham 2016) for data visualization, ggraph (Pedersen 2021) for network visualization, ggwordcloud (Le Pennec and Slowikowski 2019) for word cloud visualization, igraph (Csardi et al. 2006) for network analysis, stringr (Wickham 2019) for string operations, textstem (Rinker 2018) for lemmatizing and stemming, tidygraph (Pedersen 2022) for network data manipulation and tidytext (Silge and Robinson 2016) for tidy tokenization. A better understanding of these packages helps users draw on alternative functions to complete more specific and complex tasks, and, hopefully, become future developers of the akc framework.
This section shows how akc can be used in a real case. A collection of bibliometric data from the R Journal from 2009 to 2021 is used in this example; the data can be accessed in the GitHub repository. Only the akc package is used in this workflow. First, we load the package and import the data into the R environment.
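The loading step might look like the following (the file name rj_bib.rds is an assumption, mirroring the dictionary file used later):

library(akc)

# import the bibliometric data (assumed file name)
rj_bib = readRDS("./rj_bib.rds")
rj_bib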
# A tibble: 568 × 4
id title abstr…¹ year
<int> <chr> <chr> <dbl>
1 1 Aspects of the Social Organization and Traject… Based … 2009
2 2 asympTest: A Simple R Package for Classical Pa… asympT… 2009
3 3 ConvergenceConcepts: An R Package to Investiga… Conver… 2009
4 4 copas: An R package for Fitting the Copas Sele… This a… 2009
5 5 Party on! Random… 2009
6 6 Rattle: A Data Mining GUI for R Data m… 2009
7 7 sos: Searching Help Pages of R Packages The so… 2009
8 8 The New R Help System Versio… 2009
9 9 Transitioning to R: Replicating SAS, Stata, an… Statis… 2009
10 10 Bayesian Estimation of the GARCH(1,1) Model wi… This n… 2010
# … with 558 more rows, and abbreviated variable name ¹abstract
rj_bib is a data frame with four columns: id (paper ID), title (title of the paper), abstract (abstract of the paper) and year (publication year). Papers in the R Journal do not carry a keyword field, so we have to extract the keywords from the title or abstract field (the first step in Figure 1). In our case, we use the abstract field as the data source. In addition, we need a user-defined dictionary to extract the keywords; otherwise all n-grams (meaningful or meaningless) would be extracted and the results would contain redundant noise.
# import the user-defined dictionary
rj_user_dict = readRDS("./rj_user_dict.rds")
rj_user_dict
# A tibble: 627 × 1
keyword
<chr>
1 seasonal-adjustment
2 unit roots
3 transformations
4 decomposition
5 combination
6 integration
7 competition
8 regression
9 accuracy
10 symmetry
# … with 617 more rows
Note that the dictionary should be a data.frame with only one column named "keyword". The user can also use the make_dict function to build the dictionary data.frame from a string vector. This function removes duplicated phrases, converts them to lower case and sorts them, which can improve the efficiency of the subsequent processes.
rj_dict = make_dict(rj_user_dict$keyword)
With the bibliometric data (rj_bib) and the dictionary data (rj_dict), we can start the workflow shown in Figure 1.
In this step, we need a bibliometric data table with just two informative columns, namely the paper ID (id) and the raw text field (in our case abstract). The dict parameter is also specified so that only keywords appearing in the user-defined dictionary are extracted. The implementation is simple.
rj_extract_keywords = rj_bib %>%
  keyword_extract(id = "id", text = "abstract", dict = rj_dict)
By default, only phrases of one to four words are included as extracted keywords. The user can change this range with the n_min and n_max parameters of the keyword_extract function. There is also a stopword parameter, allowing users to exclude specific keywords from the extracted phrases. The output of keyword_extract is a data.frame (a tibble of class tbl_df, provided by the tibble package) with two columns, namely the paper ID (id) and the extracted keyword (keyword).
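For instance, a hedged sketch of a non-default call using these parameters might look like this (the excluded stop word is made up for illustration):

rj_bib %>%
  keyword_extract(id = "id", text = "abstract", dict = rj_dict,
                  n_min = 2, n_max = 3, stopword = "r package")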
For the preprocessing part, keyword_clean and keyword_merge carry out the cleaning and merging respectively. The keyword_clean function: 1) splits the text at separators (skipped if no separators exist); 2) removes the contents in parentheses, including the parentheses themselves (optional); 3) removes white space from the start and end of each string and reduces repeated white space inside a string; 4) removes all empty strings and pure number sequences (optional); 5) converts all letters to lower case; and 6) lemmatizes the keywords (optional). The merging part has been illustrated in the previous section (see Table 1) and is not explained again here. In the tidy workflow, the preprocessing is implemented via:
rj_cleaned_keywords = rj_extract_keywords %>%
  keyword_clean() %>%
  keyword_merge()
No parameters are needed in these functions because akc is designed to input and output tibbles with consistent column names. If users have data tables with different column names, they can specify them in the id and keyword arguments provided by the functions. More details can be found in the help documents (use ?keyword_clean and ?keyword_merge in the console).
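For example, for a table whose columns are named differently, the calls might look like this (other_table, paper_id and term are hypothetical names used only for illustration):

other_table %>%
  keyword_clean(id = "paper_id", keyword = "term") %>%
  keyword_merge(id = "paper_id", keyword = "term")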
To construct the keyword co-occurrence network, only a data table with two columns (paper ID and keyword) is needed; all the details are taken care of by the keyword_group function. The user can nevertheless specify: 1) the community detection function (via the com_detect_fun argument); and 2) the frequency-based filter rule for keywords (via the top or min_freq argument, or both). In our example, we use the default settings (the Fastgreedy algorithm, with only the top 200 keywords by frequency included).
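For reference, a non-default configuration might look like the following hedged sketch (assuming the grouping functions exported by tidygraph, such as group_infomap, can be passed to com_detect_fun); the default call used in this example then follows.

rj_cleaned_keywords %>%
  keyword_group(com_detect_fun = tidygraph::group_infomap, top = 100)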
rj_network = rj_cleaned_keywords %>%
  keyword_group()
The output object rj_network is of class tbl_graph, supported by tidygraph, a tidy data format containing the network data. Based on this object, we can present the results in various forms, as shown in the next section.
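Since rj_network is a tbl_graph, the usual tidygraph verbs can be used to inspect it; for example, the node table with the detected group membership can be pulled out as follows (a sketch assuming the grouping information is stored on the nodes):

library(tidygraph)

rj_network %>%
  activate(nodes) %>%
  as_tibble()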
Currently, there are two major ways to display the classification results in akc, namely networks and tables. A fast way to obtain the network visualization is the keyword_vis function:
rj_network %>%
  keyword_vis()
In Figure 3, the keyword co-occurrence network is clustered into three groups. The size of the nodes is proportional to keyword frequency, while the transparency of the edges is proportional to the strength of co-occurrence between keywords. For each group, only the top 10 keywords by frequency are shown in each facet. To dig into Group 1, keyword_network can be applied; the max_nodes parameter controls how many nodes are shown (in our case, 20 nodes, displayed in Figure 4).
rj_network %>%
  keyword_network(group_no = 1, max_nodes = 20)
Another form of presentation is a table, produced by keyword_table via:
rj_table = rj_network %>%
  keyword_table()
This returns a data.frame with two columns (see Table 2), namely the group number and the keywords (by default, only the top 10 keywords by frequency are displayed, with frequency information attached).
Group | Keywords (TOP 10) |
---|---|
1 | r package (238); algorithms (117); time (109); software (93); regression (75); number (72); features (60); sets (45); selection (41); simulation (40) |
2 | parameters (98); inference (65); framework (58); information (51); distributions (48); performance (47); probability (45); design (44); likelihood (41); optimization (31) |
3 | package (505); model (310); tools (140); tests (48); errors (46); multivariate (42); system (41); hypothesis (18); maps (16); assumptions (15) |
Word cloud visualization is also supported by akc via the ggwordcloud package, implemented in the keyword_cloud function.
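A minimal call might look like this (a sketch; we assume keyword_cloud accepts the grouped network object, like the other presentation functions):

rj_network %>%
  keyword_cloud()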
From our example, it appears that the R Journal has a strong focus on introducing R packages (Group 1 and Group 3 contain "r package" and "package" respectively). Common statistical subjects mentioned in the R Journal include "regression" (in Group 1), "optimization" (in Group 2) and "multivariate" (in Group 3). While our example provides a preliminary knowledge classification of the R Journal, an in-depth exploration could be carried out with a more professional dictionary containing more relevant keywords, and more preprocessing could be applied according to the application scenario (e.g. "r package" and "package" could be merged into one keyword, and unigrams could be excluded if they are considered to carry indistinct information).
The core functionality of the akc framework is to automatically group knowledge pieces (keywords) using modularity-based clustering. Because this process is unsupervised, it can be difficult to evaluate the outcome of the classification. Nevertheless, the default community detection algorithm was selected after empirical benchmarking. We found that: 1) the Edge betweenness and Spinglass algorithms are the most time-consuming; 2) the Edge betweenness and Walktrap algorithms can find more local clusters in the network; 3) Label propagation can hardly divide the keywords into groups; and 4) Infomap shows a high standard deviation of node numbers across groups. In the end, Fastgreedy was chosen as the default community detection algorithm in akc, because its performance is relatively stable and the number of groups increases proportionally with the network size.
Though akc currently focuses on automatic knowledge classification based on community detection in keyword co-occurrence networks, the framework is rather general for many natural language processing problems. One can use part of the framework to complete specific tasks, such as word consolidation (using keyword merging) or n-gram tokenization (using keyword extraction with a null dictionary), then export the tidy table and work in another environment. As long as the data follow the tidy data format (Wickham et al. 2014; Silge and Robinson 2017), the akc framework can easily be decomposed and applied in various circumstances. For instance, by treating the nationalities of authors as keywords, the akc framework could also be used to investigate international collaboration behaviour in a specific domain.
In the meantime, the akc framework is still under active development, trying new algorithms to carry out better unsupervised knowledge classification in the R environment. Expected new directions include more community detection functions, new clustering methods and better visualization settings. Note that besides the topology-based community detection approach, which considers the graph structure of the network, there is also a topic-based approach that considers the textual information of the network nodes (Ding 2011), such as hierarchical clustering (Newman 2003), latent semantic analysis (Landauer et al. 1998) and Latent Dirichlet Allocation (Blei et al. 2003). These methods are also accessible in R; the relevant packages can be found in the CRAN Task View: Natural Language Processing. With the tidy framework, akc can draw on more of the modern R ecosystem and move towards better reproducible open science in the future.
In this paper, we have proposed a tidy framework for automatic knowledge classification, supported by a collection of R packages integrated in akc. While focusing on data mining based on keyword co-occurrence networks, the framework also supports other procedures in the data science workflow, such as text cleaning, keyword extraction and synonym consolidation. Though at the current stage it aims to support bibliometric research, the framework is flexible enough to extend to tasks in other fields. Hopefully, this work will attract more participants from both the R community and academia, contributing to flourishing open science in text mining.
This study is funded by The National Social Science Fund of China “Research on Semantic Evaluation System of Scientific Literature Driven by Big Data” (21&ZD329). The source code and data for reproducing this paper can be found at: https://github.com/hope-data-science/RJ_akc.