spherepc: An R Package for Dimension Reduction on a Sphere

Dimension reduction is a technique that can compress given data and reduce noise. Recently, a dimension reduction technique on spheres, called spherical principal curves (SPC), has been proposed. SPC fits a curve that passes through the middle of data with a stationary property on spheres. In addition, a study of local principal geodesics (LPG) is considered to identify the complex structure of data. Through the description and implementation of various examples, this paper introduces an R package spherepc for dimension reduction of data lying on a sphere, including existing methods, SPC and LPG.

Jongmin Lee (Seoul National University) , Jang-Hyun Kim (Seoul National University) , Hee-Seok Oh (Seoul National University)
2022-06-22

1 Introduction

This paper aims to introduce an R package spherepc that considers several dimension reduction techniques on a sphere, which encompass recently developed approaches such as SPC and LPG as well as some existing methods, and discuss how to implement these methods through spherepc.

Dimension reduction methods are widely used in various fields, including statistics and machine learning, by efficiently compressing data and removing noise (Benner et al. 2005). As one of the dimension reduction methods, the principal curves of Hastie and Stuetzle (1989) are suitable for fitting a curve or a surface of data in Euclidean space, which go through the middle of the data. Hauberg (2016) proposed an algorithm to find the principal curves in Riemannian manifolds based on the concept of the original principal curves. However, the principal curves proposed by Hauberg (2016) no longer represent the data continuously because of the approximation of the projection step required to fit the curves.

Recently, Lee et al. (2021a) proposed a new method, termed spherical principal curves (SPC), that constructs principal curves, ensuring a stationary property on spheres. SPC is useful for representing circular or waveform data with smaller reconstruction errors than conventional methods, including principal geodesic analysis (Fletcher et al. 2004), exact principal circle (Lee et al. 2021a), and principal curves proposed by Hauberg (2016). However, SPC has the disadvantage of being sensitive to initialization. As a result, there are some data structures that SPC does not apply to, for example, data with spirals, zigzags, or branches like tree-shape. A localized version of SPC called local principal geodesics (LPG) is being developed to resolve such a problem. A function for LPG is also provided in the package spherepc. Research on the LPG is underway in progress.

To the best of our knowledge, no available R packages offer the methods of dimension reduction and principal curves on a sphere. The existing R packages providing principal curves, such as princurve (Hastie and Weingessel 2015) and LPCM (Einbeck et al. 2015), are available only on Euclidean space, not on a sphere or (Riemannian) manifold. In addition, most dimension reduction methods on manifolds (Huckemann et al. 2010; Panaretos et al. 2014; Liu et al. 2017) involve somewhat complex optimizations. The proposed package spherepc for R provides the state-of-the-art principal curve technique on the sphere (Lee et al. 2021a) and comprehensively collects and implements the existing methods (Fletcher et al. 2004; Hauberg 2016).

The rest of this paper is organized as follows. The following section introduces the existing methods for dimension reduction on the sphere and relevant functions covered in the package spherepc, which is available on CRAN. Furthermore, their usages are discussed with examples in detail. Then, the spherical principal curves proposed by (Lee et al. 2021a) and principal curves of (Hauberg 2016) are briefly described. In addition, implementations of the SPC() and SPC.Hauberg() functions in the spherepc are presented. The subsequent section discusses the local principal geodesics (LPG) with the implementation of various simulated data, demonstrating its promising usability. In the application session, all the mentioned methods are performed to analyze real seismological data. Finally, conclusions are given in the last section.

2 Existing methods

Principal geodesic analysis

Principal geodesic analysis (PGA) proposed by Fletcher et al. (2004) can be regarded as a generalization of principal component analysis (PCA) to Riemannian manifolds. In particular, Fletcher et al. (2004) performed dimension reduction of data on the Cartesian product space of the manifolds. In detail, the data are projected onto the tangent spaces at the intrinsic means of each component of the manifolds; thus, the given data are approximated as points on Euclidean vector space, and subsequently, PCA is applied to the points. As a result, the dimension reduction can be performed through the inverse of the tangent projections.

The principal geodesic analysis can be implemented by the PGA() function available in the spherepc. The detailed usage of the PGA() function is described as follows.

    PGA(data, col1 = "blue", col2 = "red")

Before using the PGA() function, it requires loading the packages rgl (Adler and Murdoch 2020), sphereplot (Robotham 2013), and geosphere (Hijmans et al. 2017). The following codes yield an implementation of the PGA() function.

   #### for all simulated datasets, longitude and latitude are expressed in degrees
   #### example 1: half-great circle data
   > circle <- GenerateCircle(c(150, 60), radius = pi/2, T = 1000)
   > sigma <- 2                             # noise level
   > half.circle <- circle[circle[, 1] < 0, , drop = FALSE]
   > half.circle <- half.circle + sigma * rnorm(nrow(half.circle))
   > PGA(half.circle)

   #### example 2: S-shaped data
   # the dataset consists of two parts: lon ~ Uniform[0, 20], 
   # lat = sqrt(20 * lon - lon^2) + N(0, sigma^2), 
   # lon ~ Uniform[-20, 0], lat = -sqrt(-20 * lon - lon^2) + N(0, sigma^2)
   > n <- 500              
   > sigma <- 1                             # noise level
   > lon <- 60 * runif(n)  
   > lat <- (60 * lon - lon^2)^(1/2) + sigma * rnorm(n)
   > simul.S1 <- cbind(lon, lat)
   > lon2 <- -60 * runif(n)
   > lat2 <- -(-60 * lon2 - lon2^2)^(1/2) + sigma * rnorm(n)
   > simul.S2 <- cbind(lon2, lat2)
   > simul.S <- rbind(simul.S1, simul.S2)
   > PGA(simul.S)

Because a principal geodesic is always a great circle, the PGA() function is suitable for identifying the global data trend. The implementations of half-circle and S-shaped data are displayed in Figure 1, where the principal geodesic properly extracts the global trends in the half-great circle and S-shaped data, while it cannot identify the circular variations in the S-shaped case. In addition, the arguments and outputs of the PGA() function are described in Tables 1 and 2.

graphic without alt text