Hands-on Exercise 6 Part Two

Author

Oh Jia Wen

Published

May 20, 2023

Modified

May 20, 2023

Getting started

Using p_load() of pacman package to load the required libraries

pacman::p_load(corrplot, ggstatsplot, tidyverse, GGally, 
               parallelPlot, treemap, treemapify)

Importing Data

wine <- read_csv("data/wine_quality.csv")
realis2018 <- read_csv("data/realis2018.csv")

1. Plotting Correlation Matrix: pairs() method

Create a scatterplot matrix by using the pairs function of R Graphics.

1.1 Basic Correlation Matrix

pairs(wine[,1:11])

Another example. This time, it uses column 2 to 12 to the build the scatter plot matrix.

pairs(wine[,2:12])

1.2 Lower Half of Correlation Matrix

Since correlation matrix is symmetric, we would like to see the lower half or upper half. Thus, we would need to choose between upper.panel = NULL or lower.panel = NULL .

pairs(wine[,2:12], upper.panel = NULL)

1.3 Show Correlation Coefficient

panel.cor function will be used to show the correlation coefficient of each pair of variables instead of a scatter plot.

panel.cor <- function(x, y, digits=2, prefix="", cex.cor, ...) {
usr <- par("usr")
on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y, use="complete.obs"))
txt <- format(c(r, 0.123456789), digits=digits)[1]
txt <- paste(prefix, txt, sep="")
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * (1 + r) / 2)
}

pairs(wine[,2:12], 
      upper.panel = panel.cor)

2. Plotting Correlation Matrix: ggcormat()

We will be using ggcorrmat() of ggstatsplot package.

#|fig.height: 12
#|fig.width: 6

ggstatsplot::ggcorrmat(
  data = wine, 
  cor.vars = 1:11,
  ggcorrplot.args = list(outline.color = "black", 
                         hc.order = TRUE,
                         tl.cex = 10),
  title    = "Correlogram for wine dataset",
  subtitle = "Four pairs are no significant at p < 0.05"
)

#help to control specific component of the plot such 
ggplot.component = list(
    theme(text=element_text(size=5),
      axis.text.x = element_text(size = 8),
      axis.text.y = element_text(size = 8)))

2. Plotting Correlation Matrix: ggcormat()

ggstatsplot package supports faceting. However the feature is not available in ggcorrmat() but in the grouped_ggcorrmat() of ggstatsplot.

#|fig.height: 12
#|fig.width: 6
grouped_ggcorrmat(
  data = wine,
  cor.vars = 1:11,
  grouping.var = type,        #to build facet plot
  type = "robust",
  p.adjust.method = "holm",
  
  #provides list of additional arguments
  plotgrid.args = list(ncol = 2),       
  ggcorrplot.args = list(outline.color = "black", 
                         hc.order = TRUE,
                         tl.cex = 10),
  
  #calling plot annotations arguments of patchwork
  annotation.args = list(               
    tag_levels = "a",
    title = "Correlogram for wine dataset",
    subtitle = "The measures are: alcohol, sulphates, fixed acidity, citric acid, chlorides, residual sugar, density, free sulfur dioxide and volatile acidity",
    caption = "Dataset: UCI Machine Learning Repository"
  )
)

3. Plotting Corrplot package: corrplot()

We will be using corrplot() of the corrplot package. Check out -> An Introduction to corrplot Package for basic understanding of corrplot package.

wine.cor <- cor(wine[, 1:11])

Next, corrplot() is used to plot the corrgram by using all the default setting as shown in the code chunk below.

corrplot(wine.cor)

3.1 Further Visualization with Corrplot()

There are seven visual geometrics (parameter method) can be used to encode the attribute values. They are: circle, square, ellipse, number, shade, color and pie. The default is circle.

Example 1:

corrplot(wine.cor, 
         method = "ellipse",  #default is circle 
         type="lower",        #default is full
         diag = FALSE,        #turn off diagonal celss
         tl.col = "black")    #change axis text label color to black

Example 2:

corrplot.mixed(wine.cor, 
               lower = "ellipse", 
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               tl.col = "black")

Example 3: Combine corrgram with significant test

wine.sig = cor.mtest(wine.cor, conf.level= .95)

corrplot(wine.cor,
         method = "number",
         type = "lower",
         diag = FALSE,
         tl.col = "black",
         tl.srt = 45,
         p.mat = wine.sig$p,  #input the calculated conf.level
         sig.level = .05)

Reordering corrgram:

Matrix reorder is very important for mining the hiden structure and pattern in a corrgram. By default, the order of attributes of a corrgram is sorted according to the correlation matrix (i.e. “original”). The default setting can be over-write by using the order argument of corrplot(). Currently, corrplot package support four sorting methods, they are:

“AOE” is for the angular order of the eigenvectors. See Michael Friendly (2002) for details.
“FPC” for the first principal component order.
“hclust” for hierarchical clustering order, and “hclust.method” for the agglomeration method to be used.
- “hclust.method” should be one of “ward”, “single”, “complete”, “average”, “mcquitty”, “median” or “centroid”.
“alphabet” for alphabetical order.

“AOE”, “FPC”, “hclust”, “alphabet”. More algorithms can be found in seriation package.

corrplot.mixed(wine.cor, 
               lower = "ellipse", 
               upper = "number",
               tl.pos = "lt",
               diag = "l",        #instead of diag, change to order = "hclust"
               order="AOE",   #order_type.  then hcluster.method = "ward.D"
               tl.col = "black")

4. Plotting Heatmap

Using p_load() of pacman package to load the required libraries

pacman::p_load(seriation, dendextend, heatmaply, tidyverse)

Importing data

wh <- read_csv("data/WHData-2018.csv")

Preparing the data

row.names(wh) <- wh$Country

Transform to Matrix

wh1 <- dplyr::select(wh, c(3, 7:12))
wh_matrix <- data.matrix(wh)

4.1 heatmap() of R Stats

We will be using heatmap() of Base Stats.

wh_heatmap <- heatmap(wh_matrix,
                      Rowv=NA, Colv=NA)

To plot cluster heatmap,

wh_heatmap <- heatmap(wh_matrix)

4.2 Interactive Heatmap

We will be using heatmaply in this section.

heatmaply supports a variety of hierarchical clustering algorithm. The main arguments provided are:

distfun: function used to compute the distance (dissimilarity) between both rows and columns. Defaults to dist. The options “pearson”, “spearman” and “kendall” can be used to use correlation-based clustering, which uses as.dist(1 - cor(t(x))) as the distance metric (using the specified correlation method).
hclustfun: function used to compute the hierarchical clustering when Rowv or Colv are not dendrograms. Defaults to hclust.
dist_method default is NULL, which results in “euclidean” to be used. It can accept alternative character strings indicating the method to be passed to distfun. By default distfun is “dist”” hence this can be one of “euclidean”, “maximum”, “manhattan”, “canberra”, “binary” or “minkowski”.
hclust_method default is NULL, which results in “complete” method to be used. It can accept alternative character strings indicating the method to be passed to hclustfun. By default hclustfun is hclust hence this can be one of “ward.D”, “ward.D2”, “single”, “complete”, “average” (= UPGMA), “mcquitty” (= WPGMA), “median” (= WPGMC) or “centroid” (= UPGMC).

Clustering: Manual Approach

In the code chunk below, the heatmap is plotted by using hierachical clustering algorithm with “Euclidean distance” and “ward.D” method.

heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]),
          dist_method = "euclidean",
          hclust_method = "ward.D")

Clustering: Statistical Approach

In order to determine the best clustering method and number of cluster the dend_expend() and find_k() functions of dendextend package will be used.

First, the dend_expend() will be used to determine the recommended clustering method to be used.

wh_d <- dist(normalize(wh_matrix[, -c(1, 2, 4, 5)]), method = "euclidean")
dend_expend(wh_d)[[3]]

  dist_methods hclust_methods     optim
1      unknown         ward.D 0.6137851
2      unknown        ward.D2 0.6289186
3      unknown         single 0.4774362
4      unknown       complete 0.6434009
5      unknown        average 0.6701688
6      unknown       mcquitty 0.5020102
7      unknown         median 0.5901833
8      unknown       centroid 0.6338734

The output table shows that “average” method should be used because it gave the high optimum value.

Next, find_k() is used to determine the optimal number of cluster.

wh_clust <- hclust(wh_d, method = "average")
num_k <- find_k(wh_clust)
plot(num_k)

Figure above shows that k=3 would be good.

With reference to the statistical analysis results, we can prepare the code chunk as shown below.

heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]),
          dist_method = "euclidean",
          hclust_method = "average",
          k_row = 3)

Seriation:

heatmaply uses the seriation package to find an optimal ordering of rows and columns.

heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]),
          seriate = "OLO")

The default options is “OLO” (Optimal leaf ordering) which optimizes the above criterion (in O(n^4)). Another option is “GW” (Gruvaeus and Wainer) which aims for the same goal but uses a potentially faster heuristic.

Final outcome:

heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]),
          Colv=NA,
          seriate = "none",
          colors = Blues,
          k_row = 5,
          margins = c(NA,200,60,NA),
          fontsize_row = 4,
          fontsize_col = 5,
          main="World Happiness Score and Variables by Country, 2018 \nDataTransformation using Normalise Method",
          xlab = "World Happiness Indicators",
          ylab = "World Countries"
          )

5. Plotting parallel coordinates

wh <- read_csv("data/WHData-2018.csv")

5.1 With box plot

Refer to url for more explanation.

ggparcoord(data = wh, 
           columns = c(7:12), 
           groupColumn = 2,
           scale = "uniminmax",
           alphaLines = 0.2,
           boxplot = TRUE, 
           title = "Parallel Coordinates Plot of World Happines Variables")

5.2 With Facet Wrap

ggparcoord(data = wh, 
           columns = c(7:12), 
           groupColumn = 2,
           scale = "uniminmax",
           alphaLines = 0.2,
           boxplot = TRUE, 
           title = "Multiple Parallel Coordinates Plots of World Happines Variables by Region") +
  facet_wrap(~ Region)

5.3 Interactive Parallel Coordinates Plot

parallelPlotis used in this section.

wh_i <- wh |> 
  select("Happiness score", c(7:12))

histo <- rep(TRUE, ncol(wh_i))

parallelPlot(wh_i,
             continuousCS = "YlOrRd",
             rotateTitle = TRUE,
             histoVisibility = histo)

6. Plotting treemap

Grouped summaries with pipe

realis2018_summarised <- realis2018 %>% 
  group_by(`Project Name`,`Planning Region`, 
           `Planning Area`, `Property Type`, 
           `Type of Sale`) %>%
  summarise(`Total Unit Sold` = sum(`No. of Units`, na.rm = TRUE), 
            `Total Area` = sum(`Area (sqm)`, na.rm = TRUE),
            `Median Unit Price ($ psm)` = median(`Unit Price ($ psm)`, na.rm = TRUE),
            `Median Transacted Price` = median(`Transacted Price ($)`, na.rm = TRUE))

Thereafter, we filter for condominium in property type and resale type of flat.

realis2018_selected <- realis2018_summarised %>%
  filter(`Property Type` == "Condominium", `Type of Sale` == "Resale")

As the file have been installed previous in Take Home Exercise 2, we will not install the github again.

#library(devtools)
#install_github("timelyportfolio/d3treeR")

library(d3treeR) #package have been installed in Take Home Exercise 2

6.1 Plotting interactive treemap with d3treeR

tm <- treemap(realis2018_summarised,
        index=c("Planning Region", "Planning Area"),
        vSize="Total Unit Sold",
        vColor="Median Unit Price ($ psm)",
        type="value",
        title="Private Residential Property Sold, 2017",
        title.legend = "Median Unit Price (S$ per sq. m)"
        )

d3tree(tm,rootname = "Singapore" )