Overview

Materials data are calculated with Density Functional Theory (DFT) using two different exchange-correlation approximations, PBE and HSE. PBE is fast and efficient; HSE, on the other hand, is supposed to be more accurate but is also more time-consuming. Therefore, one should choose the method wisely in order to obtain accurate results without wasting CPU hours. Here we work with a dataset of formation energies calculated with both PBE and HSE, an important materials property that indicates the thermodynamic stability of a compound. To decide whether HSE is indeed more accurate, we apply several statistical analyses in this case study.

This project will cover the following topics:

  • Comparisons between PBE and HSE calculated data (Linear regression)
  • Overview on calculation errors between PBE and HSE
  • Normality tests for PBE/HSE errors (Shapiro-Wilk test)
  • Confidence interval for the mean error difference ($t$-test)
  • Variance test between PBE and HSE errors ($F$-test)
  • Independence test between methods and materials type ($\chi^2$-test & Fisher's exact test)
  • Error median test without removing outliers (Wilcoxon Test)
  • Bootstrapping
  • Skew Normal Distribution Fit

(This is an R notebook.)

In [1]:
# Uncomment this block if these packages are not installed
# install.packages("PerformanceAnalytics")
# install.packages("psych")
# install.packages("ggpubr")
# install.packages("moments")
# install.packages("fGarch")
# install.packages("boot")
In [2]:
# Get raw data from a CSV file
raw_d <- read.csv("formation_energy_data.csv")
In [3]:
head(raw_d)
  name        PBE        HSE      Exp      Type
  Ag2Te -0.027317  -0.160568  -0.1129 insulator
  AgBr  -0.320165  -0.502117  -0.5204 insulator
  AgCl  -0.417480  -0.572266  -0.6585 insulator
  AgF   -0.941097  -1.099089  -1.0516 insulator
  AgI   -0.275869  -0.413274  -0.3205 insulator
  AlB2  -0.042659  -0.046835  -0.4090 metal
In [4]:
## Get the summary information of each column
raw_d.plt <- raw_d[, c('PBE', 'HSE', 'Exp')]
summary(raw_d.plt)
      PBE               HSE               Exp         
 Min.   :-3.9302   Min.   :-4.1269   Min.   :-4.2424  
 1st Qu.:-1.2043   1st Qu.:-1.3661   1st Qu.:-1.3880  
 Median :-0.6839   Median :-0.7952   Median :-0.7499  
 Mean   :-0.8898   Mean   :-1.0110   Mean   :-1.0295  
 3rd Qu.:-0.3142   3rd Qu.:-0.3473   3rd Qu.:-0.3685  
 Max.   : 0.7182   Max.   : 0.7642   Max.   :-0.0011  

In this table, for each compound, we have the materials formation energy calculated by PBE, calculated by HSE, and measured by experiment.

In [5]:
# Visualize the data generally
library(psych)
options(repr.plot.width = 5, repr.plot.height = 5)
pairs.panels(raw_d.plt, scale=TRUE)
detach("package:psych", unload=TRUE)
In [6]:
## Clean up
rm(raw_d.plt)

Comparisons between PBE and HSE calculated data

We first investigate the correlation between data calculated using the PBE and HSE methods. Based on our physical intuition, we expect a strong linear correlation between these two sets of data. We use linear regression and Analysis of Variance (ANOVA) to verify this claim.

In [7]:
# Data preparation
data.pbe <- raw_d[, c('PBE')]
data.hse <- raw_d[, c('HSE')]

cat(sprintf("The sample size is %d.", nrow(raw_d)))
The sample size is 292.
In [8]:
# Apply Single Linear Regression to fit the data
data.lm <- lm(data.pbe ~ data.hse)

Our proposed model is:

    data.pbe = $\beta_1\cdot$data.hse+$\beta_0$

where $\beta_1$ is the slope and $\beta_0$ is the intercept.

In [9]:
# View the fitting results
data.lm$coefficients
(Intercept): 0.0304697211825741
data.hse:    0.910302205243096

Therefore, our fitted model is:

    data.pbe = 0.9103$\cdot$data.hse+0.0305
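
As a quick usage note, the fitted line can be used to map a given HSE value to the model's PBE estimate via predict(). A minimal sketch (the input value -1.0 eV is purely illustrative, not part of the original analysis):

    # Predict the PBE value corresponding to an illustrative HSE value;
    # newdata must name the predictor exactly as in the model formula
    predict(data.lm, newdata = data.frame(data.hse = -1.0))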

In [10]:
# Apply ANOVA for our linear regression results
anova(data.lm)
            Df     Sum Sq      Mean Sq  F value        Pr(>F)
data.hse     1 193.951877 193.95187737 15815.46 5.139091e-255
Residuals  290   3.556396   0.01226343       NA            NA
In [11]:
# Critical F value for this case
qf(0.05, df1 = 1, df2 = 290, lower.tail = FALSE)
3.87372423188624

We use ANOVA to test the slope of our model.

Hypothesis:

  • $H_0$: $\beta_1 = 0$
  • $H_A$: $\beta_1 \neq 0$

Sample data analysis:

  • Slope: $$\beta_1 = 0.9103$$
  • Mean square error: $$MSE=\frac{\sum(y_i-\hat{y_i})^2}{n-2}=\frac{3.556396}{290}=0.01226343$$
  • Regression mean square: $$MSR=\frac{\sum(\hat{y_i}-\bar{y})^2}{1}=\frac{193.951877}{1}=193.951877$$
  • $F$-statistic: $$F=\frac{MSR}{MSE}= \frac{193.951877}{0.01226343}=15815.46$$
  • $p$-value: $$ p(F>15815.46) \approx 0.000$$

Conclusion: Since the $p$-value is less than 0.05, we reject the null hypothesis and conclude that the slope cannot be zero. Alternatively, we can draw the same conclusion by comparing the $F$-statistic to the critical $F$ value, i.e. $F = 15815.46 > F_{0.05,1,290}=3.87$.
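
For completeness, the exact $p$-value of the observed $F$ statistic can be computed directly, mirroring the critical-value call above:

    # Upper-tail probability of the observed F statistic
    pf(15815.46, df1 = 1, df2 = 290, lower.tail = FALSE)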

In [12]:
# Complete summary of the linear regression model
summary(data.lm)
Call:
lm(formula = data.pbe ~ data.hse)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.74726 -0.05207 -0.00095  0.04476  0.52405 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.030470   0.009775   3.117  0.00201 ** 
data.hse    0.910302   0.007238 125.760  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1107 on 290 degrees of freedom
Multiple R-squared:  0.982,	Adjusted R-squared:  0.9819 
F-statistic: 1.582e+04 on 1 and 290 DF,  p-value: < 2.2e-16

A few points:

  • As is well known, there is a relationship between the $t$-statistic and the $F$-statistic: $t^2_{n-2}=F_{1, n-2}$. Here, $t=125.76$ and $F=15815.46\approx 125.76^2$, which satisfies this relation; a short check is sketched after this list.
  • The $p$-values for both the slope and the intercept are very small, which means both terms are statistically significant and cannot be ignored.
  • $R^2$ is 0.982, which also indicates a strong linear relationship.
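
A minimal sketch verifying the $t^2 = F$ relation directly from the fitted objects (assuming data.lm from the earlier cell is still in scope):

    # Extract t from the coefficient table and F from the ANOVA table
    t_val <- summary(data.lm)$coefficients["data.hse", "t value"]
    F_val <- anova(data.lm)["data.hse", "F value"]
    c(t_squared = t_val^2, F = F_val)  # the two values should agree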
In [13]:
# Visualize data
library(ggplot2)
options(repr.plot.width = 4.5, repr.plot.height =4.5)

ggplot(mapping = aes(x = data.hse, y = data.pbe)) + 
  geom_point(colour="black", alpha=0.8, size=1) +
  stat_smooth(method = "lm", fill = "lightblue", size=1, alpha=0.3)

detach("package:ggplot2", unload=TRUE)
In [14]:
# Clean up
rm(data.hse, data.pbe, data.lm)

Overview on calculation errors between PBE and HSE

Here we study the errors between calculated materials properties (PBE & HSE) and the experimentally measured values. We want to determine which method is more accurate.

In [15]:
# Data preparation
data.name <- raw_d[, c('name')]
data.pbe <- raw_d[, c('PBE')]
data.hse <- raw_d[, c('HSE')]
data.exp <- raw_d[, c('Exp')]
data.type <- raw_d[, c('Type')]

data.d_pbe <- data.pbe-data.exp
data.d_hse <- data.hse-data.exp

# Clean up
rm(data.pbe, data.hse, data.exp)
In [16]:
# General view of data in histograms
par(mfrow=c(1,2))
options(repr.plot.width = 8, repr.plot.height = 4)

hist(data.d_pbe, col='orange', xlab="Errors", main="PBE")
hist(data.d_hse, col='blue', xlab="Errors", main="HSE")
In [17]:
# General view of data in scatterplot and boxplots
options(repr.plot.width = 7, repr.plot.height = 7)
par(fig=c(0,0.85,0,0.85))
plot(data.d_pbe, data.d_hse, xlab="PBE errors",
  ylab="HSE errors", col="black", pch=21)
abline(v=0, col="orange", lty=3, lwd=2)
abline(h=0, col="blue", lty=3, lwd=2)
par(fig=c(0,0.85,0.55,1), new=TRUE)
boxplot(data.d_pbe, horizontal=TRUE, col="white", border="orange", pch=21, axes=FALSE)
par(fig=c(0.65,1,0,0.85),new=TRUE)
boxplot(data.d_hse, col="white", border="blue", pch=21, axes=FALSE)
In [18]:
# Of course, we can view boxplots side-by-side
options(repr.plot.width = 6, repr.plot.height = 4)
boxplot(data.d_pbe,data.d_hse,
        horizontal=TRUE,
        width=c(0.03, 0.03),
        names=c("PBE","HSE"),
        col=c("white","white"),
        border=c("orange", "blue"),
        xlab="Calculated errors", ylab="Method")

A few conclusions from viewing these plots:

  • HSE errors are more concentrated around zero than PBE errors.
  • HSE errors have more outliers than PBE errors.
  • Both error distributions might be approximately normal.

Normality tests for PBE/HSE errors

Before we further analyze our data, we first need to know whether the errors follow normal distributions. Therefore, we will start with the normality tests.

  1. Utilize q-q plot to visualize normality
  2. Apply Shapiro-Wilk normality test (remove outliers if necessary)
  3. Fit normal distribution parameters using Maximum Likelihood Estimators (MLE) method
In [19]:
# Compare distributions of PBE/HSE errors with normal distribution using q-q plot
suppressMessages(library(ggpubr))
options(repr.plot.width = 7, repr.plot.height = 4)

q1 <- ggqqplot(data.d_pbe, title="PBE errors q-q plot", col="orange")
q2 <- ggqqplot(data.d_hse, title="HSE errors q-q plot", col="blue")

ggarrange(q1, q2, nrow=1, ncol=2)

# Clean up 
rm(q1, q2)
detach("package:ggpubr", unload=TRUE)

As we can see, both samples fit a normal distribution relatively well.

In [20]:
# Apply Shapiro-Wilk Normality Test
shapiro.test(data.d_pbe)
shapiro.test(data.d_hse)
	Shapiro-Wilk normality test

data:  data.d_pbe
W = 0.91033, p-value = 3.463e-12
	Shapiro-Wilk normality test

data:  data.d_hse
W = 0.88321, p-value = 3.753e-14

Based on these tests, the $p$-values are quite small. Therefore, we should reject the null hypothesis and conclude that neither dataset follows a normal distribution. This result might be due to the existence of outliers. We will therefore remove the outliers and redo the normality tests.

The commonly used outlier detection tests, i.e. Grubbs's test and the Dixon Q test, cannot be used here because both rely on the assumption of normality. Therefore, we detect outliers using a non-parametric method: we remove a point if it lies more than $1.5\times IQR$ (interquartile range) below the first quartile or above the third quartile.

In [21]:
# Remove outliers
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

d_pbe_new <- remove_outliers(data.d_pbe)
d_hse_new <- remove_outliers(data.d_hse)
data.ro <- data.frame(name=data.name, pbe=d_pbe_new, hse=d_hse_new, type=data.type)

# Drop missing data
data.ro_clean <- data.ro[complete.cases(data.ro),]

nrow(data.ro_clean)

# Clean up
rm(data.ro, d_pbe_new, d_hse_new, remove_outliers)
276

After we drop the outliers, the sample size reduces to 276 (from 292).

In [22]:
# Re-apply normality test after removing outliers
shapiro.test(data.ro_clean$pbe)
shapiro.test(data.ro_clean$hse)
	Shapiro-Wilk normality test

data:  data.ro_clean$pbe
W = 0.99157, p-value = 0.1157
	Shapiro-Wilk normality test

data:  data.ro_clean$hse
W = 0.98673, p-value = 0.01197

After dropping the outliers, the PBE errors pass the normality test ($p = 0.12 > 0.05$). The HSE errors are borderline: normality is rejected at the 0.05 level ($p = 0.012$) but not at the 0.01 level. We proceed by treating both error distributions as approximately normal.

With (approximate) normality established, we can estimate the parameters of both distributions using maximum likelihood estimation (MLE).

The MLE derivations for normal distribution are shown as follows:

  • Likelihood function: $$ L(\mu, \sigma|x_1, x_2,\dots,x_n)=\prod_{i}^{n}L(\mu, \sigma|x_i)=\prod_{i}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}e^{-(x_i-\mu)^2/2\sigma^2} $$
  • Log likelihood function: $$ \textrm{ln}[L(\mu, \sigma|x_1, x_2,\dots,x_n)] =\textrm{ln}\left[\prod_{i}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}e^{-(x_i-\mu)^2/2\sigma^2}\right] =\sum_{i}^{n}\left[\textrm{ln}\left(\frac{1}{\sqrt{2\pi\sigma^2}}e^{-(x_i-\mu)^2/2\sigma^2}\right)\right]$$ $$ = \sum_{i}^{n}\left(-\frac{1}{2}\textrm{ln}(2\pi)-\textrm{ln}(\sigma)-\frac{(x_i-\mu)^2}{2\sigma^2}\right) =-\frac{n}{2}\textrm{ln}(2\pi)-n\textrm{ln}(\sigma)-\sum_{i}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}$$
  • Derivative w.r.t $\mu$: $$\frac{\partial}{\partial\mu}\textrm{ln}[L(\mu, \sigma|x_1, x_2,\dots,x_n)]=\sum_{i}^{n}\left(\frac{x_i-\mu}{\sigma^2}\right)=\frac{1}{\sigma^2}\left[\sum_{i}^{n}x_i-n\mu\right]=0$$ $$ \mu = \frac{\sum\limits_{i}^{n}x_i}{n} $$
  • Derivative w.r.t $\sigma$: $$ \frac{\partial}{\partial\sigma}\textrm{ln}[L(\mu, \sigma|x_1, x_2,\dots,x_n)]=-\frac{n}{\sigma}+\sum_{i}^{n}\frac{(x_i-\mu)^2}{\sigma^3}=-\frac{n}{\sigma}+\frac{1}{\sigma^3}\left[\sum_{i}^{n}(x_i-\mu)^2\right]=0$$ $$ \sigma=\sqrt{\frac{\sum\limits_{i}^{n}(x_i-\mu)^2}{n}}$$
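
As a sanity check on this derivation, the same estimates can be obtained by maximizing the log-likelihood numerically. A minimal sketch using optim (the log-sigma parametrization and starting values are our own choices, not part of the original analysis):

    # Negative log-likelihood of a normal model;
    # par = (mu, log_sigma), so sigma = exp(par[2]) stays positive
    negloglik <- function(par, x) {
      -sum(dnorm(x, mean = par[1], sd = exp(par[2]), log = TRUE))
    }
    fit <- optim(c(0, 0), negloglik, x = data.ro_clean$pbe)
    c(mu = fit$par[1], sigma = exp(fit$par[2]))  # numerical MLE
    
    # Closed-form MLEs; note the n (not n-1) denominator for sigma
    n <- length(data.ro_clean$pbe)
    c(mu = mean(data.ro_clean$pbe),
      sigma = sd(data.ro_clean$pbe) * sqrt((n - 1) / n))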
In [23]:
# Calculate sample mean & standard deviation 
mu_pbe <- mean(data.ro_clean$pbe)
sigma_pbe <- sd(data.ro_clean$pbe)

mu_hse <- mean(data.ro_clean$hse)
sigma_hse <- sd(data.ro_clean$hse)

cat(sprintf("mean(PBE)=%f, sd(PBE)=%f\n", mu_pbe, sigma_pbe))
cat(sprintf("mean(HSE)=%f, sd(HSE)=%f", mu_hse, sigma_hse))
mean(PBE)=0.126192, sd(PBE)=0.151008
mean(HSE)=-0.002488, sd(HSE)=0.153595

These values can be treated as the parameters of the fitted normal distributions. (Strictly, R's sd() uses the unbiased $n-1$ denominator rather than the MLE's $n$; for $n=276$ the difference is negligible.) We can therefore visualize the data by overlaying the fitted normal densities on the histograms.

In [24]:
# General view of data in histograms
par(mfrow=c(1,2))
options(repr.plot.width = 8, repr.plot.height = 4)

hist(data.ro_clean$pbe, density=20, breaks=20, prob=TRUE, col="orange",
     xlab="Errors", ylim=c(0,3.5), main="PBE")
curve(dnorm(x, mean=mu_pbe, sd=sigma_pbe), 
      col="orange", lwd=2, add=TRUE, yaxt="n")

hist(data.ro_clean$hse, density=20, breaks=20, prob=TRUE, col="blue", 
     xlab="Errors",ylim=c(0,3.5), main="HSE")
curve(dnorm(x, mean=mu_hse, sd=sigma_hse), 
      col="blue", lwd=2, add=TRUE, yaxt="n")
In [25]:
# Clean up
rm(mu_pbe, mu_hse, sigma_pbe, sigma_hse)

Confidence interval for the mean error difference ($t$-test)

For each material, we can compare which method provides results closer to the experimental measurements. Here we compute a confidence interval for the mean error difference to conclude which method tends to give smaller errors.

In [26]:
# Apply t-test
t.test(data.ro_clean$pbe, data.ro_clean$hse, paired = FALSE)
	Welch Two Sample t-test

data:  data.ro_clean$pbe and data.ro_clean$hse
t = 9.925, df = 549.84, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.1032126 0.1541474
sample estimates:
   mean of x    mean of y 
 0.126191540 -0.002488453 

Hence, we reject the null hypothesis that the true difference in means is zero, since the $p$-value is almost zero. The 95 percent confidence interval for the difference between the PBE and HSE mean errors is [0.1032, 0.1541].

Actually, the sample data are paired: for each material, we have both the PBE and HSE calculation errors. We have already concluded that both PBE and HSE errors (after removing outliers) follow normal distributions. However, the paired-sample $t$-test relies on the assumption that the differences between pairs follow a normal distribution. Therefore, we first have to apply a normality test to the differences.

In [27]:
# Apply SW normality test for difference
shapiro.test(data.ro_clean$pbe-data.ro_clean$hse)
	Shapiro-Wilk normality test

data:  data.ro_clean$pbe - data.ro_clean$hse
W = 0.91836, p-value = 3.889e-11

Since the differences do not follow a normal distribution, we cannot apply the paired-sample $t$-test here. However, we can perform a non-parametric paired-sample test, e.g. the Wilcoxon test, which is done in a later section.

Variance test between PBE and HSE errors ($F$-test)

To compare the variances of the PBE and HSE errors, we apply an $F$-test to check whether the two variances are equal.

In [28]:
# Apply F-test to test variance of the two data samples
var.test(data.ro_clean$pbe, data.ro_clean$hse)
	F test to compare two variances

data:  data.ro_clean$pbe and data.ro_clean$hse
F = 0.9666, num df = 275, denom df = 275, p-value = 0.7784
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.7627383 1.2249527
sample estimates:
ratio of variances 
         0.9666014 

The $p$-value of the $F$-test is 0.7784, which is greater than the significance level 0.05. We conclude that there is no significant difference between the two variances.
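
As a quick sanity check, the reported $F$ statistic is simply the ratio of the two sample variances:

    # Ratio of sample variances, matching the F value above (~0.9666)
    var(data.ro_clean$pbe) / var(data.ro_clean$hse)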

Independence test between methods and materials types ($\chi^2$-test & Fisher's exact test)

Based on the errors of PBE and HSE, we can tell which method is more accurate for each material. We now want to know whether there is a relationship between the preferred method and the materials type. To test this, we apply the $\chi^2$ test.

We first determine which method to prefer for each material, based on which gives the smaller error in magnitude.

In [29]:
# Create a new "method" column showing which method gives the smaller error
calc_method <- function(pbe_error, hse_error) {
    if (abs(pbe_error) < abs(hse_error)) "PBE" else "HSE"
}

data.ro_clean$method <- mapply(calc_method, data.ro_clean$pbe, data.ro_clean$hse)
In [30]:
# Create contingency table of method and type
# table() is base R, so no extra package is needed
data.method_type <- table(data.ro_clean$method, data.ro_clean$type)
data.method_type
     
      insulator metal
  HSE       106    68
  PBE        27    75

Contingency table of calculation method and materials type.

In [31]:
# Apply chi square test
chisq.test(data.method_type)
	Pearson's Chi-squared test with Yates' continuity correction

data:  data.method_type
X-squared = 29.201, df = 1, p-value = 6.526e-08

Since the $p$-value is less than the significance level (0.05), we reject the null hypothesis of independence. Thus, we conclude that there is a relationship between method and materials type: for different types of materials, we should consider different methods to achieve more accurate results.
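
To see where the dependence comes from, we can compare the observed counts with the expected counts under independence (a quick sketch; run before the table is removed in the clean-up cell):

    # Expected counts under the independence hypothesis
    chisq.test(data.method_type)$expected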

In [32]:
# Apply Fisher's Exact Test
fisher.test(data.method_type)
	Fisher's Exact Test for Count Data

data:  data.method_type
p-value = 3.005e-08
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 2.460225 7.699465
sample estimates:
odds ratio 
  4.305714 

Again, we draw the same conclusion using Fisher's exact test: we reject the null hypothesis since the $p$-value is very small.

In [33]:
# Clean up
rm(calc_method, data.method_type)

Error median test without removing outliers (Wilcoxon Test)

Without removing the outliers, the distributions of PBE and HSE errors are not normal. Therefore, we have to apply a non-parametric test to compare the median errors. A well-known choice is the paired-samples Wilcoxon test (also known as the Wilcoxon signed-rank test).

Note that we do not apply the other well-known non-parametric test, the Mann-Whitney U test (or Wilcoxon rank-sum test), because our data are paired.

In [34]:
wilcox.test(data.d_pbe, data.d_hse, paired = TRUE, alternative = "two.sided")
	Wilcoxon signed rank test with continuity correction

data:  data.d_pbe and data.d_hse
V = 40143, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0

We can conclude that the median error of PBE is significantly different from the median error of HSE with a $p$-value $\approx 0.0000$.
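
To quantify the shift rather than just test it, wilcox.test can also return an estimate and confidence interval for the location shift (the pseudomedian of the paired differences). A sketch:

    # Same test, now requesting an estimate and 95% CI for the shift
    wilcox.test(data.d_pbe, data.d_hse, paired = TRUE, conf.int = TRUE)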

Bootstrapping (non-parametric)

Without removing the outliers, we can also estimate the mean of the relative errors. Here, we define the relative error as the difference between the absolute PBE error and the absolute HSE error. The relative errors might not follow a normal distribution (we will perform a normality test to check). But using the non-parametric bootstrapping method, we can still estimate confidence intervals for the statistics we are interested in, such as the mean and standard deviation. Bootstrapping resamples the data with replacement, and it is very useful when we are not sure about the population distribution.

In [35]:
# Get "d_error" vector where D_error = abs(pbe_error)-abs(hse_error)
calc_d_error <- function(pbe_error, hse_error){
    return(abs(pbe_error) - abs(hse_error))
}

d_error = mapply(calc_d_error, data.d_pbe, data.d_hse)
In [36]:
# View the relative errors
options(repr.plot.width = 5, repr.plot.height = 4)
hist(d_error, col="steelblue", main="Histogram of relative errors", xlab="abs(PBE error)-abs(HSE error)")
In [37]:
# Again apply normality test
shapiro.test(d_error)
	Shapiro-Wilk normality test

data:  d_error
W = 0.97044, p-value = 1.032e-05

As we expected, the relative errors do not follow a normal distribution.
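
Before turning to the boot package, here is a minimal hand-rolled sketch of the same resampling idea (the seed is an added assumption for reproducibility):

    # Percentile bootstrap of the mean, without the boot package
    set.seed(42)
    boot_means <- replicate(5000, mean(sample(d_error, replace = TRUE)))
    quantile(boot_means, c(0.025, 0.975))  # simple percentile 95% CI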

In [38]:
# Apply bootstrapping for mean relative errors
library(boot)

df.d_error <- data.frame(de=d_error)

meanfunc <- function(data, i){
  d <- data[i, ]
  return(mean(d))   
}
results <- boot(df.d_error[, "de", drop=FALSE], statistic=meanfunc, R=5000)

options(repr.plot.width = 8, repr.plot.height = 4)
plot(results)
In [39]:
# Get the confidence interval of mean of relative errors
bc <- boot.ci(results, type="bca")
ci_left <- bc$bca[4]
ci_right <- bc$bca[5]
cat(sprintf("The 95%% confidence interval for mean of relative errors is [%f, %f].", ci_left, ci_right))

detach("package:boot", unload=TRUE)

# Clean up
rm(bc, ci_left, ci_right, results, meanfunc, df.d_error, d_error, calc_d_error)
The 95% confidence interval for mean of relative errors is [0.011694, 0.045427].

Therefore, we can again conclude that the mean of the relative errors (difference between PBE and HSE) is greater than zero, indicating that the HSE method produces smaller errors in general.

Skew Normal Distribution Fit

Without removing the outliers, the datasets do not follow normal distributions, so we consider fitting skew normal distributions instead. We first apply the D'Agostino skewness test to detect skewness, and then fit the sample data with a skew normal distribution.

In [40]:
# Apply D’Agostino Test
suppressMessages(library(moments))

agostino.test(data.d_pbe, alternative = "two.sided")
agostino.test(data.d_hse, alternative = "two.sided")

detach("package:moments", unload=TRUE)
	D'Agostino skewness test

data:  data.d_pbe
skew = 1.6498, z = 8.5967, p-value < 2.2e-16
alternative hypothesis: data have a skewness
	D'Agostino skewness test

data:  data.d_hse
skew = 1.8099, z = 9.1006, p-value < 2.2e-16
alternative hypothesis: data have a skewness

Therefore, we conclude that both error distributions are significantly skewed.
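
The reported skew value is the sample skewness $g_1 = m_3/m_2^{3/2}$; a quick hand computation for the PBE errors (a sketch, assuming this is the statistic agostino.test reports):

    # Sample skewness from the second and third central moments
    m2 <- mean((data.d_pbe - mean(data.d_pbe))^2)
    m3 <- mean((data.d_pbe - mean(data.d_pbe))^3)
    m3 / m2^(3/2)  # should be close to the reported skew = 1.6498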

In [41]:
suppressMessages(library(fGarch))

# Skew normal distribution fit
snd_pbe <- snormFit(data.d_pbe)
pbe_mean <- as.double(snd_pbe$par[1])
pbe_sd <- as.double(snd_pbe$par[2])
pbe_xi <- as.double(snd_pbe$par[3])

snd_hse <- snormFit(data.d_hse)
hse_mean <- as.double(snd_hse$par[1])
hse_sd <- as.double(snd_hse$par[2])
hse_xi <- as.double(snd_hse$par[3])

# Plot fitted distribution with histograms
par(mfrow=c(2,1))
options(repr.plot.width = 6, repr.plot.height = 8)
hist(data.d_pbe,n=50,probability = TRUE, border = "orange", col = "white", main="PBE", xlab="Errors")
text(1.0, y=2.0, labels=sprintf("location: %.3f", snd_pbe$par[1]))
text(1.0, y=1.6, labels=sprintf("scale: %.3f", snd_pbe$par[2]))
text(1.0, y=1.2, labels=sprintf("skewness: %.3f", snd_pbe$par[3]))
box()
x_pbe = seq(min(data.d_pbe), max(data.d_pbe), length = 292)
lines(x_pbe, dsnorm(x_pbe, pbe_mean, pbe_sd, pbe_xi), lwd = 2, col="orange")

hist(data.d_hse,n=50,probability = TRUE, border = "blue", col = "white", main="HSE", xlab="Errors")
text(1.0, y=2.3, labels=sprintf("location: %.3f", snd_hse$par[1]))
text(1.0, y=1.8, labels=sprintf("scale: %.3f", snd_hse$par[2]))
text(1.0, y=1.3, labels=sprintf("skewness: %.3f", snd_hse$par[3]))
box()
x_hse = seq(min(data.d_hse), max(data.d_hse), length = 292)
lines(x_hse, dsnorm(x_hse, hse_mean, hse_sd, hse_xi), lwd = 2, col="blue")
In [42]:
# Clean up 
detach("package:fGarch", unload=TRUE)
rm(pbe_mean, hse_mean, pbe_sd, hse_sd, pbe_xi, hse_xi, snd_pbe, snd_hse, x_pbe, x_hse)