Materials data are calculated by Density Functional Theory (DFT) using two different exchange-correlation approximations, PBE and HSE. PBE is fast and efficient; HSE, on the other hand, is supposed to be more accurate but is also time-consuming. Therefore, one should choose the method wisely in order to get accurate results without wasting a lot of CPU hours. Here we work with a dataset of formation energies calculated with both PBE and HSE; the formation energy is an important materials property indicating the thermodynamic stability of a compound. In order to decide whether HSE is indeed more accurate, we will apply several statistical analyses in this case study.
This project will cover the following topics: simple linear regression and ANOVA, normality testing (Q-Q plots and the Shapiro-Wilk test), outlier removal, maximum likelihood estimation, $t$-tests and the $F$-test, the $\chi^2$ test and Fisher's exact test, the Wilcoxon signed-rank test, bootstrapping, and skew normal distribution fitting.
(This is an R notebook.)
# Uncomment this block if these packages are not installed
# install.packages("PerformanceAnalytics")
# install.packages("psych")
# install.packages("ggpubr")
# install.packages("moments")
# install.packages("fGarch")
# install.packages("boot")
# Get raw data from a CSV file
raw_d <- read.csv("formation_energy_data.csv")
head(raw_d)
## Get the summary information of each column
raw_d.plt <- raw_d[, c('PBE', 'HSE', 'Exp')]
summary(raw_d.plt)
In this table, for each compound, we have the formation energy calculated by PBE and HSE, and measured by experiment.
# Visualize the data generally
library(psych)
options(repr.plot.width = 5, repr.plot.height = 5)
pairs.panels(raw_d.plt, scale=TRUE)
detach("package:psych", unload=TRUE)
## Clean up
rm(raw_d.plt)
We first investigate the correlation between data calculated using the PBE and HSE methods. Based on our physical intuition, we expect a strong linear correlation between these two sets of data. We will use Analysis of Variance (ANOVA) to verify this claim.
# Data preparation
data.pbe <- raw_d[, c('PBE')]
data.hse <- raw_d[, c('HSE')]
cat(sprintf("The sample size is %d.", nrow(raw_d)))
# Apply simple linear regression to fit the data
data.lm <- lm(data.pbe ~ data.hse)
Our proposed model is:
$$\text{data.pbe} = \beta_1 \cdot \text{data.hse} + \beta_0$$
where $\beta_1$ is the slope and $\beta_0$ is the intercept.
# View the fitting results
data.lm$coefficients
Therefore, our fitted model is:
$$\text{data.pbe} = 0.9103 \cdot \text{data.hse} + 0.0305$$
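As a quick sanity check, we can use the fitted model to predict PBE energies from a few HSE values (the inputs below are arbitrary illustrative energies, not values from the dataset):
# Predictions should match 0.9103 * x + 0.0305 up to rounding
predict(data.lm, newdata = data.frame(data.hse = c(-2, -1, 0)))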
# Apply ANOVA for our linear regression results
anova(data.lm)
# Critical F value for this case
qf(0.05, df1 = 1, df2 = 290, lower.tail = FALSE)
We use ANOVA to test the slope of our model.
Hypothesis: $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$.
Sample data analysis: from the ANOVA table above, $F = 15814.46$ with a $p$-value far below 0.05.
Conclusion: Since the $p$-value is less than 0.05, we reject the null hypothesis and conclude that the slope is not zero. Alternatively, we can draw the same conclusion by comparing the $F$-statistic to the critical $F$ value, i.e., $F = 15814.46 > F_{0.05,1,290} = 3.87$.
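As a cross-check, the $p$-value implied by the observed $F$-statistic is simply the upper-tail probability of the $F_{1,290}$ distribution:
# p-value implied by the observed F statistic
pf(15814.46, df1 = 1, df2 = 290, lower.tail = FALSE)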
# Complete summary of the linear regression model
summary(data.lm)
A few points from this summary: the fitted slope (0.9103) is close to 1 and the intercept (0.0305) is small, consistent with strong agreement between the two methods; for simple linear regression the $t$-statistic of the slope satisfies $t^2 = F$; and $R^2 = F/(F + 290) \approx 0.982$, i.e., the linear model explains about 98% of the variance.
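A minimal check of the two identities above, using the fit from earlier in this notebook:
# t^2 for the slope equals the ANOVA F; R^2 = F / (F + df2), with df2 = n - 2
t_slope <- summary(data.lm)$coefficients["data.hse", "t value"]
print(c(t_squared = t_slope^2, R2_from_F = 15814.46 / (15814.46 + 290)))
rm(t_slope)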
# Visualize data
library(ggplot2)
options(repr.plot.width = 4.5, repr.plot.height =4.5)
ggplot(mapping = aes(x = data.hse, y = data.pbe)) +
    geom_point(colour = "black", alpha = 0.8, size = 1) +
    stat_smooth(method = "lm", fill = "lightblue", size = 1, alpha = 0.3)
detach("package:ggplot2", unload=TRUE)
# Clean up
rm(data.hse, data.pbe, data.lm)
Here we study the errors between the calculated materials properties (PBE & HSE) and the experimentally measured values. We want to determine which method is more accurate.
# Data preparation
data.name <- raw_d[, c('name')]
data.pbe <- raw_d[, c('PBE')]
data.hse <- raw_d[, c('HSE')]
data.exp <- raw_d[, c('Exp')]
data.type <- raw_d[, c('Type')]
data.d_pbe <- data.pbe-data.exp
data.d_hse <- data.hse-data.exp
# Clean up
rm(data.pbe, data.hse, data.exp)
# General view of data in histograms
par(mfrow=c(1,2))
options(repr.plot.width = 8, repr.plot.height = 4)
hist(data.d_pbe, col='orange', xlab="Errors", main="PBE")
hist(data.d_hse, col='blue', xlab="Errors", main="HSE")
# General view of data in scatterplot and boxplots
options(repr.plot.width = 7, repr.plot.height = 7)
par(fig=c(0,0.85,0,0.85))
plot(data.d_pbe, data.d_hse, xlab="PBE errors",
ylab="HSE errors", col="black", pch=21)
abline(v=0, col="orange", lty=3, lwd=2)
abline(h=0, col="blue", lty=3, lwd=2)
par(fig=c(0,0.85,0.55,1), new=TRUE)
boxplot(data.d_pbe, horizontal=TRUE, col="white", border="orange", pch=21, axes=FALSE)
par(fig=c(0.65,1,0,0.85),new=TRUE)
boxplot(data.d_hse, col="white", border="blue", pch=21, axes=FALSE)
# Of course, we can view boxplots side-by-side
options(repr.plot.width = 6, repr.plot.height = 4)
boxplot(data.d_pbe,data.d_hse,
horizontal=TRUE,
width=c(0.03, 0.03),
names=c("PBE","HSE"),
col=c("white","white"),
border=c("orange", "blue"),
xlab="Calculated errors", ylab="Method")
A few conclusions from viewing these plots: both error distributions are roughly bell-shaped and centered near zero, and the boxplots reveal a number of outliers in both the PBE and HSE errors.
Before we further analyze our data, we first need to know whether the errors follow normal distributions. Therefore, we will start with the normality tests.
# Compare distributions of PBE/HSE errors with normal distribution using q-q plot
suppressMessages(library(ggpubr))
options(repr.plot.width = 7, repr.plot.height = 4)
q1 <- ggqqplot(data.d_pbe, title="PBE errors q-q plot", col="orange")
q2 <- ggqqplot(data.d_hse, title="HSE errors q-q plot", col="blue")
ggarrange(q1, q2, nrow=1, ncol=2)
# Clean up
rm(q1, q2)
detach("package:ggpubr", unload=TRUE)
As we can see, both samples fit a normal distribution reasonably well.
# Apply Shapiro-Wilk Normality Test
shapiro.test(data.d_pbe)
shapiro.test(data.d_hse)
Based on these tests, the $p$-values are quite small. Therefore, we should reject the null hypothesis and conclude that both datasets fail to follow the normal distribution. This result might be due to the existence of outliers. We will therefore remove the outliers and redo the normality tests.
The commonly used outlier detection tests, i.e., Grubbs's test and the Dixon Q test, cannot be used here because both rely on the assumption of a normal distribution. Therefore, we should detect the outliers using a non-parametric method. Here, we will remove a value as an outlier if it falls outside the range $[Q_1 - 1.5\,\mathrm{IQR},\ Q_3 + 1.5\,\mathrm{IQR}]$, where $\mathrm{IQR}$ is the interquartile range.
# Remove outliers
remove_outliers <- function(x, na.rm = TRUE, ...) {
    qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)  # Q1 and Q3
    H <- 1.5 * IQR(x, na.rm = na.rm)                             # 1.5 * interquartile range
    y <- x
    y[x < (qnt[1] - H)] <- NA  # flag values below Q1 - 1.5*IQR
    y[x > (qnt[2] + H)] <- NA  # flag values above Q3 + 1.5*IQR
    y
}
d_pbe_new <- remove_outliers(data.d_pbe)
d_hse_new <- remove_outliers(data.d_hse)
data.ro <- data.frame(name=data.name, pbe=d_pbe_new, hse=d_hse_new, type=data.type)
# Drop missing data
data.ro_clean <- data.ro[complete.cases(data.ro),]
nrow(data.ro_clean)
# Clean up
rm(data.ro, d_pbe_new, d_hse_new, remove_outliers)
After dropping the outliers, the sample size is reduced to 276 (from 292).
# Re-apply normality test after removing outliers
shapiro.test(data.ro_clean$pbe)
shapiro.test(data.ro_clean$hse)
After dropping the outliers, both datasets pass the normality test (the $p$-values are now above 0.05), so we can treat both error distributions as normal.
Having confirmed normality, we can estimate the parameters of both distributions by maximum likelihood estimation (MLE).
For a normal distribution, the MLE has a closed form: $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})^2$, i.e., the sample mean and the (biased) sample variance.
# Calculate sample mean & standard deviation
mu_pbe <- mean(data.ro_clean$pbe)
sigma_pbe <- sd(data.ro_clean$pbe)
mu_hse <- mean(data.ro_clean$hse)
sigma_hse <- sd(data.ro_clean$hse)
cat(sprintf("mean(PBE)=%f, sd(PBE)=%f\n", mu_pbe, sigma_pbe))
cat(sprintf("mean(HSE)=%f, sd(HSE)=%f", mu_hse, sigma_hse))
These values can be treated as the parameters of the fitted normal distributions. Therefore, we can visualize the data by overlaying the fitted normal density on top of each histogram.
# General view of data in histograms
par(mfrow=c(1,2))
options(repr.plot.width = 8, repr.plot.height = 4)
hist(data.ro_clean$pbe, density=20, breaks=20, prob=TRUE, col="orange",
xlab="Errors", ylim=c(0,3.5), main="PBE")
curve(dnorm(x, mean=mu_pbe, sd=sigma_pbe),
col="orange", lwd=2, add=TRUE, yaxt="n")
hist(data.ro_clean$hse, density=20, breaks=20, prob=TRUE, col="blue",
xlab="Errors",ylim=c(0,3.5), main="HSE")
curve(dnorm(x, mean=mu_hse, sd=sigma_hse),
col="blue", lwd=2, add=TRUE, yaxt="n")
# Clean up
rm(mu_pbe, mu_hse, sigma_pbe, sigma_hse)
For each material, we can ask which method gives results closer to the experimental measurements. Here we want a confidence interval for the difference in mean errors, to conclude which method tends to produce smaller errors.
# Apply t-test
t.test(data.ro_clean$pbe, data.ro_clean$hse, paired = FALSE)
Hence, we reject the null hypothesis that the true difference in means is zero, since the $p$-value is essentially zero. The 95 percent confidence interval for the difference between the mean PBE error and the mean HSE error is [0.1032, 0.1541].
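For reference, this interval can be reproduced by hand with the Welch two-sample formula that `t.test()` uses by default (a minimal sketch):
# Welch CI: (difference in means) +/- t_{0.975, df} * standard error
x <- data.ro_clean$pbe; y <- data.ro_clean$hse
se2x <- var(x) / length(x); se2y <- var(y) / length(y)
df_w <- (se2x + se2y)^2 / (se2x^2 / (length(x) - 1) + se2y^2 / (length(y) - 1))
ci <- mean(x) - mean(y) + c(-1, 1) * qt(0.975, df_w) * sqrt(se2x + se2y)
print(ci)
rm(x, y, se2x, se2y, df_w, ci)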
Actually, the sample data are paired: for each material we have both a PBE and an HSE calculation error. We have already concluded that both PBE and HSE errors follow normal distributions. However, the paired-sample $t$-test is based on the assumption that the differences between pairs follow a normal distribution, so we first have to apply a normality test to the differences.
# Apply SW normality test for difference
shapiro.test(data.ro_clean$pbe-data.ro_clean$hse)
Since the differences do not follow a normal distribution, we cannot apply the paired-sample $t$-test here. However, we can perform a non-parametric paired-sample test, e.g., the Wilcoxon signed-rank test, which is done in a later section.
To compare the variances of the PBE and HSE errors, we apply the $F$-test to check whether the two variances are equal.
# Apply F-test to test variance of the two data samples
var.test(data.ro_clean$pbe, data.ro_clean$hse)
The $p$-value of the $F$-test is 0.7784, which is greater than the significance level of 0.05. We therefore conclude that there is no significant difference between the two variances.
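As a cross-check, the $F$-statistic is simply the ratio of the two sample variances, which can be compared against the two-sided critical values of the $F_{275,275}$ distribution:
# Ratio of sample variances vs. the 2.5% / 97.5% critical values
f_stat <- var(data.ro_clean$pbe) / var(data.ro_clean$hse)
n <- nrow(data.ro_clean)
print(c(F = f_stat, crit_lo = qf(0.025, n - 1, n - 1), crit_hi = qf(0.975, n - 1, n - 1)))
rm(f_stat, n)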
Based on the errors of PBE and HSE, we know which method is more accurate for each material. However, we also want to know whether there is a relationship between the better method and the material type. To test this hypothesis, we apply the $\chi^2$ test of independence.
We will first determine which method to prefer for each material based on the calculated errors (whichever gives the smaller error in magnitude).
# Create a new "method" column showing which method provides the smaller error
calc_method <- function(pbe_error, hse_error) {
    # Return the name of the method with the smaller absolute error
    if (abs(pbe_error) < abs(hse_error)) "PBE" else "HSE"
}
data.ro_clean$method <- mapply(calc_method, data.ro_clean$pbe, data.ro_clean$hse)
# Create contingency table of method and type
# table() is part of base R, so no extra package is needed
data.method_type <- table(data.ro_clean$method, data.ro_clean$type)
data.method_type
Contingency table of calculation method and material type.
# Apply chi square test
chisq.test(data.method_type)
Since the $p$-value is less than the significance level (0.05), we reject the null hypothesis of independence. Thus, we conclude that there is a relationship between method and material type: for different types of materials, different methods should be considered to achieve more accurate results.
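Before trusting the $\chi^2$ approximation, it is worth verifying that the expected cell counts are not too small (a common rule of thumb is at least 5); when they are small, Fisher's exact test below is the safer choice:
# Expected cell counts under the independence hypothesis
chisq.test(data.method_type)$expected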
# Apply Fisher's Exact Test
fisher.test(data.method_type)
Again, we can draw the same conclusion using Fisher's exact test, where we reject the null hypothesis since the $p$-value is very small.
# Clean up
rm(calc_method, data.method_type)
Without removing the outliers, the distributions of PBE and HSE errors are not normal. Therefore, we have to apply a non-parametric test to compare the median errors. The appropriate test here is the paired-samples Wilcoxon test (also known as the Wilcoxon signed-rank test).
Note that we do not apply the other well-known non-parametric test, the Mann-Whitney U test (or Wilcoxon rank-sum test), because our dataset is paired.
wilcox.test(data.d_pbe, data.d_hse, paired = TRUE, alternative = "two.sided")
We can conclude that the median error of PBE is significantly different from the median error of HSE, with a $p$-value close to zero.
Without removing the outliers, we can also estimate the mean of the relative errors. Here, we define the relative error as the difference between the absolute PBE error and the absolute HSE error. The relative errors might not follow a normal distribution (we will perform a normality test to be sure), but using the non-parametric bootstrapping method we can still estimate confidence intervals for the statistics we are interested in, such as the mean and standard deviation. Bootstrapping resamples the data with replacement, and it is very useful when we are not sure about the population distribution.
# Get "d_error" vector where D_error = abs(pbe_error)-abs(hse_error)
calc_d_error <- function(pbe_error, hse_error){
return(abs(pbe_error) - abs(hse_error))
}
d_error = mapply(calc_d_error, data.d_pbe, data.d_hse)
# View the relative errors
options(repr.plot.width = 5, repr.plot.height = 4)
hist(d_error, col="steelblue", main="Histogram of relative errors", xlab="abs(PBE error)-abs(HSE error)")
# Again apply normality test
shapiro.test(d_error)
As we expected, the relative errors do not follow a normal distribution.
# Apply bootstrapping for mean relative errors
library(boot)
df.d_error <- data.frame(de=d_error)
meanfunc <- function(data, i){
d <- data[i, ]
return(mean(d))
}
results <- boot(df.d_error[, "de", drop=FALSE], statistic=meanfunc, R=5000)
options(repr.plot.width = 8, repr.plot.height = 4)
plot(results)
# Get the confidence interval of mean of relative errors
bc <- boot.ci(results, type="bca")
ci_left <- bc$bca[4]
ci_right <- bc$bca[5]
cat(sprintf("The 95%% confidence interval for mean of relative errors is [%f, %f].", ci_left, ci_right))
detach("package:boot", unload=TRUE)
# Clean up
rm(bc, ci_left, ci_right, results, meanfunc, df.d_error, d_error, calc_d_error)
Therefore, we can again conclude that the mean of the relative errors (the difference between the absolute PBE and HSE errors) is greater than zero, indicating that the HSE method produces smaller errors in general.
Without removing the outliers, the datasets do not follow normal distributions, so we consider fitting skew normal distributions instead. We will first apply the D'Agostino skewness test to detect skewness; then we will fit the sample data with skew normal distributions.
# Apply D’Agostino Test
suppressMessages(library(moments))
agostino.test(data.d_pbe, alternative = "two.sided")
agostino.test(data.d_hse, alternative = "two.sided")
detach("package:moments", unload=TRUE)
Therefore, we can conclude that both error distributions are skewed.
suppressMessages(library(fGarch))
# Skew normal distribution fit
snd_pbe <- snormFit(data.d_pbe)
pbe_mean <- as.double(snd_pbe$par[1])
pbe_sd <- as.double(snd_pbe$par[2])
pbe_xi <- as.double(snd_pbe$par[3])
snd_hse <- snormFit(data.d_hse)
hse_mean <- as.double(snd_hse$par[1])
hse_sd <- as.double(snd_hse$par[2])
hse_xi <- as.double(snd_hse$par[3])
# Plot fitted distribution with histograms
par(mfrow=c(2,1))
options(repr.plot.width = 6, repr.plot.height = 8)
hist(data.d_pbe, breaks = 50, probability = TRUE, border = "orange", col = "white", main = "PBE", xlab = "Errors")
text(1.0, y=2.0, labels=sprintf("location: %.3f", snd_pbe$par[1]))
text(1.0, y=1.6, labels=sprintf("scale: %.3f", snd_pbe$par[2]))
text(1.0, y=1.2, labels=sprintf("skewness: %.3f", snd_pbe$par[3]))
box()
x_pbe <- seq(min(data.d_pbe), max(data.d_pbe), length.out = length(data.d_pbe))
lines(x_pbe, dsnorm(x_pbe, pbe_mean, pbe_sd, pbe_xi), lwd = 2, col="orange")
hist(data.d_hse, breaks = 50, probability = TRUE, border = "blue", col = "white", main = "HSE", xlab = "Errors")
text(1.0, y=2.3, labels=sprintf("location: %.3f", snd_hse$par[1]))
text(1.0, y=1.8, labels=sprintf("scale: %.3f", snd_hse$par[2]))
text(1.0, y=1.3, labels=sprintf("skewness: %.3f", snd_hse$par[3]))
box()
x_hse <- seq(min(data.d_hse), max(data.d_hse), length.out = length(data.d_hse))
lines(x_hse, dsnorm(x_hse, hse_mean, hse_sd, hse_xi), lwd = 2, col="blue")
# Clean up
detach("package:fGarch", unload=TRUE)
rm(pbe_mean, hse_mean, pbe_sd, hse_sd, pbe_xi, hse_xi, snd_pbe, snd_hse, x_pbe, x_hse)