Practise: ggplot2, dplyr, t test, Linear Regression
In this notebook we will investigate through a dataset named gapminder
You need to install gapminder, ggplot2, dplyr
First include the library gapminder and let’s see the summary
we see that it has columns like country, continent, year, life expentency, Total population and GDP/capita
library(gapminder)
## Warning: package 'gapminder' was built under R version 4.0.3
data("gapminder")
summary(gapminder)
##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
## 
After calling this attach fucntion, we dont need to mention dataset again and again, suppose, instead of gapminder$column we can use only column
We will look the histogram of population, its looked messy
we go for a log transformation. whats a log transformation, when to appy, we might discuss later
we also look at the density of the graph it looks like a normal distribution
attach(gapminder)
hist(pop)

hist(log(pop))

d <- density(log(pop))
plot(d)

Let’s see boxplot for different continent life expectancy, WE see africa has the lowest life expectancy and Ocenia and Europe has the heighest
But in europe we have lots of outliers/exceptions
boxplot(lifeExp~ continent)

###### Lets see gdp vs life expectency scatter plot, Its looks not perfect ###### Let go for log transformation, Its looks better?

And we also see a positive relation between gdp and life expectancy, More money, more lifetime?
plot(lifeExp ~ gdpPercap)

plot(lifeExp ~ log(gdpPercap))

Let’s see this numerically, correlation between lifeExp ~ log(gdpPercap))
cor.test(lifeExp, log(gdpPercap))
## 
##  Pearson's product-moment correlation
## 
## data:  lifeExp and log(gdpPercap)
## t = 56.5, df = 1702, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7904458 0.8235215
## sample estimates:
##       cor 
## 0.8076179
We see that correlation is .80, thats a strong correlation.
We will go for T test, to see if that things significant, And after that a linear regression
BUT, before that lets do some dplyr :)
dplyr used pipe %>% (shortcut, ctrl+shift+m) function, pipe will attach previous condition/functionality to the next line
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
gapminder %>%
  select(country, lifeExp)%>%
  # Select used for selection of only this columns e:g counry, lifeEXp
  filter(country == "Ireland" | country == "Bangladesh") %>% 
  # Then we filter out our country of interst
  group_by(country) %>% 
  # We group by them by country so that we can see the result countrywise
  summarise(avg_life = (mean(lifeExp)))
## `summarise()` ungrouping output (override with `.groups` argument)
  #summarise function to seee the summary output
Now we will save this things to new dataframe for later investigation/visualization
#library(dplyr)
df1<- gapminder %>%
  select(country, lifeExp)%>%
  # Select used for selection of only this columns e:g counry, lifeEXp
  filter(country == "Ireland" | country == "Bangladesh") %>% 
  # Then we filter out our country of interst
  group_by(country)
Lets go for a t test to see if the difference of the mean significant
Our null hypothesis is there is no change, Alternative is there is a change for mean value of these two country
t.test(data = df1, lifeExp~country)
## 
##  Welch Two Sample t-test
## 
## data:  lifeExp by country
## t = -8.2567, df = 14.445, p-value = 7.6e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -29.18792 -17.17842
## sample estimates:
## mean in group Bangladesh    mean in group Ireland 
##                 49.83408                 73.01725
Look at the p value, its too low
If t is low(less than alpha which is .05), the null must go, There is no place for null hyp. here, so Alternative wins
There is a difference between two countrys mean
Look at the output again, we are 95 percent confident that the change is between 29.18 to 17.17

Let’s have some plotting

library(ggplot2)
Lots of things here,
first ggplot, here data = gapminder, aesthetics, for x axis it is gdp/capita
for y axis its lifeExpectancy
Here col = year, that is design our plot with different color for different year
shape = continent, that is different shape for different continent
size = population, that larger bubbule/point for larger population
alpha = 0.7, that is transparancy, chage it between 0 to 1 to see the difference
geom_smooth, for a smooth line following the slope for different continent, method = lm, to make
the line a straight line, remove method = lm, now see the line is not straight anymore
and facet is used for separating a single graph to different graph
that is, facet_wrap, continent, make 5 graph for different continet
Remove facet part, now see all the graph joined together in a single graph
ggplot(gapminder, aes(log(gdpPercap),lifeExp, col = year, shape=continent,  size = pop))+
  geom_point(alpha= 0.7)+geom_smooth(method = lm)+facet_wrap(~continent)
## `geom_smooth()` using formula 'y ~ x'

Most wanted Linear regression
We waited for it, as strong positive correlation, and change are significant
lm(lifeExp~gdpPercap)
## 
## Call:
## lm(formula = lifeExp ~ gdpPercap)
## 
## Coefficients:
## (Intercept)    gdpPercap  
##   5.396e+01    7.649e-04
Linear regression, Y = mx +b (here, y = estimeted_output, m = slope, x = our input, b = coefficient/constant )
here m = 7.649e-04, and b = 5.396e+01
summary(lm(lifeExp~gdpPercap))
## 
## Call:
## lm(formula = lifeExp ~ gdpPercap)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -82.754  -7.758   2.176   8.225  18.426 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.396e+01  3.150e-01  171.29   <2e-16 ***
## gdpPercap   7.649e-04  2.579e-05   29.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.49 on 1702 degrees of freedom
## Multiple R-squared:  0.3407, Adjusted R-squared:  0.3403 
## F-statistic: 879.6 on 1 and 1702 DF,  p-value: < 2.2e-16
here Multiple R-squared is 0.34, that is not a good model mayb
It shows, our model can explain 34 percent of data only.
Lets go for a log trasformation
summary(lm(lifeExp~log(gdpPercap)))
## 
## Call:
## lm(formula = lifeExp ~ log(gdpPercap))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.778  -4.204   1.212   4.658  19.285 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -9.1009     1.2277  -7.413 1.93e-13 ***
## log(gdpPercap)   8.4051     0.1488  56.500  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.62 on 1702 degrees of freedom
## Multiple R-squared:  0.6522, Adjusted R-squared:  0.652 
## F-statistic:  3192 on 1 and 1702 DF,  p-value: < 2.2e-16
Looks better, right? R squared increased to 65 percent :)

You can download the Rmd file from here