Practise: ggplot2, dplyr, t test, Linear Regression

In this notebook we will investigate through a dataset named gapminder

You need to install gapminder, ggplot2, dplyr

First include the library gapminder and let’s see the summary

we see that it has columns like country, continent, year, life expentency, Total population and GDP/capita

library(gapminder)

## Warning: package 'gapminder' was built under R version 4.0.3

data("gapminder")
summary(gapminder)

##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
##

After calling this attach fucntion, we dont need to mention dataset again and again, suppose, instead of gapminder$column we can use only column

We will look the histogram of population, its looked messy

we go for a log transformation. whats a log transformation, when to appy, we might discuss later

we also look at the density of the graph it looks like a normal distribution

attach(gapminder)
hist(pop)

hist(log(pop))

d <- density(log(pop))
plot(d)

Let’s see boxplot for different continent life expectancy, WE see africa has the lowest life expectancy and Ocenia and Europe has the heighest

But in europe we have lots of outliers/exceptions

boxplot(lifeExp~ continent)

###### Lets see gdp vs life expectency scatter plot, Its looks not perfect ###### Let go for log transformation, Its looks better?

And we also see a positive relation between gdp and life expectancy, More money, more lifetime?

plot(lifeExp ~ gdpPercap)

plot(lifeExp ~ log(gdpPercap))

Let’s see this numerically, correlation between lifeExp ~ log(gdpPercap))

cor.test(lifeExp, log(gdpPercap))

## 
##  Pearson's product-moment correlation
## 
## data:  lifeExp and log(gdpPercap)
## t = 56.5, df = 1702, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7904458 0.8235215
## sample estimates:
##       cor 
## 0.8076179

We see that correlation is .80, thats a strong correlation.

We will go for T test, to see if that things significant, And after that a linear regression

BUT, before that lets do some dplyr :)

dplyr used pipe %>% (shortcut, ctrl+shift+m) function, pipe will attach previous condition/functionality to the next line

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

gapminder %>%
  select(country, lifeExp)%>%
  # Select used for selection of only this columns e:g counry, lifeEXp
  filter(country == "Ireland" | country == "Bangladesh") %>% 
  # Then we filter out our country of interst
  group_by(country) %>% 
  # We group by them by country so that we can see the result countrywise
  summarise(avg_life = (mean(lifeExp)))

## `summarise()` ungrouping output (override with `.groups` argument)

  #summarise function to seee the summary output

Now we will save this things to new dataframe for later investigation/visualization

#library(dplyr)
df1<- gapminder %>%
  select(country, lifeExp)%>%
  # Select used for selection of only this columns e:g counry, lifeEXp
  filter(country == "Ireland" | country == "Bangladesh") %>% 
  # Then we filter out our country of interst
  group_by(country)

Lets go for a t test to see if the difference of the mean significant

Our null hypothesis is there is no change, Alternative is there is a change for mean value of these two country

t.test(data = df1, lifeExp~country)

## 
##  Welch Two Sample t-test
## 
## data:  lifeExp by country
## t = -8.2567, df = 14.445, p-value = 7.6e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -29.18792 -17.17842
## sample estimates:
## mean in group Bangladesh    mean in group Ireland 
##                 49.83408                 73.01725

Look at the p value, its too low

If t is low(less than alpha which is .05), the null must go, There is no place for null hyp. here, so Alternative wins

There is a difference between two countrys mean

Look at the output again, we are 95 percent confident that the change is between 29.18 to 17.17

Let’s have some plotting

library(ggplot2)

Lots of things here,

first ggplot, here data = gapminder, aesthetics, for x axis it is gdp/capita

for y axis its lifeExpectancy

Here col = year, that is design our plot with different color for different year

shape = continent, that is different shape for different continent

size = population, that larger bubbule/point for larger population

alpha = 0.7, that is transparancy, chage it between 0 to 1 to see the difference

geom_smooth, for a smooth line following the slope for different continent, method = lm, to make

the line a straight line, remove method = lm, now see the line is not straight anymore

Remove facet part, now see all the graph joined together in a single graph

ggplot(gapminder, aes(log(gdpPercap),lifeExp, col = year, shape=continent,  size = pop))+
  geom_point(alpha= 0.7)+geom_smooth(method = lm)+facet_wrap(~continent)

## `geom_smooth()` using formula 'y ~ x'

Most wanted Linear regression

We waited for it, as strong positive correlation, and change are significant

lm(lifeExp~gdpPercap)

## 
## Call:
## lm(formula = lifeExp ~ gdpPercap)
## 
## Coefficients:
## (Intercept)    gdpPercap  
##   5.396e+01    7.649e-04

Linear regression, Y = mx +b (here, y = estimeted_output, m = slope, x = our input, b = coefficient/constant )

here m = 7.649e-04, and b = 5.396e+01

summary(lm(lifeExp~gdpPercap))

## 
## Call:
## lm(formula = lifeExp ~ gdpPercap)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -82.754  -7.758   2.176   8.225  18.426 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.396e+01  3.150e-01  171.29   <2e-16 ***
## gdpPercap   7.649e-04  2.579e-05   29.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.49 on 1702 degrees of freedom
## Multiple R-squared:  0.3407, Adjusted R-squared:  0.3403 
## F-statistic: 879.6 on 1 and 1702 DF,  p-value: < 2.2e-16

here Multiple R-squared is 0.34, that is not a good model mayb

It shows, our model can explain 34 percent of data only.

Lets go for a log trasformation

summary(lm(lifeExp~log(gdpPercap)))

## 
## Call:
## lm(formula = lifeExp ~ log(gdpPercap))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.778  -4.204   1.212   4.658  19.285 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -9.1009     1.2277  -7.413 1.93e-13 ***
## log(gdpPercap)   8.4051     0.1488  56.500  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.62 on 1702 degrees of freedom
## Multiple R-squared:  0.6522, Adjusted R-squared:  0.652 
## F-statistic:  3192 on 1 and 1702 DF,  p-value: < 2.2e-16

Looks better, right? R squared increased to 65 percent :)

You can download the Rmd file from here

Abrar Fahim

Practise: ggplot2, dplyr, t test, Linear Regression

Practise: ggplot2, dplyr, t test, Linear Regression

In this notebook we will investigate through a dataset named gapminder

You need to install gapminder, ggplot2, dplyr

First include the library gapminder and let’s see the summary

we see that it has columns like country, continent, year, life expentency, Total population and GDP/capita

After calling this attach fucntion, we dont need to mention dataset again and again, suppose, instead of gapminder$column we can use only column

We will look the histogram of population, its looked messy

we go for a log transformation. whats a log transformation, when to appy, we might discuss later

we also look at the density of the graph it looks like a normal distribution

Let’s see boxplot for different continent life expectancy, WE see africa has the lowest life expectancy and Ocenia and Europe has the heighest

But in europe we have lots of outliers/exceptions

And we also see a positive relation between gdp and life expectancy, More money, more lifetime?

Let’s see this numerically, correlation between lifeExp ~ log(gdpPercap))

We see that correlation is .80, thats a strong correlation.

We will go for T test, to see if that things significant, And after that a linear regression

BUT, before that lets do some dplyr :)

dplyr used pipe %>% (shortcut, ctrl+shift+m) function, pipe will attach previous condition/functionality to the next line

Now we will save this things to new dataframe for later investigation/visualization

Lets go for a t test to see if the difference of the mean significant

Our null hypothesis is there is no change, Alternative is there is a change for mean value of these two country

Look at the p value, its too low

If t is low(less than alpha which is .05), the null must go, There is no place for null hyp. here, so Alternative wins

There is a difference between two countrys mean

Look at the output again, we are 95 percent confident that the change is between 29.18 to 17.17

Let’s have some plotting

Lots of things here,

first ggplot, here data = gapminder, aesthetics, for x axis it is gdp/capita

for y axis its lifeExpectancy

Here col = year, that is design our plot with different color for different year

shape = continent, that is different shape for different continent

size = population, that larger bubbule/point for larger population

alpha = 0.7, that is transparancy, chage it between 0 to 1 to see the difference

geom_smooth, for a smooth line following the slope for different continent, method = lm, to make

the line a straight line, remove method = lm, now see the line is not straight anymore

and facet is used for separating a single graph to different graph

that is, facet_wrap, continent, make 5 graph for different continet

Remove facet part, now see all the graph joined together in a single graph

Most wanted Linear regression

We waited for it, as strong positive correlation, and change are significant

Linear regression, Y = mx +b (here, y = estimeted_output, m = slope, x = our input, b = coefficient/constant )

here m = 7.649e-04, and b = 5.396e+01

here Multiple R-squared is 0.34, that is not a good model mayb

It shows, our model can explain 34 percent of data only.

Lets go for a log trasformation

Looks better, right? R squared increased to 65 percent :)

You may like these posts

Post a Comment

0 Comments

Find Me

Footer Menu Widget