Working with dplyr and ggplot2

dplyr - Data Manipulation Package

Introduction

Data scientists are often said to spend 70 to 80% of their time on data preparation, also known as data wrangling, cleaning, or simply data manipulation. dplyr is one of the most popular R packages for preparing and manipulating a dataset before the actual analysis or modeling. Typical operations include selecting the required columns, adding a new column, filtering the required observations, sorting, and aggregating.

dplyr provides a handful of core functions, including:

select()
filter()
mutate()
arrange()
summarize()

and the %>% (pipe) operator.

Load the required packages

library(dplyr)
library(ggplot2)

Load and examine the dataset

# for illustration, load the diamonds dataset from the ggplot2 package into this session
data(diamonds)
# examine the first 6 observations
head(diamonds)
## # A tibble: 6 x 10
##   carat       cut color clarity depth table price     x     y     z
##   <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
## 2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
## 3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
## 4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
## 5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48
# see the R documentation for the dataset
# ?diamonds
# examine the dimensions, structure and summary of the data
dim(diamonds)
## [1] 53940    10
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
summary(diamonds)
##      carat               cut        color        clarity     
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655  
##                                     J: 2808   (Other): 2531  
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##                                                                  
##        y                z         
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.710   Median : 3.530  
##  Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :58.900   Max.   :31.800  
## 
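
As a complement to str() and summary(), dplyr offers glimpse(), which prints one line per column and adapts to the console width. The sketch below also counts missing values per column with base R; the name naCounts is only illustrative.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# compact, one-line-per-column overview of the data
glimpse(diamonds)

# number of missing values in each column (naCounts is an illustrative name)
naCounts <- colSums(is.na(diamonds))
naCounts
```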

select()

Select only the columns of interest from the dataset.

# select only the variables carat, color and price
select(diamonds, carat, color, price)
## # A tibble: 53,940 x 3
##    carat color price
##    <dbl> <ord> <int>
##  1  0.23     E   326
##  2  0.21     E   326
##  3  0.23     E   327
##  4  0.29     I   334
##  5  0.31     J   335
##  6  0.24     J   336
##  7  0.24     I   336
##  8  0.26     H   337
##  9  0.22     E   337
## 10  0.23     H   338
## # ... with 53,930 more rows
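
select() can also drop columns or pick them by pattern instead of listing every name. A minimal sketch, assuming the same diamonds data; the object names noDims, cCols and firstFour are made up for this example.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# drop the three dimension columns x, y and z
noDims <- select(diamonds, -x, -y, -z)

# keep every column whose name starts with "c"
cCols <- select(diamonds, starts_with("c"))

# keep a contiguous range of columns, from carat through clarity
firstFour <- select(diamonds, carat:clarity)

names(noDims)
names(cCols)
names(firstFour)
```

Helpers such as starts_with() are handy when a dataset has many similarly named columns.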

filter()

filter() subsets the rows of the data based on one or more conditions.

# examine the levels of the cut variable and how often each occurs
table(diamonds$cut)
## 
##      Fair      Good Very Good   Premium     Ideal 
##      1610      4906     12082     13791     21551
# subset (filter) the diamonds dataset to the rows where cut == "Premium"
filter(diamonds, cut=="Premium")
## # A tibble: 13,791 x 10
##    carat     cut color clarity depth table price     x     y     z
##    <dbl>   <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.21 Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
##  2  0.29 Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
##  3  0.22 Premium     F     SI1  60.4    61   342  3.88  3.84  2.33
##  4  0.20 Premium     E     SI2  60.2    62   345  3.79  3.75  2.27
##  5  0.32 Premium     E      I1  60.9    58   345  4.38  4.42  2.68
##  6  0.24 Premium     I     VS1  62.5    57   355  3.97  3.94  2.47
##  7  0.29 Premium     F     SI1  62.4    58   403  4.24  4.26  2.65
##  8  0.22 Premium     E     VS2  61.6    58   404  3.93  3.89  2.41
##  9  0.22 Premium     D     VS2  59.3    62   404  3.91  3.88  2.31
## 10  0.30 Premium     J     SI2  59.3    61   405  4.43  4.38  2.61
## # ... with 13,781 more rows
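
filter() accepts several conditions at once; conditions separated by commas are combined with a logical AND. The sketch below also uses %in% to match against a set of values. The price threshold of 400 and the object names are arbitrary choices for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Premium diamonds that cost less than 400 (both conditions must hold)
premiumCheap <- filter(diamonds, cut == "Premium", price < 400)

# diamonds whose cut is either Premium or Ideal
bestCuts <- filter(diamonds, cut %in% c("Premium", "Ideal"))

head(premiumCheap)
nrow(bestCuts)
```

The row count of bestCuts should equal the Premium and Ideal counts from the table() output above added together.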

mutate()

mutate() adds new variables to the dataset, usually computed from existing ones.

diamondsNew <- mutate(diamonds, pricePerCarat = price / carat)
# check that the new variable has been added
head(diamondsNew)
## # A tibble: 6 x 11
##   carat       cut color clarity depth table price     x     y     z
##   <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
## 2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
## 3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
## 4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
## 5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48
## # ... with 1 more variables: pricePerCarat <dbl>
names(diamondsNew)
##  [1] "carat"         "cut"           "color"         "clarity"      
##  [5] "depth"         "table"         "price"         "x"            
##  [9] "y"             "z"             "pricePerCarat"
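
A single mutate() call can create several variables, and later ones may refer to variables created earlier in the same call. A small sketch under assumptions: volume is a rough bounding-box figure built from the x, y and z dimensions, and the 5000 cutoff for expensive is arbitrary; both names are hypothetical.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

diamondsVol <- mutate(
  diamonds,
  volume = x * y * z,               # rough bounding-box volume (illustrative)
  pricePerCarat = price / carat,
  expensive = pricePerCarat > 5000  # uses the column created just above
)

head(select(diamondsVol, carat, price, volume, pricePerCarat, expensive))
```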

arrange()

arrange() sorts the rows of the data by one or more variables.

# first, look at the first 6 values of depth in the original row order
head(diamonds$depth)
## [1] 61.5 59.8 56.9 62.4 63.3 62.8
# then sort the whole dataset by depth with arrange() and look at the first 6 rows
head(arrange(diamonds,depth))
## # A tibble: 6 x 10
##   carat   cut color clarity depth table price     x     y     z
##   <dbl> <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  1.00  Fair     G     SI1  43.0    59  3634  6.32  6.27  3.97
## 2  1.09 Ideal     J     VS2  43.0    54  4778  6.53  6.55  4.12
## 3  1.00  Fair     G     VS2  44.0    53  4032  6.31  6.24  4.12
## 4  1.43  Fair     I     VS1  50.8    60  6727  7.73  7.25  3.93
## 5  0.30  Fair     E    VVS2  51.0    67   945  4.67  4.62  2.37
## 6  0.70  Fair     D     SI1  52.2    65  1895  6.04  5.99  3.14
# the output above is sorted in ascending order, which is the default
# wrap the variable in desc() inside arrange() to sort in descending order
head(arrange(diamonds,desc(depth)))
## # A tibble: 6 x 10
##   carat   cut color clarity depth table price     x     y     z
##   <dbl> <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.50  Fair     E     VS2  79.0    73  2579  5.21  5.18  4.09
## 2  0.50  Fair     E     VS2  79.0    73  2579  5.21  5.18  4.09
## 3  1.03  Fair     E      I1  78.2    54  1262  5.72  5.59  4.42
## 4  0.99  Fair     J      I1  73.6    60  1789  6.01  5.80  4.35
## 5  0.90  Fair     G     SI1  72.9    54  2691  5.74  5.67  4.16
## 6  0.96  Fair     G     SI2  72.2    56  2438  6.01  5.81  4.28
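
arrange() can sort by more than one variable; later variables break ties within earlier ones. In the sketch below, rows are ordered by cut (following the factor's level order, Fair first) and then by price from high to low within each cut. The name sorted is illustrative.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# sort by cut level, then by descending price within each cut
sorted <- arrange(diamonds, cut, desc(price))

head(sorted)
```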

summarize()

summarize() computes summary statistics of the data. It is most powerful when combined with group_by(), which produces one summary row per group.

# get the average of the price variable
summarize(diamonds, avgPrice = mean(price, na.rm = TRUE) )
## # A tibble: 1 x 1
##   avgPrice
##      <dbl>
## 1   3932.8
# combine with group_by() to get the mean price per cut
# summarise() and summarize() are synonyms, so both calls below return the same result
summarise(group_by(diamonds, cut), mean=mean(price, na.rm = TRUE))
## # A tibble: 5 x 2
##         cut     mean
##       <ord>    <dbl>
## 1      Fair 4358.758
## 2      Good 3928.864
## 3 Very Good 3981.760
## 4   Premium 4584.258
## 5     Ideal 3457.542
summarize(group_by(diamonds, cut), mean=mean(price, na.rm = TRUE))
## # A tibble: 5 x 2
##         cut     mean
##       <ord>    <dbl>
## 1      Fair 4358.758
## 2      Good 3928.864
## 3 Very Good 3981.760
## 4   Premium 4584.258
## 5     Ideal 3457.542
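
Several statistics can be computed per group in one call. A minimal sketch, using n() to count the rows in each group; the column names in cutStats are chosen only for this example.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

cutStats <- summarise(
  group_by(diamonds, cut),
  n = n(),                      # number of diamonds in each cut
  medianPrice = median(price),
  maxPrice = max(price)
)

cutStats
```

The n column should match the counts from table(diamonds$cut) shown earlier.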

%>% operator

The real power of the dplyr package lies in the pipe operator (%>%). It is very useful whenever we need a chain of operations, where the result of one command is used as the input to the next.

# now combine the filter(), select() and mutate() calls from above into one chain with %>%
# filter(diamonds, cut == "Premium")
# select(diamonds, carat, color, price)
# diamondsNew <- mutate(diamonds, pricePerCarat = price / carat)

diamondsPipe <- diamonds %>%
  filter(cut == "Premium") %>%
  select(carat, color, price) %>%
  mutate(pricePerCarat = price / carat)
head(diamondsPipe)
## # A tibble: 6 x 4
##   carat color price pricePerCarat
##   <dbl> <ord> <int>         <dbl>
## 1  0.21     E   326      1552.381
## 2  0.29     I   334      1151.724
## 3  0.22     F   342      1554.545
## 4  0.20     E   345      1725.000
## 5  0.32     E   345      1078.125
## 6  0.24     I   355      1479.167
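
The pipe works just as well with grouping and summarising. The sketch below chains mutate(), group_by(), summarise() and arrange() to rank the cuts by mean price per carat; the object and column names are illustrative.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

pricePerCaratByCut <- diamonds %>%
  mutate(pricePerCarat = price / carat) %>%
  group_by(cut) %>%
  summarise(meanPricePerCarat = mean(pricePerCarat)) %>%
  arrange(desc(meanPricePerCarat))

pricePerCaratByCut
```

Written without the pipe, the same steps would be four nested function calls read from the inside out.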

ggplot2 - Data Visualization Package

R is one of the most powerful languages for visualization, and it needs only a few lines of code. ggplot2 is a package that helps analysts visualize data, from simple plots to advanced visualizations. A dplyr pipeline can feed directly into ggplot(); note that ggplot2 layers are added with +, not with %>%.

diamonds %>%
  filter(cut == "Ideal") %>%
  ggplot(aes(x = color, y = price)) +
  geom_boxplot()
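
Summarised data can be plotted the same way. As a rough sketch, the code below computes the mean price per cut and draws it as a bar chart with geom_col(); the names meanByCut and p, and the axis labels, are choices made for this example.

```r
library(dplyr)
library(ggplot2)

# mean price for each cut
meanByCut <- diamonds %>%
  group_by(cut) %>%
  summarise(meanPrice = mean(price))

# one bar per cut, with bar height equal to the mean price
p <- ggplot(meanByCut, aes(x = cut, y = meanPrice)) +
  geom_col() +
  labs(x = "Cut", y = "Mean price", title = "Mean diamond price by cut")

print(p)
```

Storing the summary in its own object first makes it easy to inspect the numbers before plotting them.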
