Importing Techniques Used in Spatial Data Analysis

In this post, we are going to take a look at various data-importing techniques used in spatial data analysis.

Importing Data from Tables (read.table)

  • Opening data
  • Importing csv files
  • Checking the data structure for consistency

Accessing and importing open-access environmental data is a crucial skill for data scientists. This section teaches you how to download data from the Web, import it into R, and check it for consistency.

In this section, we are going to take a look at…

  • Download open-access data from the USGS website
  • Import it in R using read.table
  • Check its structure to start exploring the data
#Set the URL pointing to the CSV file
URL <- "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.csv"
#Load the CSV File
Data <- read.table(file=URL, 
                   sep=",", 
                   header=TRUE, 
                   na.strings="")
#Help function
help(read.table)
#Examining the data
str(Data)
## 'data.frame':    157 obs. of  22 variables:
##  $ time           : Factor w/ 157 levels "2017-12-23T13:08:19.896Z",..: 157 156 155 154 153 152 151 150 149 148 ...
##  $ latitude       : num  33.7 59.5 65.1 33.5 10.7 ...
##  $ longitude      : num  -116.7 -151.4 -148.7 -116.7 -86.2 ...
##  $ depth          : num  15.59 34.6 12.8 2.66 48.79 ...
##  $ mag            : num  0.57 2.6 1.7 0.14 4.4 0.38 0.7 2 1.1 1.03 ...
##  $ magType        : Factor w/ 5 levels "mb","md","ml",..: 3 3 3 3 1 3 3 3 3 3 ...
##  $ nst            : int  18 NA NA 10 NA 19 9 NA 45 36 ...
##  $ gap            : num  92 NA NA 98 168 ...
##  $ dmin           : num  0.0545 NA NA 0.0392 0.581 ...
##  $ rms            : num  0.11 0.6 0.53 0.07 0.97 0.12 NA 0.09 0.17 0.14 ...
##  $ net            : Factor w/ 9 levels "ak","ci","hv",..: 2 1 1 2 8 2 6 1 2 2 ...
##  $ id             : Factor w/ 157 levels "ak17647953","ak17647954",..: 98 40 39 97 156 96 136 38 95 94 ...
##  $ updated        : Factor w/ 157 levels "2017-12-23T13:11:44.641Z",..: 157 156 155 154 152 150 149 151 147 145 ...
##  $ place          : Factor w/ 131 levels "0km ESE of Pahala, Hawaii",..: 126 37 69 87 91 124 128 55 118 17 ...
##  $ type           : Factor w/ 2 levels "earthquake","quarry blast": 1 1 1 1 1 1 1 1 1 1 ...
##  $ horizontalError: num  0.3 NA NA 0.21 4.1 0.22 NA NA 0.19 0.25 ...
##  $ depthError     : num  0.44 0.2 0.2 0.19 11.3 0.43 NA 2.6 0.42 0.96 ...
##  $ magError       : num  0.132 NA NA 0.074 0.082 0.098 NA NA 0.151 0.204 ...
##  $ magNst         : int  18 NA NA 6 43 10 NA NA 26 22 ...
##  $ status         : Factor w/ 2 levels "automatic","reviewed": 1 1 1 1 2 1 1 1 1 1 ...
##  $ locationSource : Factor w/ 9 levels "ak","ci","hv",..: 2 1 1 2 8 2 6 1 2 2 ...
##  $ magSource      : Factor w/ 9 levels "ak","ci","hv",..: 2 1 1 2 8 2 6 1 2 2 ...
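Once the structure looks consistent, a natural next step is to parse the time column (imported as a factor) and start summarising the data. Below is a minimal sketch, assuming the Data object imported above with the USGS column names; the stand-in data frame at the top is only there so the snippet also runs on its own.

```r
# If the USGS Data object from the code above is not in the workspace,
# create a tiny stand-in so this sketch runs on its own
if(!exists("Data")){
  Data <- data.frame(time=c("2017-12-23T13:08:19.896Z","2017-12-23T12:00:00.000Z"),
                     latitude=c(33.7,59.5), longitude=c(-116.7,-151.4),
                     mag=c(2.6,0.5), place=c("A","B"))
}

# Parse the time column into proper date-times (%OS handles fractional seconds)
Data$time <- as.POSIXct(as.character(Data$time),
                        format="%Y-%m-%dT%H:%M:%OS", tz="UTC")

# Quick numerical summary of the magnitudes
summary(Data$mag)

# Events above magnitude 2, strongest first
Strong <- Data[!is.na(Data$mag) & Data$mag > 2,
               c("time","latitude","longitude","mag","place")]
Strong[order(Strong$mag, decreasing=TRUE), ]
```

With the time column as POSIXct, you can also subset by date ranges or compute time differences between events directly.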

Downloading Open Data from FTP Sites

Oftentimes, datasets are provided for free on FTP websites, and practitioners need to be able to access them. R is perfectly capable of downloading and importing data from FTP sites.

In this section, we are going to take a look at…

  • Understand the basics of downloading data in R
  • Download the data with the download.file function
  • Learn how to handle compressed formats
#Load required packages
library(RCurl)
## Loading required package: bitops
library(XML)
#Create a list with all the files on the FTP site
list <- getURL("ftp://ftp.ncdc.noaa.gov/pub/data/gsod/2016/", 
               dirlistonly = TRUE) 
#Clean the list 
FileList <- strsplit(list, split="\r\n")
#Create a new directory where to download these files
DIR <- paste(getwd(),"/NOAAFiles",sep="")
dir.create(DIR)
## Warning in dir.create(DIR): 'E:\Projects\sumendar.github.io\content\post
## \NOAAFiles' already exists
#Loop to download the files
for(FileName in unlist(FileList)){
  URL <- paste0("ftp://ftp.ncdc.noaa.gov/pub/data/gsod/2016/",FileName)
  download.file(URL, destfile=paste0(DIR,"/",FileName), method="auto", 
                mode="wb")
}
#A more elegant way
DownloadFile <- function(x){
  URL <- paste0("ftp://ftp.ncdc.noaa.gov/pub/data/gsod/2016/",x)
  download.file(URL, destfile=paste0(DIR,"/",x), method="auto", mode="wb")
}
lapply(unlist(FileList)[1:5], DownloadFile)
#Download a compressed file
URL <- "ftp://ftp.ncdc.noaa.gov/pub/data/gsod/2015/gsod_2015.tar"
download.file(URL, destfile=paste0(DIR,"/gsod_2015.tar"),
              method="auto",mode="wb")

untar(paste0(getwd(),"/NOAAFiles/","gsod_2015.tar"), 
      exdir=paste0(getwd(),"/NOAAFiles"))
help(untar)
#For more information on the full experiment please visit:
#http://r-video-tutorial.blogspot.ch/2014/12/accessing-cleaning-and-plotting-noaa.html
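Downloading dozens of files over FTP can fail midway through. A slightly more defensive version of the helper above skips files that already exist and traps download errors instead of stopping the whole loop. This is a sketch: SafeDownload is a hypothetical helper name, and DIR and FileList are assumed to be the directory and file list created above.

```r
# Fall back to a temporary directory if DIR from above is not defined
if(!exists("DIR")) DIR <- tempdir()

# Hypothetical defensive downloader: skip existing files, trap failures
SafeDownload <- function(x){
  URL  <- paste0("ftp://ftp.ncdc.noaa.gov/pub/data/gsod/2016/", x)
  dest <- paste0(DIR, "/", x)
  if(file.exists(dest)) return(invisible("skipped"))   # do not re-download
  tryCatch({
    download.file(URL, destfile=dest, method="auto", mode="wb")
    invisible("ok")
  }, error=function(e){
    message("Failed: ", x, " (", conditionMessage(e), ")")
    invisible("failed")
  })
}

# Run it over the cleaned file list, if one exists in the workspace
if(exists("FileList")) lapply(unlist(FileList), SafeDownload)
```

Because failures are only reported with message(), an interrupted session can simply be re-run and will pick up where it left off.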

Importing with readLines (The Last Resort)

Some data cannot be opened with either read.table or read.fwf. In these desperate cases, readLines can help: it imports each line of the file as a plain character string, which we can then parse ourselves.

#Download the data from the FTP site
URL <- "ftp://ftp.ncdc.noaa.gov/pub/data/noaa/2015/010231-99999-2015.gz"
FileName <- "010231-99999-2015.gz"
download.file(URL, destfile=paste0(getwd(),"/",FileName), method="auto", mode="wb")
data.strings <- readLines(gzfile(FileName, open="rt"))
## Warning in readLines(gzfile(FileName, open = "rt")): seek on a gzfile
## connection returned an internal error
head(data.strings)
## [1] "0071010231999992015010100204+64350+007800FM-15+000099999V0202201N021119999999N999999999+00801+00701999999ADDMA1100401999999REMMET044METAR ENDR 010020Z AUTO 22041KT 08/07 Q1004="
## [2] "0071010231999992015010100504+64350+007800FM-15+000099999V0202201N020619999999N999999999+00801+00701999999ADDMA1100301999999REMMET044METAR ENDR 010050Z AUTO 22040KT 08/07 Q1003="
## [3] "0071010231999992015010101204+64350+007800FM-15+000099999V0202201N020619999999N999999999+00801+00701999999ADDMA1100301999999REMMET044METAR ENDR 010120Z AUTO 22040KT 08/07 Q1003="
## [4] "0071010231999992015010101504+64350+007800FM-15+000099999V0202201N019019999999N999999999+00801+00701999999ADDMA1100301999999REMMET044METAR ENDR 010150Z AUTO 22037KT 08/07 Q1003="
## [5] "0071010231999992015010102204+64350+007800FM-15+000099999V0202201N017519999999N999999999+00801+00801999999ADDMA1100301999999REMMET044METAR ENDR 010220Z AUTO 22034KT 08/08 Q1003="
## [6] "0071010231999992015010102504+64350+007800FM-15+000099999V0202201N017519999999N999999999+00801+00701999999ADDMA1100201999999REMMET044METAR ENDR 010250Z AUTO 22034KT 08/07 Q1002="

A function to extract the latitude

Ext.Latitude <- function(x){
  substr(x, start=29, stop=34)
}

A function to extract the longitude

Ext.Longitude <- function(x){
  substr(x, start=35, stop=41)
}

A function to extract the temperature

Ext.Temp <- function(x){
  substr(x, start=88, stop=92)
}

Applying the functions with lapply

LAT <- lapply(data.strings, Ext.Latitude)
LON <- lapply(data.strings, Ext.Longitude)
TEMP <- lapply(data.strings, Ext.Temp)

Create a data.frame we can use for data analysis

DATA <- data.frame(Latitude=as.numeric(unlist(LAT))/1000,
                   Longitude=as.numeric(unlist(LON))/1000,
                   Temperature=as.numeric(unlist(TEMP))/10)

Finally, replace the 999.9 no-data code with NA

DATA[DATA$Temperature==999.9,"Temperature"] <- NA

str(DATA)
## 'data.frame':    17291 obs. of  3 variables:
##  $ Latitude   : num  64.3 64.3 64.3 64.3 64.3 ...
##  $ Longitude  : num  7.8 7.8 7.8 7.8 7.8 7.8 7.8 7.8 7.8 7.8 ...
##  $ Temperature: num  8 8 8 8 8 8 8 8 8 8 ...
hist(DATA$Temperature, main="Temperature")
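Since this is spatial data, the coordinates themselves deserve a look too. Below is a minimal sketch using only base graphics to plot the observations in space, coloured by temperature, assuming the DATA object built above; the simulated stations at the top are only a stand-in so the snippet runs on its own.

```r
# If DATA from above is not available, simulate a few stations so the
# sketch runs on its own
if(!exists("DATA")){
  set.seed(1)
  DATA <- data.frame(Latitude=runif(50, 35, 70),
                     Longitude=runif(50, -10, 30),
                     Temperature=runif(50, -15, 25))
}

# Map-style scatterplot: colour the points by binned temperature
Valid <- DATA[!is.na(DATA$Temperature), ]
Cols  <- rev(heat.colors(10))[cut(Valid$Temperature, breaks=10)]
plot(Valid$Longitude, Valid$Latitude, col=Cols, pch=16,
     xlab="Longitude", ylab="Latitude",
     main="Temperature by location")
```

For the single-station NOAA file used above, all points coincide at one location, but the same code gives a simple temperature map as soon as several stations are combined.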
