Importing Techniques Used in Spatial Data Analysis

In this post, we are going to take a look at various data-importing techniques used for spatial data analysis.

Importing Data from Tables (read.table)

  • Opening data
  • Importing csv files
  • Checking the data structure for consistency

Accessing and importing open-access environmental data is a crucial skill for data scientists. This section teaches you how to download data from the Web, import it into R, and check it for consistency.

In this section, we are going to take a look at…

  • Downloading open-access data from the USGS website
  • Importing it into R using read.table
  • Checking its structure to start exploring the data
#Set the URL with the CSV Files
URL <- "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.csv"
#Load the CSV File
Data <- read.table(file=URL, 
                   sep=",", 
                   header=TRUE, 
                   na.strings="")
#Help function
help(read.table)
#Examining the data
str(Data)
## 'data.frame':    153 obs. of  22 variables:
##  $ time           : Factor w/ 153 levels "2018-01-21T19:05:13.020Z",..: 153 152 151 150 149 148 147 146 145 144 ...
##  $ latitude       : num  38.8 63.5 33.9 61.4 33.5 ...
##  $ longitude      : num  -123 -149 -117 -150 -117 ...
##  $ depth          : num  3.38 7.4 0.55 53.3 4.83 5.33 0.8 5.1 2.48 11.8 ...
##  $ mag            : num  0.36 1.4 1.8 1 0.91 0.53 1 2.1 1.19 1.7 ...
##  $ magType        : Factor w/ 4 levels "mb","mb_lg","md",..: 3 4 4 4 4 4 4 4 3 4 ...
##  $ nst            : int  8 NA 35 NA 35 20 NA NA 12 NA ...
##  $ gap            : num  108 NA 74 NA 36 78 NA NA 223 NA ...
##  $ dmin           : num  0.00611 NA 0.05287 NA 0.07425 ...
##  $ rms            : num  0.03 0.42 0.2 0.37 0.15 0.13 0.49 0.8 0.03 0.67 ...
##  $ net            : Factor w/ 12 levels "ak","ci","hv",..: 5 1 2 1 2 2 1 1 5 1 ...
##  $ id             : Factor w/ 153 levels "ak18153221","ak18153236",..: 107 31 71 29 70 69 28 27 106 26 ...
##  $ updated        : Factor w/ 153 levels "2018-01-21T19:24:44.040Z",..: 153 147 138 149 130 131 126 122 151 120 ...
##  $ place          : Factor w/ 133 levels "103km WSW of Healy, Alaska",..: 7 23 89 131 6 128 1 98 118 71 ...
##  $ type           : Factor w/ 3 levels "earthquake","explosion",..: 1 1 3 1 1 1 1 1 1 1 ...
##  $ horizontalError: num  0.97 NA 0.28 NA 0.17 0.24 NA NA 0.74 NA ...
##  $ depthError     : num  1.94 0.4 31.61 0.5 1.05 ...
##  $ magError       : num  NA NA 0.137 NA 0.099 0.068 NA NA 0.16 NA ...
##  $ magNst         : int  1 NA 28 NA 25 11 NA NA 11 NA ...
##  $ status         : Factor w/ 2 levels "automatic","reviewed": 1 1 1 2 2 2 1 1 1 1 ...
##  $ locationSource : Factor w/ 13 levels "ak","ci","guc",..: 6 1 2 1 2 2 1 1 6 1 ...
##  $ magSource      : Factor w/ 12 levels "ak","ci","hv",..: 5 1 2 1 2 2 1 1 5 1 ...
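
The str() output above shows that time was imported as a factor rather than a date-time. A minimal consistency check might convert it and verify that the coordinates fall in valid ranges. The sketch below uses a small made-up data frame with the same column names, since the real Data object requires the download above:

```r
#Sketch of a consistency check, on a toy data frame shaped like Data above
Data <- data.frame(time = c("2018-01-21T19:05:13.020Z", "2018-01-21T18:59:59.000Z"),
                   latitude = c(38.8, 63.5),
                   longitude = c(-123, -149))
#Convert the time column from text/factor to a proper date-time class
Data$time <- as.POSIXct(as.character(Data$time),
                        format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC")
#Coordinates should fall within valid geographic ranges
stopifnot(all(Data$latitude >= -90 & Data$latitude <= 90))
stopifnot(all(Data$longitude >= -180 & Data$longitude <= 180))
```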

Downloading Open Data from FTP Sites

Oftentimes, datasets are provided for free, but on FTP sites, and practitioners need to be able to access them. R is perfectly capable of downloading and importing data from FTP sites.

In this section, we are going to take a look at…

  • Understanding the basics of downloading data in R
  • Downloading the data with the download.file function
  • Learning how to handle compressed formats
#Load required packages
library(RCurl)
## Loading required package: bitops
library(XML)
#Create a list with all the files on the FTP site
list <- getURL("ftp://ftp.ncdc.noaa.gov/pub/data/gsod/2016/", 
               dirlistonly = TRUE) 
#Clean the list 
FileList <- strsplit(list, split="\r\n")
#Create a new directory where to download these files
DIR <- paste(getwd(),"/NOAAFiles",sep="")
dir.create(DIR)
## Warning in dir.create(DIR): 'E:\Projects\sumendar.github.io\content\post
## \NOAAFiles' already exists
#Loop to download the files
for(FileName in unlist(FileList)){
  URL <- paste0("ftp://ftp.ncdc.noaa.gov/pub/data/gsod/2016/",FileName)
  download.file(URL, destfile=paste0(DIR,"/",FileName), method="auto", 
                mode="wb")
}
#A more elegant way
DownloadFile <- function(x){
  URL <- paste0("ftp://ftp.ncdc.noaa.gov/pub/data/gsod/2016/",x)
  download.file(URL, destfile=paste0(DIR,"/",x), method="auto", mode="wb")
}
lapply(unlist(FileList)[1:5], DownloadFile)
#Download a compressed file
URL <- "ftp://ftp.ncdc.noaa.gov/pub/data/gsod/2015/gsod_2015.tar"
download.file(URL, destfile=paste0(DIR,"/gsod_2015.tar"),
              method="auto",mode="wb")

untar(paste0(getwd(),"/NOAAFiles/","gsod_2015.tar"), 
      exdir=paste0(getwd(),"/NOAAFiles"))
help(untar)
#For more information on the full experiment please visit:
#http://r-video-tutorial.blogspot.ch/2014/12/accessing-cleaning-and-plotting-noaa.html
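
Besides .tar archives, R can read gzip-compressed text files directly through a connection, without unpacking them first. A minimal sketch with a made-up gzipped CSV (the real GSOD files use a different, fixed-width layout):

```r
#Write a small made-up gzipped CSV to a temporary file
tmp <- tempfile(fileext = ".csv.gz")
con <- gzfile(tmp, open = "wt")
writeLines(c("station,temp", "A,8.0", "B,7.5"), con)
close(con)
#read.table reads the compressed file transparently via gzfile()
df <- read.table(gzfile(tmp), sep = ",", header = TRUE)
```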

Importing with readLines (The Last Resort)

Some data cannot be opened with either read.table or read.fwf.
In these desperate cases, readLines can help.
In this section, we are going to use readLines to import a raw NOAA weather file and extract its fields by character position.

#Download the data from the FTP site
URL <- "ftp://ftp.ncdc.noaa.gov/pub/data/noaa/2015/010231-99999-2015.gz"
FileName <- "010231-99999-2015.gz"
download.file(URL, destfile=paste0(getwd(),"/",FileName), method="auto", mode="wb")
data.strings <- readLines(gzfile(FileName, open="rt"))
## Warning in readLines(gzfile(FileName, open = "rt")): seek on a gzfile
## connection returned an internal error
head(data.strings)
## [1] "0071010231999992015010100204+64350+007800FM-15+000099999V0202201N021119999999N999999999+00801+00701999999ADDMA1100401999999REMMET044METAR ENDR 010020Z AUTO 22041KT 08/07 Q1004="
## [2] "0071010231999992015010100504+64350+007800FM-15+000099999V0202201N020619999999N999999999+00801+00701999999ADDMA1100301999999REMMET044METAR ENDR 010050Z AUTO 22040KT 08/07 Q1003="
## [3] "0071010231999992015010101204+64350+007800FM-15+000099999V0202201N020619999999N999999999+00801+00701999999ADDMA1100301999999REMMET044METAR ENDR 010120Z AUTO 22040KT 08/07 Q1003="
## [4] "0071010231999992015010101504+64350+007800FM-15+000099999V0202201N019019999999N999999999+00801+00701999999ADDMA1100301999999REMMET044METAR ENDR 010150Z AUTO 22037KT 08/07 Q1003="
## [5] "0071010231999992015010102204+64350+007800FM-15+000099999V0202201N017519999999N999999999+00801+00801999999ADDMA1100301999999REMMET044METAR ENDR 010220Z AUTO 22034KT 08/08 Q1003="
## [6] "0071010231999992015010102504+64350+007800FM-15+000099999V0202201N017519999999N999999999+00801+00701999999ADDMA1100201999999REMMET044METAR ENDR 010250Z AUTO 22034KT 08/07 Q1002="
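
Each line above is a fixed-width string, so individual fields can be cut out by character position with substr. A toy sketch with made-up positions (the real NOAA positions are used in the extraction functions that follow):

```r
#Toy fixed-width records: id (chars 1-4), lat*1000 (5-10),
#lon*1000 (11-17), temp*10 (18-22) -- positions invented for this sketch
lines <- c("ID01+64350+007800+0080",
           "ID02+64360+007810+0075")
lat  <- as.numeric(substr(lines, 5, 10))  / 1000
lon  <- as.numeric(substr(lines, 11, 17)) / 1000
temp <- as.numeric(substr(lines, 18, 22)) / 10
```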

Function to extract the latitude

Ext.Latitude <- function(x){
  substr(x, start=29, stop=34)
}

Function to extract the longitude

Ext.Longitude <- function(x){
  substr(x, start=35, stop=41)
}

Function to extract the temperature

Ext.Temp <- function(x){
  substr(x, start=88, stop=92)
}

Applying the extraction functions with lapply

LAT <- lapply(data.strings, Ext.Latitude)
LON <- lapply(data.strings, Ext.Longitude)
TEMP <- lapply(data.strings, Ext.Temp)

Create a data.frame we can use for data analysis

DATA <- data.frame(Latitude=as.numeric(unlist(LAT))/1000,
                   Longitude=as.numeric(unlist(LON))/1000,
                   Temperature=as.numeric(unlist(TEMP))/10)

Replacing the missing-value code with NA

DATA[DATA$Temperature==999.9,"Temperature"] <- NA

str(DATA)
## 'data.frame':    17291 obs. of  3 variables:
##  $ Latitude   : num  64.3 64.3 64.3 64.3 64.3 ...
##  $ Longitude  : num  7.8 7.8 7.8 7.8 7.8 7.8 7.8 7.8 7.8 7.8 ...
##  $ Temperature: num  8 8 8 8 8 8 8 8 8 8 ...
hist(DATA$Temperature, main="Temperature")
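
With the 999.9 missing-value codes replaced by NA, summary statistics need na.rm=TRUE to skip them. A small sketch with made-up readings:

```r
#Made-up temperature readings with one missing value
temps <- c(8.0, 7.5, NA, 8.3)
mean(temps)                #NA: missing values propagate by default
mean(temps, na.rm = TRUE)  #skips the NA
range(temps, na.rm = TRUE)
```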
