In this post, we are going to take a look at various data importing techniques used for spatial data analysis.
Importing Data from Tables (read.table)
- Opening data
- Importing CSV files
- Checking the data structure for consistency
Accessing and importing open-access environmental data is a crucial skill for data scientists. This section teaches you how to download data from the Web, import it into R and check it for consistency.
In this section, we are going to take a look at…
- Download open-access data from the USGS website
- Import it into R using read.table
- Check its structure to start exploring the data
#Set the URL of the CSV file
URL <- "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.csv"
#Load the CSV File
Data <- read.table(file=URL,
                   sep=",",
                   header=TRUE,
                   na.strings="")
#Help function
help(read.table)
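Because the file is a plain CSV, read.csv is a handy shorthand: it is read.table with header=TRUE and sep="," already set, so the call below should produce the same data frame.
#Alternative: read.csv presets sep="," and header=TRUE
Data <- read.csv(file=URL, na.strings="")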
#Examining the data
str(Data)
## 'data.frame': 153 obs. of 22 variables:
## $ time : Factor w/ 153 levels "2018-01-21T19:05:13.020Z",..: 153 152 151 150 149 148 147 146 145 144 ...
## $ latitude : num 38.8 63.5 33.9 61.4 33.5 ...
## $ longitude : num -123 -149 -117 -150 -117 ...
## $ depth : num 3.38 7.4 0.55 53.3 4.83 5.33 0.8 5.1 2.48 11.8 ...
## $ mag : num 0.36 1.4 1.8 1 0.91 0.53 1 2.1 1.19 1.7 ...
## $ magType : Factor w/ 4 levels "mb","mb_lg","md",..: 3 4 4 4 4 4 4 4 3 4 ...
## $ nst : int 8 NA 35 NA 35 20 NA NA 12 NA ...
## $ gap : num 108 NA 74 NA 36 78 NA NA 223 NA ...
## $ dmin : num 0.00611 NA 0.05287 NA 0.07425 ...
## $ rms : num 0.03 0.42 0.2 0.37 0.15 0.13 0.49 0.8 0.03 0.67 ...
## $ net : Factor w/ 12 levels "ak","ci","hv",..: 5 1 2 1 2 2 1 1 5 1 ...
## $ id : Factor w/ 153 levels "ak18153221","ak18153236",..: 107 31 71 29 70 69 28 27 106 26 ...
## $ updated : Factor w/ 153 levels "2018-01-21T19:24:44.040Z",..: 153 147 138 149 130 131 126 122 151 120 ...
## $ place : Factor w/ 133 levels "103km WSW of Healy, Alaska",..: 7 23 89 131 6 128 1 98 118 71 ...
## $ type : Factor w/ 3 levels "earthquake","explosion",..: 1 1 3 1 1 1 1 1 1 1 ...
## $ horizontalError: num 0.97 NA 0.28 NA 0.17 0.24 NA NA 0.74 NA ...
## $ depthError : num 1.94 0.4 31.61 0.5 1.05 ...
## $ magError : num NA NA 0.137 NA 0.099 0.068 NA NA 0.16 NA ...
## $ magNst : int 1 NA 28 NA 25 11 NA NA 11 NA ...
## $ status : Factor w/ 2 levels "automatic","reviewed": 1 1 1 2 2 2 1 1 1 1 ...
## $ locationSource : Factor w/ 13 levels "ak","ci","guc",..: 6 1 2 1 2 2 1 1 6 1 ...
## $ magSource : Factor w/ 12 levels "ak","ci","hv",..: 5 1 2 1 2 2 1 1 5 1 ...
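str is only the first consistency check. A quick look with summary and head (using the columns listed above) helps spot out-of-range coordinates or magnitudes before any analysis; a minimal sketch:
#Quick sanity checks on the imported data
summary(Data$mag)
summary(Data[, c("latitude","longitude","depth")])
head(Data[, c("time","latitude","longitude","depth","mag")])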
Downloading Open Data from FTP Sites
Oftentimes, datasets are provided for free on FTP sites, and practitioners need to be able to access them. R is perfectly capable of downloading and importing data from FTP sites.
In this section, we are going to take a look at…
- Understand the basics of downloading data in R
- Download the data with the download.file function
- Learn how to handle compressed formats
#Load required packages
library(RCurl)
## Loading required package: bitops
library(XML)
#Create a list with all the files on the FTP site
list <- getURL("ftp://ftp.ncdc.noaa.gov/pub/data/gsod/2016/",
               dirlistonly = TRUE)
#Clean the list
FileList <- strsplit(list, split="\r\n")
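Before starting any downloads, it is worth peeking at the cleaned list to make sure the split produced sensible file names:
#Inspect the first few file names and count them
head(unlist(FileList))
length(unlist(FileList))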
#Create a new directory where to download these files
DIR <- paste(getwd(),"/NOAAFiles",sep="")
dir.create(DIR)
## Warning in dir.create(DIR): 'E:\Projects\sumendar.github.io\content\post
## \NOAAFiles' already exists
#Loop to download the files
for(FileName in unlist(FileList)){
  URL <- paste0("ftp://ftp.ncdc.noaa.gov/pub/data/gsod/2016/",FileName)
  download.file(URL, destfile=paste0(DIR,"/",FileName), method="auto",
                mode="wb")
}
#A more elegant way
DownloadFile <- function(x){
  URL <- paste0("ftp://ftp.ncdc.noaa.gov/pub/data/gsod/2016/",x)
  download.file(URL, destfile=paste0(DIR,"/",x), method="auto", mode="wb")
}
lapply(unlist(FileList)[1:5], DownloadFile)
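FTP downloads occasionally fail midway. As a hedged variant of the same idea, the sketch below (the function name DownloadFileSafe is just an illustrative choice) wraps download.file in tryCatch so a single broken file does not stop the loop:
#Sketch: skip files that fail to download instead of stopping the loop
DownloadFileSafe <- function(x){
  URL <- paste0("ftp://ftp.ncdc.noaa.gov/pub/data/gsod/2016/",x)
  tryCatch(download.file(URL, destfile=paste0(DIR,"/",x),
                         method="auto", mode="wb"),
           error=function(e) warning(paste("Skipping", x)))
}
lapply(unlist(FileList)[1:5], DownloadFileSafe)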
#Download a compressed file
URL <- "ftp://ftp.ncdc.noaa.gov/pub/data/gsod/2015/gsod_2015.tar"
download.file(URL, destfile=paste0(DIR,"/gsod_2015.tar"),
              method="auto", mode="wb")
untar(paste0(getwd(),"/NOAAFiles/","gsod_2015.tar"),
      exdir=paste0(getwd(),"/NOAAFiles"))
help(unzip)
#For more information on the full experiment please visit:
#http://r-video-tutorial.blogspot.ch/2014/12/accessing-cleaning-and-plotting-noaa.html
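Assuming the archive unpacked into DIR as above, listing the directory is a quick way to confirm the extracted station files are in place:
#Check what ended up in the download directory
head(list.files(DIR))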
Importing with readLines (The Last Resort)
Some data cannot be opened with either read.table
or read.fwf.
In these desperate cases, readLines
can help.
In this section, we are going to take a look at…
- Download and read a compressed fixed-width file with readLines
- Extract individual fields by character position with substr
- Assemble the results into a data.frame ready for analysis
#Download the data from the FTP site
URL <- "ftp://ftp.ncdc.noaa.gov/pub/data/noaa/2015/010231-99999-2015.gz"
FileName <- "010231-99999-2015.gz"
download.file(URL, destfile=paste0(getwd(),"/",FileName), method="auto", mode="wb")
data.strings <- readLines(gzfile(FileName, open="rt"))
## Warning in readLines(gzfile(FileName, open = "rt")): seek on a gzfile
## connection returned an internal error
head(data.strings)
## [1] "0071010231999992015010100204+64350+007800FM-15+000099999V0202201N021119999999N999999999+00801+00701999999ADDMA1100401999999REMMET044METAR ENDR 010020Z AUTO 22041KT 08/07 Q1004="
## [2] "0071010231999992015010100504+64350+007800FM-15+000099999V0202201N020619999999N999999999+00801+00701999999ADDMA1100301999999REMMET044METAR ENDR 010050Z AUTO 22040KT 08/07 Q1003="
## [3] "0071010231999992015010101204+64350+007800FM-15+000099999V0202201N020619999999N999999999+00801+00701999999ADDMA1100301999999REMMET044METAR ENDR 010120Z AUTO 22040KT 08/07 Q1003="
## [4] "0071010231999992015010101504+64350+007800FM-15+000099999V0202201N019019999999N999999999+00801+00701999999ADDMA1100301999999REMMET044METAR ENDR 010150Z AUTO 22037KT 08/07 Q1003="
## [5] "0071010231999992015010102204+64350+007800FM-15+000099999V0202201N017519999999N999999999+00801+00801999999ADDMA1100301999999REMMET044METAR ENDR 010220Z AUTO 22034KT 08/08 Q1003="
## [6] "0071010231999992015010102504+64350+007800FM-15+000099999V0202201N017519999999N999999999+00801+00701999999ADDMA1100201999999REMMET044METAR ENDR 010250Z AUTO 22034KT 08/07 Q1002="
#Function to extract the latitude (characters 29 to 34 of each record)
Ext.Latitude <- function(x){
  substr(x, start=29, stop=34)
}
#Function to extract the longitude (characters 35 to 41)
Ext.Longitude <- function(x){
  substr(x, start=35, stop=41)
}
#Function to extract the air temperature (characters 88 to 92)
Ext.Temp <- function(x){
  substr(x, start=88, stop=92)
}
#Apply the extraction functions to every record with lapply
LAT <- lapply(data.strings, Ext.Latitude)
LON <- lapply(data.strings, Ext.Longitude)
TEMP <- lapply(data.strings, Ext.Temp)
#Create a data.frame we can use for data analysis
DATA <- data.frame(Latitude=as.numeric(unlist(LAT))/1000,
                   Longitude=as.numeric(unlist(LON))/1000,
                   Temperature=as.numeric(unlist(TEMP))/10)
#Replace the missing-value code 999.9 with NA
DATA[DATA$Temperature==999.9,"Temperature"] <- NA
str(DATA)
## 'data.frame': 17291 obs. of 3 variables:
## $ Latitude : num 64.3 64.3 64.3 64.3 64.3 ...
## $ Longitude : num 7.8 7.8 7.8 7.8 7.8 7.8 7.8 7.8 7.8 7.8 ...
## $ Temperature: num 8 8 8 8 8 8 8 8 8 8 ...
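As a last check before plotting, a summary of the temperature column confirms the 999.9 placeholders were converted to NA:
#Check the cleaned temperature values
summary(DATA$Temperature)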
hist(DATA$Temperature, main="Temperature")