Introduction

The goal of this exercise is to evaluate the performance of various R packages designed for accessing MODIS data, specifically focusing on data access and import times. To do this, we will download and import a one-year monthly time series of MODIS Normalized Difference Vegetation Index (NDVI) at a 1 km spatial resolution for the entire country of Madagascar.

Other R packages available for accessing MODIS data

The MODIS package offers access to some MODIS data through global online data archives, but it lacks comprehensive documentation and was removed from the CRAN repository in 2023 for policy violations. Furthermore, some of its dependencies are not available anymore on CRAN.

The MODIStsp package provides both a command-line and a user-friendly graphical interface to extract MODIS data in standard TIFF or original HDF formats. However, it does not allow to extract data at a sub-tile level (i.e. spatial subsetting capabilities are limited), and it was also removed from the CRAN repository in 2023 at the maintainer’s request.

The MODIStools package serves as a programmatic interface to the ‘MODIS Land Products Subsets’ web service, providing access to 46 MODIS and VIIRS collections. This package, available on CRAN, extracts data at specific points or buffer zones around a point. However, for area-based queries, the maximum area size is 200 km x 200 km. This makes it suitable for point-based data extraction but less effective for wide area-based queries.

The appeears package acts as a programmatic interface to the NASA AppEEARS API services. AppEEARS is a NASA-built application that offers a simple and efficient way to access and transform geospatial data from a variety of federal data archives (including MODIS and VIIRS, but not GPM). AppEEARS allows accessing data from various NASA federal archives, including MODIS and VIIRS, and enables users to subset geospatial datasets using spatial, temporal, and band/layer parameters. Indeed, as for modisfast, AppEEARS uses OPeNDAP. While similar to modisfast, appeears offers a broader range of data sources but has a latency period (ranging from minutes to hours) for query processing due to server-side post-processing (mosaicking, reprojection, etc.).

Finally, some R packages, such as rgee [@rgee], rely on proprietary software or data access protocols and are not discussed here for that reason.

System benchmarking

Before proceeding to the test, some system benchmarking :

Sys.time()
#> [1] "2024-11-09 10:08:28 CET"
Sys.info()
#>                                                            sysname 
#>                                                            "Linux" 
#>                                                            release 
#>                                                 "6.8.0-48-generic" 
#>                                                            version 
#> "#48~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Oct  7 11:24:13 UTC 2" 
#>                                                           nodename 
#>                                           "ptaconet-Latitude-7490" 
#>                                                            machine 
#>                                                           "x86_64" 
#>                                                              login 
#>                                                         "ptaconet" 
#>                                                               user 
#>                                                         "ptaconet" 
#>                                                     effective_user 
#>                                                         "ptaconet"
benchmarkme::get_r_version()
#> $platform
#> [1] "x86_64-pc-linux-gnu"
#> 
#> $arch
#> [1] "x86_64"
#> 
#> $os
#> [1] "linux-gnu"
#> 
#> $system
#> [1] "x86_64, linux-gnu"
#> 
#> $status
#> [1] ""
#> 
#> $major
#> [1] "4"
#> 
#> $minor
#> [1] "4.0"
#> 
#> $year
#> [1] "2024"
#> 
#> $month
#> [1] "04"
#> 
#> $day
#> [1] "24"
#> 
#> $`svn rev`
#> [1] "86474"
#> 
#> $language
#> [1] "R"
#> 
#> $version.string
#> [1] "R version 4.4.0 (2024-04-24)"
#> 
#> $nickname
#> [1] "Puppy Cup"
benchmarkme::get_cpu()
#> $vendor_id
#> [1] "GenuineIntel"
#> 
#> $model_name
#> [1] "Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz"
#> 
#> $no_of_cores
#> [1] 8
benchmarkme::get_ram()
#> 33.5 GB

appeears

appears_start_time <- Sys.time()

roi <- st_as_sf(data.frame(id = "madagascar", geom = "POLYGON((41.95 -11.37,51.26 -11.37,51.26 -26.17,41.95 -26.17,41.95 -11.37))"), wkt="geom", crs = 4326) 
time_range <- as.Date(c("2023-01-01","2023-12-31"))
collection <- "MOD13A3.061"  
variables <- c("_1_km_monthly_NDVI") 

# Replace Sys.getenv("earthdata_un") and Sys.getenv("earthdata_pw") below by your Earthdata credentials  (see https://github.com/bluegreen-labs/appeears#use to setup an appeears connection)

rs_set_key(
  user = Sys.getenv("earthdata_un"),
  password = Sys.getenv("earthdata_pw")
  )

token <- rs_login(user = Sys.getenv("earthdata_un"))

df <- data.frame(
  task = "polygon_test",
  subtask = "subtask",
  start = as.character(time_range[1]),
  end = as.character(time_range[2]),
  product = collection,
  layer = variables
)

# create the task
task <- rs_build_task(
  df = df,
  roi = roi,
  format = "netcdf4"
)

# request the task to be executed
rs_request(
  request = task,
  user = Sys.getenv("earthdata_un"),
  transfer = TRUE,
  verbose = TRUE
)

r <- terra::rast(file.path(tempdir(),df$task,"MOD13A3.061_1km_aid0001.nc"))

r
#> class       : SpatRaster 
#> dimensions  : 1777, 1118, 24  (nrow, ncol, nlyr)
#> resolution  : 0.008333333, 0.008333333  (x, y)
#> extent      : 41.95, 51.26667, -26.175, -11.36667  (xmin, xmax, ymin, ymax)
#> coord. ref. : +proj=longlat +datum=WGS84 +no_defs 
#> sources     : MOD13A3.061_1km_aid0001.nc:_1_km_monthly_NDVI  (12 layers) 
#>               MOD13A3.061_1km_aid0001.nc:_1_km_monthly_VI_Quality  (12 layers) 
#> varnames    : _1_km_monthly_NDVI (1 km monthly NDVI) 
#>               _1_km_monthly_VI_Quality (1 km monthly VI Quality) 
#> names       : _1_km~DVI_1, _1_km~DVI_2, _1_km~DVI_3, _1_km~DVI_4, _1_km~DVI_5, _1_km~DVI_6, ... 
#> unit        :        NDVI,        NDVI,        NDVI,        NDVI,        NDVI,        NDVI, ... 
#> time (days) : 2023-01-01 to 2023-12-01


appears_end_time <- Sys.time()

modisfast

modisfast_start_time <- Sys.time()

roi <- st_as_sf(data.frame(id = "madagascar", geom = "POLYGON((41.95 -11.37,51.26 -11.37,51.26 -26.17,41.95 -26.17,41.95 -11.37))"), wkt="geom", crs = 4326) 
time_range <- as.Date(c("2023-01-01","2023-12-31"))
collection <- "MOD13A3.061"  
variables <- c("_1_km_monthly_NDVI") 

log <- mf_login(credentials = c(Sys.getenv("earthdata_un"), Sys.getenv("earthdata_pw")))
#> Checking credentials...

urls <- mf_get_url(
  collection = collection,
  variables = variables,
  roi = roi,
  time_range = time_range
 )
#> Building the URLs...
#> Estimated maximum size of data to be downloaded is 97 Mb

res_dl <- mf_download_data(urls, parallel = T)
#> 5  datasets in total :  0  already downloaded and  5  datasets to download
#> Downloading the data ... (destination folder: /tmp/RtmpbWJaGX/modisfast_405f7f2ea9cf )
#> Estimated maximum size of data to be downloaded is ~ 97 Mb
#> 
#> Actual size of downloaded data is 24 Mb

r <- mf_import_data(
  path = dirname(res_dl$destfile[1]),
  collection = collection, 
  proj_epsg = 4326
  )
#> Importing the dataset as a SpatRaster object...

r
#> class       : SpatRaster 
#> dimensions  : 1508, 1796, 12  (nrow, ncol, nlyr)
#> resolution  : 0.009861233, 0.009861233  (x, y)
#> extent      : 38.40805, 56.11883, -26.24157, -11.37083  (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#> source(s)   : memory
#> names       : _1_km~DVI_1, _1_km~DVI_2, _1_km~DVI_3, _1_km~DVI_4, _1_km~DVI_5, _1_km~DVI_6, ... 
#> min values  :     -0.1992,  -0.1877457,  -0.1953000,  -0.1842000,   -0.197400,  -0.1993000, ... 
#> max values  :      0.9764,   0.9764000,   0.9748999,   0.9725385,    0.973778,   0.9645233, ... 
#> time (days) : 2023-01-01 to 2023-12-01

modisfast_end_time <- Sys.time()

MODIStsp

modistsp_start_time <- Sys.time()

MODIStsp(gui = FALSE, 
         parallel = TRUE,
         selprod = "Vegetation_Indexes_Monthly_1Km (M*D13A3)",
         prod_version = "061",
         bandsel = "NDVI",
         sensor = "Terra",
         start_date = gsub("-",".",as.character(time_range[1])),
         end_date   = gsub("-",".",as.character(time_range[2])),
         user = Sys.getenv("earthdata_un"),
         password = Sys.getenv("earthdata_pw"),
         bbox = c(41.95,-26.17,51.26,-11.37),
         download_range = "Full",
         spatmeth = "bbox",
         out_projsel = "User Defined",
         output_proj = "4326",
         out_res_sel = "Native",
         out_format = "GTiff"
)

modistsp_end_time <- Sys.time()

The system returns that “Username and/or password are not valid”, although they are those successfully used with both modisfast and appeears. We could therefore not finish the test with MODIStsp.

MODIStools

MOD13A3 is not available with MODIStools, so we run the exercise with MOD21A2 (spatial resolution : 1 km ; temporal resolution : 8 day). Because the temporal resolution is 4 fold lower than MOD13A3 (8 days instead of 1 month), we also divide the time range by 4 to have comparable data.

modistools_start_time <- Sys.time()

subset <- mt_subset(product = "MOD21A2",
                    lat = -19.5,
                    lon = 46.5,
                    band = "LST_Day_1KM",
                    start = as.character(time_range[1]),
                    end = "2023-03-12",
                    km_lr = 500,
                    km_ab = 900,
                    site_name = "madagascar",
                    internal = TRUE,
                    progress = FALSE)

modistools_end_time <- Sys.time()

The system returns that “Number of kilometers above or below the subset location must be less than or equal to 100.”, which is not enough for the whole size of Madagascar. MODIStools enables to download data for 200 km x 200 km areas maximum. We could therefore not finish the test with MODIStools.

Results

modisfast and appeears passed the test, MODIStools and MODIStsp did not (for different reasons). Let’s see the duration of data processing access :


# with modisfast
round(modisfast_end_time - modisfast_start_time,2)
#> Time difference of 1.21 mins

# with appeears
round(appears_end_time - appears_start_time,2)
#> Time difference of 3.91 mins

We see here that accessing and importing the time-series with modisfast was much faster than with appeears ! Once again, appeears has a variable latency period for query processing, ranging from minutes to hours, due to server-side post-processing. modisfast does not have such latency.