Title: | Automated Cleaning of Occurrence Records from Biological Collections |
---|---|
Description: | Automated flagging of common spatial and temporal errors in biological and paleontological collection data, for use in conservation, ecology and paleontology. Includes automated tests to easily flag (and exclude) records assigned to country or province centroids, the open ocean, the headquarters of the Global Biodiversity Information Facility, urban areas or the location of biodiversity institutions (museums, zoos, botanical gardens, universities). Furthermore, identifies per-species outlier coordinates, zero coordinates, identical latitude/longitude and invalid coordinates. Also implements an algorithm to identify data sets with a significant proportion of rounded coordinates. Especially suited for large data sets. The reference for the methodology is: Zizka et al. (2019) <doi:10.1111/2041-210X.13152>. |
Authors: | Alexander Zizka [aut, cre], Daniele Silvestro [ctb], Tobias Andermann [ctb], Josue Azevedo [ctb], Camila Duarte Ritter [ctb], Daniel Edler [ctb], Harith Farooq [ctb], Andrei Herdean [ctb], Maria Ariza [ctb], Ruud Scharn [ctb], Sten Svanteson [ctb], Niklas Wengstrom [ctb], Vera Zizka [ctb], Alexandre Antonelli [ctb], Bruno Vilela [ctb] (Bruno updated the package to remove dependencies on sp, raster, rgdal, maptools, and rgeos packages), Irene Steves [rev] (Irene reviewed the package for ropensci, see <https://github.com/ropensci/onboarding/issues/210>), Francisco Rodriguez-Sanchez [rev] (Francisco reviewed the package for ropensci, see <https://github.com/ropensci/onboarding/issues/210>) |
Maintainer: | Alexander Zizka <[email protected]> |
License: | GPL-3 |
Version: | 3.0.1 |
Built: | 2024-11-27 03:39:58 UTC |
Source: | https://github.com/ropensci/CoordinateCleaner |
A data frame with information on the Artificial Hotspot Occurrence Inventory (AHOI), as published in Park et al. (2023). For more details see the reference.
https://onlinelibrary.wiley.com/doi/10.1111/jbi.14543
Park, D. S., Xie, Y., Thammavong, H. T., Tulaiha, R., & Feng, X. (2023). Artificial Hotspot Occurrence Inventory (AHOI). Journal of Biogeography, 50, 441–449. doi:10.1111/jbi.14543
data("aohi")
A SpatVector with global coastlines, with a 1 degree buffer to extend coastlines, as an alternative reference for cc_sea. Can be useful to identify species in the sea without flagging records in mangroves, marshes, etc.
https://www.naturalearthdata.com/downloads/10m-physical-vectors/
data("buffland")
A SpatVector with global coastlines, with a -1 degree buffer to extend coastlines, as an alternative reference for cc_sea. Can be useful to identify marine species on land without flagging records in estuaries, etc.
https://www.naturalearthdata.com/downloads/10m-physical-vectors/
data("buffsea")
Removes or flags records within the Artificial Hotspot Occurrence Inventory (AHOI). Poorly geo-referenced occurrence records in biological databases are often erroneously geo-referenced to highly recurring coordinates, which were assessed by Park et al. (2023). See the reference for more details.
cc_aohi( x, lon = "decimalLongitude", lat = "decimalLatitude", species = "species", taxa = c("Aves", "Insecta", "Mammalia", "Plantae"), buffer = 10000, geod = TRUE, value = "clean", verbose = TRUE )
x |
data.frame. Containing geographical coordinates and species names. |
lon |
character string. The column with the longitude coordinates. Default = “decimalLongitude”. |
lat |
character string. The column with the latitude coordinates. Default = “decimalLatitude”. |
species |
character string. The column with the species identity. Only required if verify = TRUE. |
taxa |
The Artificial Hotspot Occurrence Inventory (AHOI) was created based on four taxa: Aves, Insecta, Mammalia and Plantae. Users can keep all of them, or any subset, to define the AHOI locations. Default is to keep all: c("Aves", "Insecta", "Mammalia", "Plantae"). |
buffer |
The buffer around each AHOI location, where records should be flagged as problematic. Units depend on geod. Default = 10 kilometres. |
geod |
logical. If TRUE the radius around each AHOI location is calculated based on a sphere, buffer is in meters and independent of latitude. If FALSE the radius is calculated assuming planar coordinates and varies slightly with latitude. Default = TRUE. See https://seethedatablog.wordpress.com/ for detail and credits. |
value |
character string. Defining the output value. See value. |
verbose |
logical. If TRUE reports the name of the test and the number of records flagged. |
Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.
Park, D. S., Xie, Y., Thammavong, H. T., Tulaiha, R., & Feng, X. (2023). Artificial Hotspot Occurrence Inventory (AHOI). Journal of Biogeography, 50, 441–449. doi:10.1111/jbi.14543
Other Coordinates: cc_cap(), cc_cen(), cc_coun(), cc_dupl(), cc_equ(), cc_gbif(), cc_inst(), cc_iucn(), cc_outl(), cc_sea(), cc_urb(), cc_val(), cc_zero()
x <- data.frame(species = letters[1:10],
                decimalLongitude = c(runif(99, -180, 180), -47.92),
                decimalLatitude = c(runif(99, -90, 90), -15.78))
cc_aohi(x)
Removes or flags records within a certain radius around country capitals. Poorly geo-referenced occurrence records in biological databases are often erroneously geo-referenced to capitals.
cc_cap( x, lon = "decimalLongitude", lat = "decimalLatitude", species = "species", buffer = 10000, geod = TRUE, ref = NULL, verify = FALSE, value = "clean", verbose = TRUE )
x |
data.frame. Containing geographical coordinates and species names. |
lon |
character string. The column with the longitude coordinates. Default = “decimalLongitude”. |
lat |
character string. The column with the latitude coordinates. Default = “decimalLatitude”. |
species |
character string. The column with the species identity. Only required if verify = TRUE. |
buffer |
The buffer around each capital coordinate (the centre of the city), where records should be flagged as problematic. Units depend on geod. Default = 10 kilometres. |
geod |
logical. If TRUE the radius around each capital is calculated based on a sphere, buffer is in meters and independent of latitude. If FALSE the radius is calculated assuming planar coordinates and varies slightly with latitude. Default = TRUE. See https://seethedatablog.wordpress.com/ for detail and credits. |
ref |
SpatVector (geometry: polygons). Providing the geographic gazetteer. Can be any SpatVector (geometry: polygons), but the structure must be identical to |
verify |
logical. If TRUE, records are only flagged if they are the only record of a given species flagged close to a given reference. If FALSE, the distance is the only criterion. |
value |
character string. Defining the output value. See value. |
verbose |
logical. If TRUE reports the name of the test and the number of records flagged. |
Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.
Other Coordinates: cc_aohi(), cc_cen(), cc_coun(), cc_dupl(), cc_equ(), cc_gbif(), cc_inst(), cc_iucn(), cc_outl(), cc_sea(), cc_urb(), cc_val(), cc_zero()
## Not run:
x <- data.frame(species = letters[1:10],
                decimalLongitude = c(runif(99, -180, 180), -47.882778),
                decimalLatitude = c(runif(99, -90, 90), -15.793889))
cc_cap(x)
cc_cap(x, value = "flagged")
## End(Not run)
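The geodesic variant of the buffer test (geod = TRUE) can be illustrated with a great-circle distance to a single reference point. The following is a minimal base R sketch, not the package's implementation: the helper haversine_m, the spherical Earth radius of 6371 km and the single reference point (the Brasilia coordinates from the example above) are assumptions made for illustration.

```r
# Great-circle (haversine) distance in metres between points in decimal degrees.
# Assumes a spherical Earth with a mean radius of 6371 km.
haversine_m <- function(lon1, lat1, lon2, lat2, radius = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Hypothetical reference point: the coordinates used in the cc_cap example.
ref_lon <- -47.882778
ref_lat <- -15.793889

x <- data.frame(
  species = c("a", "b", "c"),
  decimalLongitude = c(-47.882778, -47.80, 10.00),
  decimalLatitude = c(-15.793889, -15.79, 50.00)
)

buffer <- 10000  # metres, mirroring the documented default of cc_cap
dist_m <- haversine_m(x$decimalLongitude, x$decimalLatitude, ref_lon, ref_lat)

# Same convention as value = "flagged": TRUE = passed, FALSE = problematic.
passed <- dist_m > buffer
x_clean <- x[passed, ]  # analogous to value = "clean"
```

Only the third record lies more than 10 km from the reference point, so it is the only one kept in x_clean.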
Removes or flags records within a radius around the geographic centroids of political countries and provinces. Poorly geo-referenced occurrence records in biological databases are often erroneously geo-referenced to centroids.
cc_cen( x, lon = "decimalLongitude", lat = "decimalLatitude", species = "species", buffer = 1000, geod = TRUE, test = "both", ref = NULL, verify = FALSE, value = "clean", verbose = TRUE )
x |
data.frame. Containing geographical coordinates and species names. |
lon |
character string. The column with the longitude coordinates. Default = “decimalLongitude”. |
lat |
character string. The column with the latitude coordinates. Default = “decimalLatitude”. |
species |
character string. The column with the species identity. Only required if verify = TRUE. |
buffer |
numerical. The buffer around each province or country centroid, where records should be flagged as problematic. Units depend on geod. Default = 1 kilometre. |
geod |
logical. If TRUE the radius around each centroid is calculated based on a sphere, buffer is in meters and independent of latitude. If FALSE the radius is calculated assuming planar coordinates and varies slightly with latitude. Default = TRUE. See https://seethedatablog.wordpress.com/ for detail and credits. |
test |
a character string. Specifying the details of the test. One of c(“both”, “country”, “provinces”). If “both”, tests for both country and province centroids. |
ref |
SpatVector (geometry: polygons). Providing the geographic gazetteer. Can be any SpatVector (geometry: polygons), but the structure must be identical to |
verify |
logical. If TRUE, records are only flagged if they are the only record of a given species flagged close to a given reference. If FALSE, the distance is the only criterion. |
value |
character string. Defining the output value. See value. |
verbose |
logical. If TRUE reports the name of the test and the number of records flagged. |
Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.
Other Coordinates: cc_aohi(), cc_cap(), cc_coun(), cc_dupl(), cc_equ(), cc_gbif(), cc_inst(), cc_iucn(), cc_outl(), cc_sea(), cc_urb(), cc_val(), cc_zero()
x <- data.frame(species = letters[1:10],
                decimalLongitude = c(runif(99, -180, 180), -47.92),
                decimalLatitude = c(runif(99, -90, 90), -15.78))
cc_cen(x, geod = FALSE)
## Not run:
cc_inst(x, value = "flagged", buffer = 50000) #geod = T
## End(Not run)
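The geod argument matters because a buffer expressed in degrees does not cover the same ground distance everywhere. As a small base R illustration (a sketch assuming a spherical Earth with a radius of 6371 km; the helper km_per_degree_lon is hypothetical and not part of the package), the length of one degree of longitude along a parallel can be computed as follows.

```r
# Length in km of one degree of longitude along a parallel of latitude `lat`,
# on a sphere of radius 6371 km (an assumed mean Earth radius).
km_per_degree_lon <- function(lat, radius_km = 6371) {
  radius_km * cos(lat * pi / 180) * (pi / 180)
}

lats <- c(0, 30, 60)
km <- km_per_degree_lon(lats)

# One row per latitude, for inspection.
data.frame(latitude = lats, km_per_degree = round(km, 1))
```

One degree of longitude spans roughly 111 km at the equator and half of that at 60 degrees latitude, which is why a planar buffer in degrees depends on latitude while a geodesic buffer in metres does not.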
Removes or flags mismatches between geographic coordinates and additional country information (usually this information is reliably reported with specimens). Such a mismatch can occur, for example, if latitude and longitude are switched.
cc_coun( x, lon = "decimalLongitude", lat = "decimalLatitude", iso3 = "countrycode", value = "clean", ref = NULL, ref_col = "iso_a3", verbose = TRUE, buffer = NULL )
x |
data.frame. Containing geographical coordinates and species names. |
lon |
character string. The column with the longitude coordinates. Default = “decimalLongitude”. |
lat |
character string. The column with the latitude coordinates. Default = “decimalLatitude”. |
iso3 |
a character string. The column with the country assignment of each record in three letter ISO code. Default = “countrycode”. |
value |
character string. Defining the output value. See value. |
ref |
SpatVector (geometry: polygons). Providing the geographic gazetteer. Can be any SpatVector (geometry: polygons), but the structure must be identical to |
ref_col |
the column name in the reference dataset, containing the relevant ISO codes for matching. Default = "iso_a3", which refers to the ISO-3 codes in the reference dataset. See notes. |
verbose |
logical. If TRUE reports the name of the test and the number of records flagged. |
buffer |
numeric. Units are in meters. If provided, a buffer is created around each country polygon. |
Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
The ref_col argument allows adapting the function to the structure of alternative reference datasets. For instance, for rnaturalearth::ne_countries(scale = "small"), ref_col = "iso_a3" will work.
With the default reference, records are flagged if they fall outside the terrestrial territory of countries, hence records in territorial waters might be flagged. See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.
Other Coordinates: cc_aohi(), cc_cap(), cc_cen(), cc_dupl(), cc_equ(), cc_gbif(), cc_inst(), cc_iucn(), cc_outl(), cc_sea(), cc_urb(), cc_val(), cc_zero()
## Not run:
x <- data.frame(species = letters[1:10],
                decimalLongitude = runif(100, -20, 30),
                decimalLatitude = runif(100, 35, 60),
                countrycode = "RUS")
cc_coun(x, value = "flagged") # non-terrestrial records are flagged as wrong
## End(Not run)
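To see why switched latitude and longitude produce a country mismatch, the idea can be reduced to a toy case. This base R sketch is an illustration only: the rectangular "country" with the hypothetical code "AAA" stands in for a real country polygon, and the package's actual test uses polygon references rather than bounding boxes.

```r
# Toy reference: one rectangular "country" with the hypothetical ISO-3 code "AAA".
ref <- data.frame(iso3 = "AAA",
                  lon_min = 5, lon_max = 15,
                  lat_min = 47, lat_max = 55)

x <- data.frame(
  species = c("a", "b"),
  decimalLongitude = c(10, 50),  # second record: latitude and longitude switched
  decimalLatitude = c(50, 10),
  countrycode = c("AAA", "AAA")
)

# Look up the rectangle that belongs to each record's country code.
idx <- match(x$countrycode, ref$iso3)

# TRUE = coordinates agree with the stated country, FALSE = mismatch.
passed <- !is.na(idx) &
  x$decimalLongitude >= ref$lon_min[idx] &
  x$decimalLongitude <= ref$lon_max[idx] &
  x$decimalLatitude >= ref$lat_min[idx] &
  x$decimalLatitude <= ref$lat_max[idx]
```

The first record falls inside the rectangle, while the record with swapped coordinates falls outside it and would be flagged.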
Removes or flags duplicated records based on species name and coordinates, as well as user-defined additional columns. True (specimen) duplicates or duplicates from the same species can make up the bulk of records in a biological collection database, but are undesirable for many analyses. Both can be flagged with this function, the former given enough additional information.
cc_dupl( x, lon = "decimalLongitude", lat = "decimalLatitude", species = "species", additions = NULL, value = "clean", verbose = TRUE )
x |
data.frame. Containing geographical coordinates and species names. |
lon |
character string. The column with the longitude coordinates. Default = “decimalLongitude”. |
lat |
character string. The column with the latitude coordinates. Default = “decimalLatitude”. |
species |
a character string. The column with the species name. Default = “species”. |
additions |
a vector of character strings. Additional columns to be included in the test for duplication. For example as below, collector name and collector number. |
value |
character string. Defining the output value. See value. |
verbose |
logical. If TRUE reports the name of the test and the number of records flagged. |
Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
Other Coordinates: cc_aohi(), cc_cap(), cc_cen(), cc_coun(), cc_equ(), cc_gbif(), cc_inst(), cc_iucn(), cc_outl(), cc_sea(), cc_urb(), cc_val(), cc_zero()
x <- data.frame(species = letters[1:10],
                decimalLongitude = sample(x = 0:10, size = 100, replace = TRUE),
                decimalLatitude = sample(x = 0:10, size = 100, replace = TRUE),
                collector = "Bonpl",
                collector.number = c(1001, 354),
                collection = rep(c("K", "WAG", "FR", "P", "S"), 20))
cc_dupl(x, value = "flagged")
cc_dupl(x, additions = c("collector", "collector.number"))
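The effect of the additions argument can be sketched with base R's duplicated(). This is an illustration under the assumption that the first occurrence of each key combination is kept; it is not taken from the package source, and the toy data and variable names are hypothetical.

```r
x <- data.frame(
  species = c("a", "a", "a", "b"),
  decimalLongitude = c(1, 1, 1, 1),
  decimalLatitude = c(2, 2, 2, 2),
  collector = c("Bonpl", "Bonpl", "Humb", "Bonpl")
)

key_cols <- c("species", "decimalLongitude", "decimalLatitude")

# Duplicates by species and coordinates only.
passed_basic <- !duplicated(x[, key_cols])

# Duplicates by species, coordinates and an additional column
# (comparable to additions = "collector").
passed_strict <- !duplicated(x[, c(key_cols, "collector")])
```

With the additional column, the third record is no longer treated as a duplicate because it comes from a different collector.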
Removes or flags records with equal latitude and longitude coordinates, either exact or absolute. Equal coordinates can often indicate data entry errors.
cc_equ( x, lon = "decimalLongitude", lat = "decimalLatitude", test = "absolute", value = "clean", verbose = TRUE )
x |
data.frame. Containing geographical coordinates and species names. |
lon |
character string. The column with the longitude coordinates. Default = “decimalLongitude”. |
lat |
character string. The column with the latitude coordinates. Default = “decimalLatitude”. |
test |
character string. Defines if coordinates are compared exactly (“identical”) or on the absolute scale (i.e. -1 = 1, “absolute”). Default = “absolute”. |
value |
character string. Defining the output value. See value. |
verbose |
logical. If TRUE reports the name of the test and the number of records flagged. |
Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
Other Coordinates: cc_aohi(), cc_cap(), cc_cen(), cc_coun(), cc_dupl(), cc_gbif(), cc_inst(), cc_iucn(), cc_outl(), cc_sea(), cc_urb(), cc_val(), cc_zero()
x <- data.frame(species = letters[1:10],
                decimalLongitude = runif(100, -180, 180),
                decimalLatitude = runif(100, -90, 90))
cc_equ(x)
cc_equ(x, value = "flagged")
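The difference between the two test modes can be shown with plain comparisons in base R. This is a sketch of the comparison logic as described above, with hypothetical variable names, rather than the package's code.

```r
x <- data.frame(
  decimalLongitude = c(12.5, -12.5, 30.0),
  decimalLatitude = c(12.5, 12.5, -45.0)
)

# "identical": only exactly equal coordinates fail the test.
passed_identical <- x$decimalLongitude != x$decimalLatitude

# "absolute": coordinates that are equal after dropping the sign also fail.
passed_absolute <- abs(x$decimalLongitude) != abs(x$decimalLatitude)
```

The second record (-12.5, 12.5) passes the “identical” comparison but fails the “absolute” one.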
Removes or flags records within a given radius (see buffer) around the GBIF headquarters in Copenhagen, DK.
cc_gbif( x, lon = "decimalLongitude", lat = "decimalLatitude", species = "species", buffer = 1000, geod = TRUE, verify = FALSE, value = "clean", verbose = TRUE )
x |
data.frame. Containing geographical coordinates and species names. |
lon |
character string. The column with the longitude coordinates. Default = “decimalLongitude”. |
lat |
character string. The column with the latitude coordinates. Default = “decimalLatitude”. |
species |
character string. The column with the species identity. Only required if verify = TRUE. |
buffer |
numerical. The buffer around the GBIF headquarters, where records should be flagged as problematic. Units depend on geod. Default = 1000 m. |
geod |
logical. If TRUE the radius is calculated based on a sphere, buffer is in meters. If FALSE the radius is calculated in degrees. Default = TRUE. |
verify |
logical. If TRUE, records are only flagged if they are the only record of a given species flagged close to a given reference. If FALSE, the distance is the only criterion. |
value |
character string. Defining the output value. See value. |
verbose |
logical. If TRUE reports the name of the test and the number of records flagged. |
Not recommended if working with records from Denmark or the Copenhagen area.
Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
Other Coordinates: cc_aohi(), cc_cap(), cc_cen(), cc_coun(), cc_dupl(), cc_equ(), cc_inst(), cc_iucn(), cc_outl(), cc_sea(), cc_urb(), cc_val(), cc_zero()
x <- data.frame(species = "A",
                decimalLongitude = c(12.58, 12.58),
                decimalLatitude = c(55.67, 30.00))
cc_gbif(x)
cc_gbif(x, value = "flagged")
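The relationship between the two output types can be sketched in base R: the “flagged” vector is a logical index, and subsetting the input with it gives the “clean” data. The sketch below assumes a planar radius in degrees (as for geod = FALSE), takes the headquarters location from the first record of the example, and uses a hypothetical radius of 0.5 degrees; none of this is the package's implementation.

```r
x <- data.frame(
  species = "A",
  decimalLongitude = c(12.58, 12.58),
  decimalLatitude = c(55.67, 30.00)
)

# Assumed headquarters location, taken from the first record of the example.
hq_lon <- 12.58
hq_lat <- 55.67
buffer_deg <- 0.5  # hypothetical planar radius in degrees

dist_deg <- sqrt((x$decimalLongitude - hq_lon)^2 +
                   (x$decimalLatitude - hq_lat)^2)

# Like value = "flagged": TRUE = test passed, FALSE = potentially problematic.
flags <- dist_deg > buffer_deg

# Like value = "clean": only the records that passed.
clean <- x[flags, ]
```

Keeping the logical vector is convenient when several tests are combined, for example by multiplying or and-ing the flags of individual tests before subsetting once.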
Removes or flags records assigned to the location of zoos, botanical gardens, herbaria, universities and museums, based on a global database of ~10,000 such biodiversity institutions. Coordinates from these locations can be related to data-entry errors, false automated geo-reference or individuals in captivity/horticulture.
cc_inst( x, lon = "decimalLongitude", lat = "decimalLatitude", species = "species", buffer = 100, geod = FALSE, ref = NULL, verify = FALSE, verify_mltpl = 10, value = "clean", verbose = TRUE )
x |
data.frame. Containing geographical coordinates and species names. |
lon |
character string. The column with the longitude coordinates. Default = “decimalLongitude”. |
lat |
character string. The column with the latitude coordinates. Default = “decimalLatitude”. |
species |
character string. The column with the species identity. Only required if verify = TRUE. |
buffer |
numerical. The buffer around each institution, where records should be flagged as problematic, in decimal degrees. Default = 100m. |
geod |
logical. If TRUE the radius around each institution is calculated based on a sphere, buffer is in meters and independent of latitude. If FALSE the radius is calculated assuming planar coordinates and varies slightly with latitude. Default = FALSE. See https://seethedatablog.wordpress.com/ for detail and credits. |
ref |
SpatVector (geometry: polygons). Providing the geographic gazetteer. Can be any SpatVector (geometry: polygons), but the structure must be identical to |
verify |
logical. If TRUE, records close to institutions are only flagged if there are no other records of the same species in the greater vicinity (a radius of buffer * verify_mltpl). |
verify_mltpl |
numerical. Indicates the factor by which the radius for verify exceeds the radius of the initial test. Default = 10, which might be suitable if geod is TRUE, but might be too large otherwise. |
value |
character string. Defining the output value. See value. |
verbose |
logical. If TRUE reports the name of the test and the number of records flagged. |
Note: the buffer radius is in degrees, thus will differ slightly between different latitudes.
Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
Other Coordinates: cc_aohi(), cc_cap(), cc_cen(), cc_coun(), cc_dupl(), cc_equ(), cc_gbif(), cc_iucn(), cc_outl(), cc_sea(), cc_urb(), cc_val(), cc_zero()
x <- data.frame(species = letters[1:10],
                decimalLongitude = c(runif(99, -180, 180), 37.577800),
                decimalLatitude = c(runif(99, -90, 90), 55.710800))
# large buffer for demonstration, using geod = FALSE for shorter runtime
cc_inst(x, value = "flagged", buffer = 10, geod = FALSE)
## Not run:
cc_inst(x, value = "flagged", buffer = 50000) #geod = T
## End(Not run)
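The verify option described above can be sketched for a single institution in base R. This is one possible reading of the description, not the package's code: a record inside the buffer is only flagged when its species has no further records between buffer and buffer * verify_mltpl. The institution location (the coordinates from the example), the planar distance in degrees and all variable names are assumptions.

```r
# Hypothetical institution location (coordinates from the example above).
inst_lon <- 37.5778
inst_lat <- 55.7108

x <- data.frame(
  species = c("a", "a", "b"),
  decimalLongitude = c(37.5778, 37.60, 37.5778),
  decimalLatitude = c(55.7108, 55.72, 55.7108)
)

buffer <- 0.01      # radius of the initial test, planar degrees
verify_mltpl <- 10  # the wider radius is buffer * verify_mltpl

dist_deg <- sqrt((x$decimalLongitude - inst_lon)^2 +
                   (x$decimalLatitude - inst_lat)^2)

near <- dist_deg <= buffer  # initial test
vicinity <- dist_deg > buffer & dist_deg <= buffer * verify_mltpl

# A species is "supported" if it has records in the wider vicinity.
has_support <- x$species %in% x$species[vicinity]

passed_plain <- !near                    # verify = FALSE
passed_verify <- !(near & !has_support)  # verify = TRUE
```

Species "a" has a second record close by, so its record at the institution is kept under verify, while the isolated record of species "b" is still flagged.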
Removes or flags records outside of the provided natural range polygon, on a per species basis. Expects one entry per species. See the example or https://www.iucnredlist.org/resources/spatial-data-download for the required polygon structure.
cc_iucn( x, range, lon = "decimalLongitude", lat = "decimalLatitude", species = "species", buffer = 0, value = "clean", verbose = TRUE )
x |
data.frame. Containing geographical coordinates and species names. |
range |
a SpatVector of natural ranges for species in x. Must contain a column named as indicated by |
lon |
character string. The column with the longitude coordinates. Default = “decimalLongitude”. |
lat |
character string. The column with the latitude coordinates. Default = “decimalLatitude”. |
species |
a character string. The column with the species name. Default = “species”. |
buffer |
numerical. The buffer around each species' range, from where records should be flagged as problematic, in meters. Default = 0. |
value |
character string. Defining the output value. See value. |
verbose |
logical. If TRUE reports the name of the test and the number of records flagged. |
Download natural range maps in suitable format for amphibians, birds, mammals and reptiles from https://www.iucnredlist.org/resources/spatial-data-download. Note: the buffer radius is in degrees, thus will differ slightly between different latitudes.
Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.
Other Coordinates: cc_aohi(), cc_cap(), cc_cen(), cc_coun(), cc_dupl(), cc_equ(), cc_gbif(), cc_inst(), cc_outl(), cc_sea(), cc_urb(), cc_val(), cc_zero()
library(terra)
x <- data.frame(species = c("A", "B"),
                decimalLongitude = runif(100, -170, 170),
                decimalLatitude = runif(100, -80, 80))
range_species_A <- cbind(c(-45, -45, -60, -60, -45), c(-10, -25, -25, -10, -10))
rangeA <- terra::vect(range_species_A, "polygons")
range_species_B <- cbind(c(15, 15, 32, 32, 15), c(10, -10, -10, 10, 10))
rangeB <- terra::vect(range_species_B, "polygons")
range <- terra::vect(list(rangeA, rangeB))
range$binomial <- c("A", "B")
cc_iucn(x = x, range = range, buffer = 0)
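The per-species range test boils down to a point-in-polygon check against the polygon of the matching species. The following base R sketch uses an even-odd (ray casting) test on the two rectangular ranges from the example; the helper point_in_polygon and the list layout are hypothetical and only illustrate the idea, whereas the package works on SpatVector objects.

```r
# Even-odd (ray casting) test: is the point (px, py) inside the polygon
# described by the vertex vectors vx, vy?
point_in_polygon <- function(px, py, vx, vy) {
  n <- length(vx)
  inside <- FALSE
  j <- n
  for (i in seq_len(n)) {
    # Does the edge from vertex j to vertex i straddle the horizontal line at py?
    if ((vy[i] > py) != (vy[j] > py)) {
      x_cross <- vx[i] + (py - vy[i]) * (vx[j] - vx[i]) / (vy[j] - vy[i])
      if (px < x_cross) inside <- !inside
    }
    j <- i
  }
  inside
}

# The two rectangular ranges from the example, keyed by species name.
ranges <- list(
  A = cbind(lon = c(-45, -45, -60, -60, -45), lat = c(-10, -25, -25, -10, -10)),
  B = cbind(lon = c(15, 15, 32, 32, 15), lat = c(10, -10, -10, 10, 10))
)

x <- data.frame(
  species = c("A", "A", "B", "B"),
  decimalLongitude = c(-50, 20, 20, -50),
  decimalLatitude = c(-15, 0, 0, -15)
)

# TRUE = record lies inside the range of its own species.
passed <- mapply(
  function(sp, lon, lat) {
    poly <- ranges[[sp]]
    point_in_polygon(lon, lat, poly[, "lon"], poly[, "lat"])
  },
  as.character(x$species), x$decimalLongitude, x$decimalLatitude,
  USE.NAMES = FALSE
)
```

A record at (20, 0) passes for species "B" but fails for species "A", which shows that the same coordinate is judged against a different polygon depending on the species column.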
Removes or flags records that are outliers in geographic space according to the method defined via the method argument. Geographic outliers often represent erroneous coordinates, for example due to data entry errors, imprecise geo-references, or individuals in horticulture/captivity.
cc_outl( x, lon = "decimalLongitude", lat = "decimalLatitude", species = "species", method = "quantile", mltpl = 5, tdi = 1000, value = "clean", sampling_thresh = 0, verbose = TRUE, min_occs = 7, thinning = FALSE, thinning_res = 0.5 )
x |
data.frame. Containing geographical coordinates and species names. |
lon |
character string. The column with the longitude coordinates. Default = “decimalLongitude”. |
lat |
character string. The column with the latitude coordinates. Default = “decimalLatitude”. |
species |
character string. The column with the species name. Default = “species”. |
method |
character string. Defining the method for outlier selection. See details. One of “distance”, “quantile”, “mad”. Default = “quantile”. |
mltpl |
numeric. The multiplier of the interquartile range ( |
tdi |
numeric. The minimum absolute distance ( |
value |
character string. Defining the output value. See value. |
sampling_thresh |
numeric. Cut off threshold for the sampling correction. Indicates the quantile of sampling in which outliers should be ignored. For instance, if |
verbose |
logical. If TRUE reports the name of the test and the number of records flagged. |
min_occs |
Minimum number of geographically unique datapoints needed for a species to be tested. This is necessary for reliable outlier estimation. Species with fewer than min_occs records will not be tested and the output value will be 'TRUE'. Default = 7. If |
thinning |
forces a raster approximation for the distance calculation. This is routinely used for species with more than 10,000 records for computational reasons, but can be enforced for smaller datasets, which is recommended when sampling is very uneven. |
thinning_res |
The resolution for the spatial thinning in decimal degrees. Default = 0.5. |
The method for outlier identification depends on the method argument.
If “quantile”: a boxplot method is used and records are flagged as outliers if their mean distance to all other records of the same species is larger than mltpl * the interquartile range of the mean distance of all records of this species.
If “mad”: the median absolute deviation is used. In this case a record is flagged as an outlier if the mean distance to all other records of the same species is larger than the median of the mean distance of all points plus/minus the mad of the mean distances of all records of the species * mltpl.
If “distance”: records are flagged as outliers if the minimum distance to the next record of the species is > tdi.
For species with records from > 10000 unique locations, a random sample of 1000 records is used for the distance matrix calculation. The test skips species with fewer than min_occs geographically unique records.
The likelihood of occurrence records being erroneous outliers is linked to the sampling effort in any given location. To account for this, the sampling correction (see sampling_thresh) fetches the number of occurrence records available from www.gbif.org per country, as a proxy of sampling effort. The outlier test (the mean distance) for each record is then weighted by the log-transformed number of records per square kilometre in this country. See https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13152 for an example and further explanation of the outlier test.
Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.
Other Coordinates: cc_aohi(), cc_cap(), cc_cen(), cc_coun(), cc_dupl(), cc_equ(), cc_gbif(), cc_inst(), cc_iucn(), cc_sea(), cc_urb(), cc_val(), cc_zero()
x <- data.frame(species = letters[1:10],
                decimalLongitude = runif(100, -180, 180),
                decimalLatitude = runif(100, -90, 90))
cc_outl(x)
cc_outl(x, method = "quantile", value = "flagged")
cc_outl(x, method = "distance", value = "flagged", tdi = 10000)
cc_outl(x, method = "distance", value = "flagged", tdi = 1000)
Removes or flags coordinates outside the reference landmass. Can be used to restrict datasets to terrestrial taxa, or to exclude records from the open ocean, depending on the reference (see details). Records of terrestrial taxa are often found in the open ocean, mostly due to switched latitude and longitude.
cc_sea( x, lon = "decimalLongitude", lat = "decimalLatitude", ref = NULL, scale = 110, value = "clean", speedup = TRUE, verbose = TRUE, buffer = NULL )
x |
data.frame. Containing geographical coordinates and species names. |
lon |
character string. The column with the longitude coordinates. Default = “decimalLongitude”. |
lat |
character string. The column with the latitude coordinates. Default = “decimalLatitude”. |
ref |
SpatVector (geometry: polygons). Providing the geographic gazetteer. Can be any SpatVector (geometry: polygons), but the structure must be identical to rnaturalearth::ne_download(scale = 110, type = 'land', category = 'physical', returnclass = 'sf'). Default = rnaturalearth::ne_download(scale = 110, type = 'land', category = 'physical', returnclass = 'sf'). |
scale |
the scale of the default reference, as downloaded from natural earth. Must be one of 10, 50, 110. Lower numbers equal higher detail (10 is the most detailed, 110 the coarsest). Default = 110. |
value |
character string. Defining the output value. See value. |
speedup |
logical. Using heuristic to speed up the analysis for large data sets with many records per location. |
verbose |
logical. If TRUE reports the name of the test and the number of records flagged. |
buffer |
numeric. Units are in meters. If provided, a buffer is created around the sea polygon, or ref provided. |
In some cases flagging records close to the coastline is not advisable,
because of the low precision of the reference dataset, minor GPS imprecision
or because a dataset might include coast or marshland species. If you only
want to flag records in the open ocean, consider using a buffered landmass
reference, e.g. buffland.
Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.
Other Coordinates: cc_aohi(), cc_cap(), cc_cen(), cc_coun(), cc_dupl(), cc_equ(), cc_gbif(), cc_inst(), cc_iucn(), cc_outl(), cc_urb(), cc_val(), cc_zero()
x <- data.frame(species = letters[1:10],
                decimalLongitude = runif(10, -30, 30),
                decimalLatitude = runif(10, -30, 30))

cc_sea(x, value = "flagged")
Removes or flags records from inside urban areas, based on a geographic gazetteer. Often records from large databases span substantial time periods (centuries) and old records might represent habitats which today are replaced by city area.
cc_urb( x, lon = "decimalLongitude", lat = "decimalLatitude", ref = NULL, value = "clean", verbose = TRUE )
x |
data.frame. Containing geographical coordinates and species names. |
lon |
character string. The column with the longitude coordinates. Default = “decimalLongitude”. |
lat |
character string. The column with the latitude coordinates. Default = “decimalLatitude”. |
ref |
a SpatVector. Providing the geographic gazetteer
with the urban areas. See details. By default
rnaturalearth::ne_download(scale = 'medium', type = 'urban_areas',
returnclass = "sf"). Can be any |
value |
character string. Defining the output value. See value. |
verbose |
logical. If TRUE reports the name of the test and the number of records flagged. |
Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.
Other Coordinates: cc_aohi(), cc_cap(), cc_cen(), cc_coun(), cc_dupl(), cc_equ(), cc_gbif(), cc_inst(), cc_iucn(), cc_outl(), cc_sea(), cc_val(), cc_zero()
## Not run:
x <- data.frame(species = letters[1:10],
                decimalLongitude = runif(100, -180, 180),
                decimalLatitude = runif(100, -90, 90))

cc_urb(x)
cc_urb(x, value = "flagged")
## End(Not run)
Removes or flags non-numeric and missing (NA) coordinates, as well as coordinates outside the valid range (lat > 90, lat < -90, lon > 180 or lon < -180).
cc_val( x, lon = "decimalLongitude", lat = "decimalLatitude", value = "clean", verbose = TRUE )
x |
data.frame. Containing geographical coordinates and species names. |
lon |
character string. The column with the longitude coordinates. Default = “decimalLongitude”. |
lat |
character string. The column with the latitude coordinates. Default = “decimalLatitude”. |
value |
character string. Defining the output value. See value. |
verbose |
logical. If TRUE reports the name of the test and the number of records flagged. |
This test is obligatory before running any further tests of CoordinateCleaner, as additional tests only run with valid coordinates.
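The validity rule itself is simple enough to sketch in base R. The helper name `is_valid_coord()` is hypothetical and the code is only an illustration of the checks listed in the description (numeric, not missing, within the latitude/longitude range), not the package's implementation.

```r
# TRUE = coordinate pair is usable, FALSE = invalid (hypothetical helper).
is_valid_coord <- function(lon, lat) {
  # Non-numeric entries such as "13W33'" or "67,09" become NA here.
  lon_num <- suppressWarnings(as.numeric(as.character(lon)))
  lat_num <- suppressWarnings(as.numeric(as.character(lat)))
  !is.na(lon_num) & !is.na(lat_num) &
    lon_num >= -180 & lon_num <= 180 &
    lat_num >= -90 & lat_num <= 90
}

lon <- c(12.5, NA, "13W33'", "67,09", 305, -45)
lat <- c(40, 10, 10, 10, 10, 95)

valid <- is_valid_coord(lon, lat)
```

Only the first pair passes here: the others fail because of a missing value, a non-numeric string, a decimal comma, a longitude above 180 and a latitude above 90, respectively.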
Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.
Other Coordinates: cc_aohi(), cc_cap(), cc_cen(), cc_coun(), cc_dupl(), cc_equ(), cc_gbif(), cc_inst(), cc_iucn(), cc_outl(), cc_sea(), cc_urb(), cc_zero()
x <- data.frame(species = letters[1:10],
                decimalLongitude = c(runif(106, -180, 180), NA, "13W33'", "67,09", 305),
                decimalLatitude = runif(110, -90, 90))

cc_val(x)
cc_val(x, value = "flagged")
Removes or flags records with either zero longitude or zero latitude, as well as records within a radius around the point at zero longitude and zero latitude. These problems are often due to erroneous data entry or geo-referencing and can lead to typical patterns of high diversity around the equator.
cc_zero( x, lon = "decimalLongitude", lat = "decimalLatitude", buffer = 0.5, value = "clean", verbose = TRUE )
x |
data.frame. Containing geographical coordinates and species names. |
lon |
character string. The column with the longitude coordinates. Default = “decimalLongitude”. |
lat |
character string. The column with the latitude coordinates. Default = “decimalLatitude”. |
buffer |
numerical. The buffer around the 0/0 point, where records should be flagged as problematic, in decimal degrees. Default = 0.5. |
value |
character string. Defining the output value. See value. |
verbose |
logical. If TRUE reports the name of the test and the number of records flagged. |
Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
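As a rough illustration of the zero-coordinate test, the logic can be written in a few lines of base R. This sketch assumes the buffer is a plain Euclidean radius in decimal degrees around 0/0; `flag_zero()` is a hypothetical helper, and the package may construct the buffer differently.

```r
# TRUE = passed, FALSE = zero coordinate or within `buffer` degrees of 0/0.
flag_zero <- function(lon, lat, buffer = 0.5) {
  on_zero_line <- lon == 0 | lat == 0            # plain zero longitude or latitude
  near_origin  <- sqrt(lon^2 + lat^2) <= buffer  # inside the radius around 0/0
  !(on_zero_line | near_origin)
}

lon <- c(0, 34.84, 0, 33.98, 0.2)
lat <- c(23.08, 0, 0, 15.98, 0.3)

passed <- flag_zero(lon, lat)
```

Of these five records only the fourth passes; the last one has no zero coordinate but lies within 0.5 degrees of the origin.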
See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.
Other Coordinates: cc_aohi(), cc_cap(), cc_cen(), cc_coun(), cc_dupl(), cc_equ(), cc_gbif(), cc_inst(), cc_iucn(), cc_outl(), cc_sea(), cc_urb(), cc_val()
x <- data.frame(species = "A",
                decimalLongitude = c(0, 34.84, 0, 33.98),
                decimalLatitude = c(23.08, 0, 0, 15.98))

cc_zero(x)
cc_zero(x, value = "flagged")
This test flags datasets where a significant fraction of records has been subject to a common degree minute to decimal degree conversion error, where the degree sign is recognized as decimal delimiter.
cd_ddmm( x, lon = "decimalLongitude", lat = "decimalLatitude", ds = "dataset", pvalue = 0.025, diff = 1, mat_size = 1000, min_span = 2, value = "clean", verbose = TRUE, diagnostic = FALSE )
x |
data.frame. Containing geographical coordinates and species names. |
lon |
character string. The column with the longitude coordinates. Default = “decimalLongitude”. |
lat |
character string. The column with the latitude coordinates. Default = “decimalLatitude”. |
ds |
a character string. The column with the dataset of each record. In
case |
pvalue |
numeric. The p-value for the one-sided t-test to flag the test as passed or not. Both pvalue and diff must be met. Default = 0.025. |
diff |
numeric. The threshold difference for the ddmm test. Indicates by which fraction the records with decimals below 0.6 must outnumber the records with decimals above 0.6. Default = 1 |
mat_size |
numeric. The size of the matrix for the binomial test. Must be changed in powers of ten (e.g. 100, 1000, 10000). Adapt to dataset size: generally 100 is better for datasets < 10000 records, 1000 is better for datasets with 10000 - 1M records. Higher values also work reasonably well for smaller datasets, therefore default = 1000. For large datasets try 10000. |
min_span |
numeric. The minimum geographic extent of datasets to be tested. Default = 2. |
value |
character string. Defining the output value. See value. |
verbose |
logical. If TRUE reports the name of the test and the number of records flagged. |
diagnostic |
logical. If TRUE plots the analyses matrix for each dataset. |
If the degree sign is recognized as decimal delimiter during coordinate
conversion, no coordinate decimals above 0.59 (59') are possible. The test
here uses a binomial test to test if a significant proportion of records in
a dataset have been subject to this problem. The test is best adjusted via
the diff argument. The lower diff, the stricter the test. The appropriate
value also scales with dataset size. Empirically, for datasets with < 5,000
unique coordinate records diff = 0.1 has proven reasonable, flagging most
datasets with >25% problematic records and all datasets with >50%
problematic records. For datasets between 5,000 and 100,000 geographically
unique records diff = 0.01 is recommended, for datasets between 100,000 and
1 M records diff = 0.001, and so on.
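The conversion error behind this test can be made concrete with a small base R sketch. The helper names (`dm_to_dd()`, `dm_misread()`, `decimal_part()`) are hypothetical and the code only illustrates why erroneous records never show decimals above 0.59; it does not reproduce the binomial test used by cd_ddmm.

```r
# Correct conversion: 10 degrees 45 minutes -> 10 + 45/60 = 10.75
dm_to_dd <- function(deg, min) deg + min / 60

# Erroneous conversion: the degree sign is read as a decimal point,
# so 10 degrees 45 minutes becomes 10.45
dm_misread <- function(deg, min) deg + min / 100

decimal_part <- function(x) abs(x) %% 1

minutes <- 0:59
correct <- dm_to_dd(10, minutes)
misread <- dm_misread(10, minutes)

# Share of records with decimals above 0.6 in each case
share_high_correct <- mean(decimal_part(correct) > 0.6)
share_high_misread <- mean(decimal_part(misread) > 0.6)
```

With correctly converted coordinates a substantial share of decimals lies above 0.6, whereas the misread coordinates have none; a dataset whose decimals are strongly concentrated below 0.6 is therefore suspicious.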
Depending on the ‘value’ argument, either a data.frame with summary statistics and flags for each dataset (“dataset”) or a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.
Other Datasets:
cd_round()
clean <- data.frame(species = letters[1:10],
                    decimalLongitude = runif(100, -180, 180),
                    decimalLatitude = runif(100, -90, 90),
                    dataset = "FR")

cd_ddmm(x = clean, value = "flagged")

#problematic dataset
lon <- sample(0:180, size = 100, replace = TRUE) + runif(100, 0, 0.59)
lat <- sample(0:90, size = 100, replace = TRUE) + runif(100, 0, 0.59)

prob <- data.frame(species = letters[1:10],
                   decimalLongitude = lon,
                   decimalLatitude = lat,
                   dataset = "FR")

cd_ddmm(x = prob, value = "flagged")
Flags datasets with periodicity patterns indicative of a rasterized (lattice) collection scheme, as often obtained from, e.g., atlas data. Uses a combination of autocorrelation and sliding-window outlier detection to identify periodicity patterns in the data. See https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13152 for further details and a description of the algorithm.
cd_round( x, lon = "decimalLongitude", lat = "decimalLatitude", ds = "dataset", T1 = 7, reg_out_thresh = 2, reg_dist_min = 0.1, reg_dist_max = 2, min_unique_ds_size = 4, graphs = TRUE, test = "both", value = "clean", verbose = TRUE )
x |
data.frame. Containing geographical coordinates and species names. |
lon |
character string. The column with the longitude coordinates. Default = “decimalLongitude”. |
lat |
character string. The column with the latitude coordinates. Default = “decimalLatitude”. |
ds |
a character string. The column with the dataset of each record. In
case |
T1 |
numeric. The threshold for outlier detection in an interquantile-range-based test. This is the major parameter to specify the sensitivity of the test: lower values give a higher detection rate. Values between 7 and 11 are recommended. Default = 7. |
reg_out_thresh |
numeric. Threshold on the number of equal distances between outlier points. See details. Default = 2. |
reg_dist_min |
numeric. The minimum detection distance between outliers in degrees (the minimum resolution of grids that will be flagged). Default = 0.1. |
reg_dist_max |
numeric. The maximum detection distance between outliers in degrees (the maximum resolution of grids that will be flagged). Default = 2. |
min_unique_ds_size |
numeric. The minimum number of unique locations (values in the tested column) for datasets to be included in the test. Default = 4. |
graphs |
logical. If TRUE, diagnostic plots are produced. Default = TRUE. |
test |
character string. Indicates which column to test. Either “lat” for latitude, “lon” for longitude, or “both” for both. In the latter case datasets are only flagged if both tests are failed. Default = “both”. |
value |
character string. Defining the output value. See value. |
verbose |
logical. If TRUE reports the name of the test and the number of records flagged. |
Depending on the ‘value’ argument, either a data.frame with summary statistics and flags for each dataset (“dataset”) or a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
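To give an intuition for the periodicity signal cd_round looks for, the following base R sketch compares the autocorrelation of binned coordinates for gridded and non-gridded data. It is not the package's algorithm (which adds sliding-window outlier detection); the bin width of 0.1 degrees, the helper names `bin_counts()` and `lag10()`, and the simulated data are assumptions made for the illustration.

```r
set.seed(1)

# Records snapped to a 1-degree grid versus freely distributed records.
gridded <- sample(0:10, 500, replace = TRUE)
free    <- runif(500, 0, 10)

# Count records in 0.1-degree bins; integers fall in the middle of a bin.
bin_counts <- function(v) {
  breaks <- seq(-0.05, 10.05, by = 0.1)
  as.vector(table(cut(v, breaks = breaks)))
}

# Autocorrelation of the bin counts at lag 10, i.e. at a 1-degree spacing.
lag10 <- function(v) {
  acf(bin_counts(v), lag.max = 10, plot = FALSE)$acf[11]
}

acf_gridded <- lag10(gridded)
acf_free    <- lag10(free)
```

The gridded coordinates produce a strong autocorrelation at the lag matching the grid resolution, while the freely distributed coordinates do not; reg_dist_min and reg_dist_max bound the range of such spacings that the actual test considers.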
See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.
Other Datasets:
cd_ddmm()
#simulate bias grid, one degree resolution, 10% error on a 1000 records dataset
#simulate biased fraction of the data, grid resolution = 1 degree
#simulate non-biased fraction of the data
bi <- sample(3 + 0:5, size = 100, replace = TRUE)
mu <- runif(3, 0, 15)
sig <- runif(3, 0.1, 5)
cl <- rnorm(n = 900, mean = mu, sd = sig)
lon <- c(cl, bi)

bi <- sample(9:13, size = 100, replace = TRUE)
mu <- runif(3, 0, 15)
sig <- runif(3, 0.1, 5)
cl <- rnorm(n = 900, mean = mu, sd = sig)
lat <- c(cl, bi)

#add biased data
inp <- data.frame(decimalLongitude = lon,
                  decimalLatitude = lat,
                  dataset = "test")

#run test
## Not run:
cd_round(inp, value = "dataset")
## End(Not run)
Removes or flags records that are temporal outliers based on interquantile ranges.
cf_age( x, lon = "decimalLongitude", lat = "decimalLatitude", min_age = "min_ma", max_age = "max_ma", taxon = "accepted_name", method = "quantile", size_thresh = 7, mltpl = 5, replicates = 5, flag_thresh = 0.5, uniq_loc = FALSE, value = "clean", verbose = TRUE )
x |
data.frame. Containing fossil records with taxon names, ages, and geographic coordinates. |
lon |
character string. The column with the longitude coordinates.
To identify unique records if |
lat |
character string. The column with the latitude coordinates.
Default = “decimalLatitude”. To identify unique records if |
min_age |
character string. The column with the minimum age. Default = “min_ma”. |
max_age |
character string. The column with the maximum age. Default = “max_ma”. |
taxon |
character string. The column with the taxon name. If “”, searches for outliers over the entire dataset, otherwise per specified taxon. Default = “accepted_name”. |
method |
character string. Defining the method for outlier selection. See details. Either “quantile” or “mad”. Default = “quantile”. |
size_thresh |
numeric. The minimum number of records needed for a dataset to be tested. Default = 7. |
mltpl |
numeric. The multiplier of the interquartile range
( |
replicates |
numeric. The number of replications for the distance matrix calculation. See details. Default = 5. |
flag_thresh |
numeric. The fraction of passed replicates necessary to pass the test. See details. Default = 0.5. |
uniq_loc |
logical. If TRUE only single records per location and time
point (and taxon if |
value |
character string. Defining the output value. See value. |
verbose |
logical. If TRUE reports the name of the test and the number of records flagged. |
The outlier detection is based on an interquantile range test. A temporal
distance matrix among all records is calculated based on a single point
selected at random between the minimum and maximum age for each record. The
mean distance of each point to all neighbours is calculated and then tested
against the interquantile range; a record is flagged as an outlier if this
value exceeds the threshold set by the interquantile range and ‘mltpl’. The
test is replicated ‘replicates’ times, to account for dating uncertainty.
Records are flagged as outliers if they are flagged by a fraction of more
than ‘flag_thresh’ replicates. Only datasets/taxa comprising more than
‘size_thresh’ records are tested. Distances are calculated as Euclidean
distances.
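The replicated temporal test described above can be sketched in base R as follows. This is a simplified illustration, not the package's code: `flag_age_outliers()` is a hypothetical helper, and the threshold is assumed to be the upper quartile plus mltpl times the interquartile range of the mean distances, since the exact formula is not given in this text.

```r
# TRUE = passed, FALSE = temporal outlier (hypothetical helper).
flag_age_outliers <- function(min_age, max_age, mltpl = 5,
                              replicates = 5, flag_thresh = 0.5) {
  n <- length(min_age)
  flagged <- replicate(replicates, {
    # One random age per record, drawn between its minimum and maximum age.
    age <- runif(n, min = min_age, max = max_age)
    d <- as.matrix(dist(age))          # Euclidean temporal distances
    diag(d) <- NA
    mean_d <- rowMeans(d, na.rm = TRUE)
    # Assumed threshold: upper quartile + mltpl * interquartile range.
    mean_d > quantile(mean_d, 0.75) + IQR(mean_d) * mltpl
  })
  # A record fails if it was flagged in more than flag_thresh of replicates.
  rowMeans(flagged) <= flag_thresh
}

set.seed(42)
min_age <- c(10, 12, 14, 15, 16, 18, 19, 20, 21, 22, 24, 62.5)
max_age <- min_age + c(rep(2, 11), 2.5)

passed <- flag_age_outliers(min_age, max_age)
```

Drawing a new random age in every replicate is what lets the test reflect dating uncertainty: a record with a wide age range may be flagged in some replicates but not in others, and flag_thresh decides how many flags are needed.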
Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.
Other fossils: cf_equal(), cf_outl(), cf_range(), write_pyrate()
minages <- c(runif(n = 11, min = 10, max = 25), 62.5)
x <- data.frame(species = c(letters[1:10], rep("z", 2)),
                min_ma = minages,
                max_ma = c(minages[1:11] + runif(n = 11, min = 0, max = 5), 65))

cf_age(x, value = "flagged", taxon = "")

# unique locations only
x <- data.frame(species = c(letters[1:10], rep("z", 2)),
                decimalLongitude = c(runif(n = 10, min = 4, max = 16), 75, 7),
                decimalLatitude = c(runif(n = 12, min = -5, max = 5)),
                min_ma = minages,
                max_ma = c(minages[1:11] + runif(n = 11, min = 0, max = 5), 65))

cf_age(x, value = "flagged", taxon = "", uniq_loc = TRUE)
Removes or flags records with equal minimum and maximum age.
cf_equal( x, min_age = "min_ma", max_age = "max_ma", value = "clean", verbose = TRUE )
x |
data.frame. Containing fossil records with taxon names, ages, and geographic coordinates. |
min_age |
character string. The column with the minimum age. Default = “min_ma”. |
max_age |
character string. The column with the maximum age. Default = “max_ma”. |
value |
character string. Defining the output value. See value. |
verbose |
logical. If TRUE reports the name of the test and the number of records flagged. |
Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.
Other fossils: cf_age(), cf_outl(), cf_range(), write_pyrate()
minages <- runif(n = 10, min = 0.1, max = 25)
x <- data.frame(species = letters[1:10],
                min_ma = minages,
                max_ma = minages + runif(n = 10, min = 0, max = 10))
x <- rbind(x, data.frame(species = "z", min_ma = 5, max_ma = 5))

cf_equal(x, value = "flagged")
Removes or flags records of fossils that are spatio-temporal outliers based on interquantile ranges. Records are flagged if they are either extreme in time or space, or both.
cf_outl( x, lon = "decimalLongitude", lat = "decimalLatitude", min_age = "min_ma", max_age = "max_ma", taxon = "accepted_name", method = "quantile", size_thresh = 7, mltpl = 5, replicates = 5, flag_thresh = 0.5, uniq_loc = FALSE, value = "clean", verbose = TRUE )
x |
data.frame. Containing fossil records with taxon names, ages, and geographic coordinates. |
lon |
character string. The column with the longitude coordinates.
To identify unique records if |
lat |
character string. The column with the latitude coordinates.
Default = “decimalLatitude”. To identify unique records if |
min_age |
character string. The column with the minimum age. Default = “min_ma”. |
max_age |
character string. The column with the maximum age. Default = “max_ma”. |
taxon |
character string. The column with the taxon name. If “”, searches for outliers over the entire dataset, otherwise per specified taxon. Default = “accepted_name”. |
method |
character string. Defining the method for outlier selection. See details. Either “quantile” or “mad”. Default = “quantile”. |
size_thresh |
numeric. The minimum number of records needed for a dataset to be tested. Default = 7. |
mltpl |
numeric. The multiplier of the interquartile range
( |
replicates |
numeric. The number of replications for the distance matrix calculation. See details. Default = 5. |
flag_thresh |
numeric. The fraction of passed replicates necessary to pass the test. See details. Default = 0.5. |
uniq_loc |
logical. If TRUE only single records per location and time
point (and taxon if |
value |
character string. Defining the output value. See value. |
verbose |
logical. If TRUE reports the name of the test and the number of records flagged. |
The outlier detection is based on an interquantile range test. In a first
step a distance matrix of geographic distances among all records is
calculated. Subsequently a similar distance matrix of temporal distances
among all records is calculated based on a single point selected at random
between the minimum and maximum age for each record. The mean distance for
each point to all neighbours is calculated for both matrices and spatial and
temporal distances are scaled to the same range. The sum of these distances
is then tested against the interquantile range; a record is flagged as an
outlier if this sum exceeds the threshold set by the interquantile range and
‘mltpl’. The test is replicated ‘replicates’ times, to account for temporal
uncertainty. Records are flagged as outliers if they are flagged by a
fraction of more than ‘flag_thresh’ replicates. Only datasets/taxa
comprising more than ‘size_thresh’ records are tested. Note that geographic
distances are calculated as geospheric distances for datasets (or taxa) with
fewer than 10,000 records and approximated as Euclidean distances for
datasets/taxa with 10,000 to 25,000 records. Datasets/taxa comprising more
than 25,000 records are skipped.
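The step that combines spatial and temporal information can be illustrated with a short base R sketch. It starts from already computed mean distances and only shows the scaling and summing; `rescale01()` and `combine_distances()` are hypothetical helpers, min-max scaling is one possible reading of "scaled to the same range", and the threshold (upper quartile plus mltpl times the interquartile range) is an assumption.

```r
# Min-max scaling to the range 0..1 (hypothetical helper).
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))

# TRUE = passed, FALSE = spatio-temporal outlier.
combine_distances <- function(mean_geo, mean_time, mltpl = 5) {
  score <- rescale01(mean_geo) + rescale01(mean_time)
  # Assumed threshold: upper quartile + mltpl * interquartile range.
  limit <- quantile(score, 0.75) + IQR(score) * mltpl
  as.vector(score <= limit)
}

# Mean geographic distance (km) and mean temporal distance (Ma) per record.
mean_geo  <- c(100, 120, 110, 130, 105, 115, 125, 5000)
mean_time <- c(5, 6, 5.5, 6.5, 5.2, 5.8, 6.2, 40)

passed <- combine_distances(mean_geo, mean_time)
```

Scaling both distance types before summing keeps kilometres from dominating millions of years (or vice versa), so a record can be flagged for being extreme in space, in time, or in both.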
Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.
Other fossils: cf_age(), cf_equal(), cf_range(), write_pyrate()
minages <- c(runif(n = 11, min = 10, max = 25), 62.5)
x <- data.frame(species = c(letters[1:10], rep("z", 2)),
                lng = c(runif(n = 10, min = 4, max = 16), 75, 7),
                lat = c(runif(n = 12, min = -5, max = 5)),
                min_ma = minages,
                max_ma = c(minages[1:11] + runif(n = 11, min = 0, max = 5), 65))

# the coordinate columns are named lng/lat here, so pass them explicitly
cf_outl(x, lon = "lng", lat = "lat", value = "flagged", taxon = "")
Removes or flags records with an unexpectedly large temporal range, based on a quantile outlier test.
cf_range( x, lon = "decimalLongitude", lat = "decimalLatitude", min_age = "min_ma", max_age = "max_ma", taxon = "accepted_name", method = "quantile", mltpl = 5, size_thresh = 7, max_range = 500, uniq_loc = FALSE, value = "clean", verbose = TRUE )
x |
data.frame. Containing fossil records with taxon names, ages, and geographic coordinates. |
lon |
character string. The column with the longitude coordinates.
To identify unique records if |
lat |
character string. The column with the latitude coordinates.
Default = “decimalLatitude”. To identify unique records if |
min_age |
character string. The column with the minimum age. Default = “min_ma”. |
max_age |
character string. The column with the maximum age. Default = “max_ma”. |
taxon |
character string. The column with the taxon name. If “”, searches for outliers over the entire dataset, otherwise per specified taxon. Default = “accepted_name”. |
method |
character string. Defining the method for outlier selection. See details. Either “quantile” or “mad”. Default = “quantile”. |
mltpl |
numeric. The multiplier of the interquartile range
( |
size_thresh |
numeric. The minimum number of records needed for a dataset to be tested. Default = 7. |
max_range |
numeric. An absolute maximum time interval between min age
and max age. Only relevant for |
uniq_loc |
logical. If TRUE only single records per location and time
point (and taxon if |
value |
character string. Defining the output value. See value. |
verbose |
logical. If TRUE reports the name of the test and the number of records flagged. |
Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.
Other fossils: cf_age(), cf_equal(), cf_outl(), write_pyrate()
minages <- runif(n = 11, min = 0.1, max = 25)
x <- data.frame(species = c(letters[1:10], "z"),
                lng = c(runif(n = 9, min = 4, max = 16), 75, 7),
                lat = c(runif(n = 11, min = -5, max = 5)),
                min_ma = minages,
                max_ma = minages + c(runif(n = 10, min = 0, max = 5), 25))

cf_range(x, value = "flagged", taxon = "")
Cleaning geographic coordinates by multiple empirical tests to flag potentially erroneous coordinates, addressing issues common in biological collection databases.
clean_coordinates( x, lon = "decimalLongitude", lat = "decimalLatitude", species = "species", countries = NULL, tests = c("capitals", "centroids", "equal", "gbif", "institutions", "outliers", "seas", "zeros"), capitals_rad = 10000, centroids_rad = 1000, centroids_detail = "both", inst_rad = 100, outliers_method = "quantile", outliers_mtp = 5, outliers_td = 1000, outliers_size = 7, range_rad = 0, zeros_rad = 0.5, capitals_ref = NULL, centroids_ref = NULL, country_ref = NULL, country_refcol = "iso_a3", country_buffer = NULL, inst_ref = NULL, range_ref = NULL, seas_ref = NULL, seas_scale = 50, seas_buffer = NULL, urban_ref = NULL, aohi_rad = NULL, value = "spatialvalid", verbose = TRUE, report = FALSE )
x |
data.frame. Containing geographical coordinates and species names. |
lon |
character string. The column with the longitude coordinates. Default = “decimalLongitude”. |
lat |
character string. The column with the latitude coordinates. Default = “decimalLatitude”. |
species |
a character string. A vector of the same length as rows in x, with the species identity for each record. If NULL, |
countries |
a character string. The column with the country assignment of each record in three-letter ISO code. Default = NULL. If missing, the countries test is skipped. |
tests |
a vector of character strings, indicating which tests to run. See details for all tests available. Default = c("capitals", "centroids", "equal", "gbif", "institutions", "outliers", "seas", "zeros") |
capitals_rad |
numeric. The radius around capital coordinates in meters. Default = 10000. |
centroids_rad |
numeric. The radius around centroid coordinates in meters. Default = 1000. |
centroids_detail |
a |
inst_rad |
numeric. The radius around biodiversity institutions coordinates in metres. Default = 100. |
outliers_method |
The method used for outlier testing. See details. |
outliers_mtp |
numeric. The multiplier for the interquartile range of the outlier test. If NULL |
outliers_td |
numeric. The minimum distance of a record to all other records of a species to be identified as outlier, in km. Default = 1000. |
outliers_size |
numerical. The minimum number of records in a dataset to run the taxon-specific outlier test. Default = 7. |
range_rad |
buffer around natural ranges. Default = 0. |
zeros_rad |
numeric. The radius around 0/0 in degrees. Default = 0.5. |
capitals_ref |
a |
centroids_ref |
a |
country_ref |
a |
country_refcol |
the column name in the reference dataset containing the relevant ISO codes for matching. Default is "iso_a3_eh", which refers to the ISO-3 codes in the reference dataset. See notes. |
country_buffer |
numeric. Units are in meters. If provided, a buffer is created around each country polygon. |
inst_ref |
a |
range_ref |
a |
seas_ref |
a |
seas_scale |
The scale of the default landmass reference. Must be one of 10, 50, 110. Lower numbers equal higher detail (10 is the most detailed). Default = 50. |
seas_buffer |
numeric. Units are in meters. If provided, a buffer is created around sea polygon. |
urban_ref |
a |
aohi_rad |
numeric. The radius around aohi coordinates in meters. Default = 1000. |
value |
a character string defining the output value. See the value section for details. One of ‘spatialvalid’, ‘summary’, ‘clean’. Default = ‘spatialvalid’. |
verbose |
logical. If TRUE reports the name of the test and the number of records flagged. |
report |
logical or character. If TRUE a report file is written to the working directory, summarizing the cleaning results. If a character, the path to which the file should be written. Default = FALSE. |
The function needs all coordinates to be formally valid according to WGS84. If the data contains invalid coordinates, the function will stop and return a vector flagging the invalid records. TRUE = non-problematic coordinate, FALSE = potentially problematic coordinates.
capitals tests a radius around adm-0 capitals. The radius is capitals_rad.
centroids tests a radius around country centroids. The radius is centroids_rad.
countries tests if coordinates are from the country indicated in the country column. Switched off by default.
duplicates tests for duplicate records. This checks for identical coordinates or, if a species vector is provided, for identical coordinates within a species. All but the first record are flagged as duplicates. Switched off by default.
equal tests for equal absolute longitude and latitude.
gbif tests a one-degree radius around the GBIF headquarters in Copenhagen, Denmark.
institutions tests a radius around known biodiversity institutions from institutions. The radius is inst_rad.
outliers tests each species for outlier records. Depending on the outliers_mtp and outliers_td arguments, it either flags records that are a minimum distance away from all other records of this species (outliers_td) or records that are outside a multiple of the interquartile range of minimum distances to the next neighbour of this species (outliers_mtp). Three different methods are available for the outlier test, selected via outliers_method. If “quantile”, a boxplot method is used and records are flagged as outliers if their mean distance to all other records of the same species is larger than outliers_mtp * the interquartile range of the mean distance of all records of this species. If “mad”, the median absolute deviation is used. In this case a record is flagged as outlier if the mean distance to all other records of the same species is larger than the median of the mean distance of all points plus/minus the mad of the mean distances of all records of the species * outliers_mtp. If “distance”, records are flagged as outliers if the minimum distance to the next record of the species is > outliers_td.
ranges tests if records fall within provided natural range polygons on a per species basis. See cc_iucn for details.
seas tests if coordinates fall into the ocean.
urban tests if coordinates are from urban areas. Switched off by default.
validity checks if coordinates correspond to a lat/lon coordinate reference system. This test is always on, since all records need to pass for any other test to run.
zeros tests for plain zeros, equal latitude and longitude and a radius around the point 0/0. The radius is zeros_rad.
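The three outlier rules described for the outliers test can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: outlier_sketch, mltpl and tdi are hypothetical names, distances are planar distances in coordinate units rather than kilometres, the “quantile” rule is written as the usual boxplot fence (third quartile plus a multiple of the interquartile range), and the result uses TRUE = outlier, which is the opposite of the TRUE = clean convention of the cleaning functions.

```r
# Hypothetical helper for illustration only.
outlier_sketch <- function(lon, lat,
                           method = c("quantile", "mad", "distance"),
                           mltpl = 5, tdi = 1000) {
  method <- match.arg(method)
  d <- as.matrix(dist(cbind(lon, lat)))  # planar distances (assumption)
  diag(d) <- NA                          # ignore distance of a record to itself

  if (method == "distance") {
    # Flag records whose nearest neighbour is further away than tdi
    nearest <- apply(d, 1, min, na.rm = TRUE)
    return(unname(nearest > tdi))
  }

  mean_d <- rowMeans(d, na.rm = TRUE)    # mean distance to all other records
  limit <- if (method == "quantile") {
    quantile(mean_d, 0.75) + mltpl * IQR(mean_d)
  } else {
    median(mean_d) + mltpl * mad(mean_d)
  }
  unname(mean_d > limit)
}

# A 5 x 4 grid of records plus one record far to the east
lon <- c(rep(4:8, times = 4), 75)
lat <- c(rep(1:4, each = 5), 0)

out_quantile <- outlier_sketch(lon, lat, method = "quantile")
out_mad      <- outlier_sketch(lon, lat, method = "mad")
out_distance <- outlier_sketch(lon, lat, method = "distance", tdi = 10)
```

In this toy data the distant record (row 21) is flagged by all three rules; real data will usually need the multiplier or distance threshold tuned per taxon.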
Depending on the value argument:
“spatialvalid”: an object of class spatialvalid similar to x with one column added for each test. TRUE = clean coordinate entry, FALSE = potentially problematic coordinate entries. The .summary column is FALSE if any test flagged the respective coordinate.
“summary”: a logical vector with the same order as the input data summarizing the results of all tests. TRUE = clean coordinate, FALSE = potentially problematic (= at least one test failed).
“clean”: a data.frame similar to x with potentially problematic records removed.
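To make the relation between the return types concrete, the following base R sketch builds two per-test flag columns by hand and derives the summary vector and the cleaned data from them. The column names equal and zeros, the toy records, and the planar treatment of the radius around 0/0 are assumptions made for illustration; the code does not call the package.

```r
# Toy records; rows 2 to 5 each trip at least one of the two tests
x <- data.frame(
  species = c("a", "a", "b", "b", "c"),
  decimalLongitude = c(10, 0, 0.2, -5, 12),
  decimalLatitude  = c(45, 30, -0.1, 5, 12)
)
lon <- x$decimalLongitude
lat <- x$decimalLatitude
zeros_rad <- 0.5  # degrees, treated as planar near 0/0 (assumption)

# TRUE = passes the test, FALSE = potentially problematic
flags <- data.frame(
  equal = abs(lon) != abs(lat),
  zeros = !(lon == 0 | lat == 0 | lon == lat |
              sqrt(lon^2 + lat^2) <= zeros_rad)
)

# A record counts as clean only if every test passed
flags$summary <- flags$equal & flags$zeros

flagged_view <- cbind(x, flags)     # analogous to value = "spatialvalid"
summary_view <- flags$summary       # analogous to value = "summary"
clean_view   <- x[flags$summary, ]  # analogous to value = "clean"
```

Keeping the per-test columns makes it possible to inspect why a record was flagged before deciding to drop it.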
Always tests for coordinate validity: non-numeric or missing coordinates and coordinates exceeding the global extent (lon/lat, WGS84). See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.
The country_refcol argument allows to adapt the function to the structure of alternative reference datasets. For instance, for rnaturalearth::ne_countries(scale = "small", returnclass = "sf"), the default will fail, but country_refcol = "iso_a3" will work.
Other Wrapper functions: clean_dataset(), clean_fossils()
exmpl <- data.frame(species = sample(letters, size = 250, replace = TRUE),
                    decimalLongitude = runif(250, min = 42, max = 51),
                    decimalLatitude = runif(250, min = -26, max = -11))

test <- clean_coordinates(x = exmpl, tests = c("equal"))

## Not run:
# run more tests
test <- clean_coordinates(x = exmpl,
                          tests = c("capitals", "centroids", "equal", "gbif",
                                    "institutions", "outliers", "seas", "zeros"))

## End(Not run)

summary(test)
Tests for problems associated with coordinate conversions and rounding, based on dataset properties. Includes tests to identify contributing datasets with potential errors in converting ddmm to dd.dd, and periodicity in the data decimals indicating rounding or a raster basis linked to low coordinate precision. Specifically:
ddmm tests for erroneous conversion from a degree minute format (ddmm) to a decimal degree (dd.dd) format.
periodicity tests for periodicity in the data, which can indicate imprecise coordinates due to rounding or rasterization.
clean_dataset(
  x,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  ds = "dataset",
  tests = c("ddmm", "periodicity"),
  value = "dataset",
  verbose = TRUE,
  ...
)
x |
data.frame. Containing geographical coordinates and species names. |
lon |
character string. The column with the longitude coordinates. Default = “decimalLongitude”. |
lat |
character string. The column with the latitude coordinates. Default = “decimalLatitude”. |
ds |
a character string. The column with the dataset of each record. In case |
tests |
a vector of character strings, indicating which tests to run. See details for all tests available. Default = c("ddmm", "periodicity") |
value |
a character string. Defining the output value. See value. Default = “dataset”. |
verbose |
logical. If TRUE reports the name of the test and the number of records flagged. |
... |
additional arguments to be passed to |
These tests are based on the statistical distribution of coordinates and their decimals within datasets of geographic distribution records to identify datasets with potential errors/biases. Three potential error sources can be identified.
The ddmm flag tests for the particular pattern that emerges if geographical coordinates in a degree minute annotation are transferred into decimal degrees by simply replacing the degree symbol with the decimal point. This kind of problem has been observed in older datasets first recorded on paper using typewriters, where, for example, a decimal point was used as the symbol for degrees. The function uses a binomial test to check if more records than expected have decimals below 0.6 (which is the maximum that can be obtained in minutes, as one degree has 60 minutes) and if the number of these records is higher than those above 0.59 by a certain proportion.
The periodicity test uses rate estimation in a Poisson process to estimate if there is periodicity in the decimals of a dataset (as would be expected from, for example, rounding or data that was collected in a raster format) and if there is a disproportionate number of records with the decimal 0 (full degrees), which indicates rounding and thus low precision.
The default values are empirically optimized with GBIF data, but should probably be adapted.
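The reasoning behind the ddmm test can be illustrated with a short base R sketch. It is a simplified stand-in, not the package's implementation: ddmm_sketch and its alpha argument are hypothetical, and the sketch only applies a one-sided binomial test to the share of decimals below 0.6, assuming that decimals of correctly recorded coordinates are roughly uniform (so about 60% fall below 0.6).

```r
# Hypothetical helper for illustration only.
ddmm_sketch <- function(coords, alpha = 0.025) {
  dec <- abs(coords) %% 1          # decimal part of each coordinate
  n_below <- sum(dec < 0.6)
  # Under uniformly distributed decimals, about 60% fall below 0.6
  bt <- binom.test(n_below, length(dec), p = 0.6, alternative = "greater")
  list(prop_below = n_below / length(dec),
       p_value = bt$p.value,
       flagged = bt$p.value < alpha)
}

set.seed(1)
# Degree-minute values stored as if they were decimal degrees,
# e.g. 41 degrees 35 minutes written as 41.35
deg  <- sample(40:42, 1000, replace = TRUE)
mins <- sample(0:59, 1000, replace = TRUE)
converted <- -(deg + mins / 100)

# Correctly recorded decimal degrees
clean <- runif(1000, min = -43, max = -40)

res_converted <- ddmm_sketch(converted)
res_clean     <- ddmm_sketch(clean)
```

In the wrongly converted data every decimal is at most 0.59, so the share below 0.6 is 1 and the dataset is flagged; for the uniformly drawn coordinates the share should stay close to 0.6.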
Depending on the ‘value’ argument:
“dataset”: a data.frame with the test summary statistics for each dataset in x.
“clean”: a data.frame containing only records from datasets in x that passed the tests.
“flagged”: a logical vector of the same length as rows in x, with TRUE = test passed and FALSE = test failed/potentially problematic.
See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.
Other Wrapper functions: clean_coordinates(), clean_fossils()
# Create test dataset
clean <- data.frame(dataset = rep("clean", 1000),
                    decimalLongitude = runif(min = -43, max = -40, n = 1000),
                    decimalLatitude = runif(min = -13, max = -10, n = 1000))

bias.long <- c(round(runif(min = -42, max = -40, n = 500), 1),
               round(runif(min = -42, max = -40, n = 300), 0),
               runif(min = -42, max = -40, n = 200))
bias.lat <- c(round(runif(min = -12, max = -10, n = 500), 1),
              round(runif(min = -12, max = -10, n = 300), 0),
              runif(min = -12, max = -10, n = 200))
bias <- data.frame(dataset = rep("biased", 1000),
                   decimalLongitude = bias.long,
                   decimalLatitude = bias.lat)
test <- rbind(clean, bias)

## Not run:
# run clean_dataset
flags <- clean_dataset(test)

# check problems
# clean
hist(test[test$dataset == rownames(flags[flags$summary,]), "decimalLongitude"])
# biased
hist(test[test$dataset == rownames(flags[!flags$summary,]), "decimalLongitude"])

## End(Not run)
Cleans records using multiple empirical tests that flag potentially erroneous coordinates and time-spans, addressing issues common in fossil collection databases. Individual tests can be activated via the tests argument.
clean_fossils(
  x,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  min_age = "min_ma",
  max_age = "max_ma",
  taxon = "accepted_name",
  tests = c("agesequal", "centroids", "equal", "gbif", "institutions", "spatiotemp",
    "temprange", "validity", "zeros"),
  countries = NULL,
  centroids_rad = 0.05,
  centroids_detail = "both",
  inst_rad = 0.001,
  outliers_method = "quantile",
  outliers_threshold = 5,
  outliers_size = 7,
  outliers_replicates = 5,
  zeros_rad = 0.5,
  centroids_ref = NULL,
  country_ref = NULL,
  inst_ref = NULL,
  value = "spatialvalid",
  verbose = TRUE,
  report = FALSE
)
x |
data.frame. Containing fossil records with taxon names, ages, and geographic coordinates. |
lon |
character string. The column with the longitude coordinates. Default = “decimalLongitude”. |
lat |
character string. The column with the latitude coordinates. Default = “decimalLatitude”. |
min_age |
character string. The column with the minimum age. Default = “min_ma”. |
max_age |
character string. The column with the maximum age. Default = “max_ma”. |
taxon |
character string. The column with the taxon name. If “”, searches for outliers over the entire dataset, otherwise per specified taxon. Default = “accepted_name”. |
tests |
vector of character strings, indicating which tests to run. See details for all tests available. Default = c("agesequal", "centroids", "equal", "gbif", "institutions", "spatiotemp", "temprange", "validity", "zeros") |
countries |
a character string. The column with the country assignment of each record in three-letter ISO code. Default = NULL. If missing, the countries test is skipped. |
centroids_rad |
numeric. The radius around centroid coordinates in meters. Default = 0.05. |
centroids_detail |
a |
inst_rad |
numeric. The radius around biodiversity institutions coordinates in metres. Default = 0.001. |
outliers_method |
The method used for outlier testing. See details. |
outliers_threshold |
numerical. The multiplier for the interquartile range for outlier detection. The higher the number, the more conservative the outlier tests. See |
outliers_size |
numerical. The minimum number of records in a dataset to run the taxon-specific outlier test. Default = 7. |
outliers_replicates |
numeric. The number of replications for the distance matrix calculation. See details. Default = 5. |
zeros_rad |
numeric. The radius around 0/0 in degrees. Default = 0.5. |
centroids_ref |
a |
country_ref |
a |
inst_ref |
a |
value |
a character string defining the output value. See the value section for details. One of ‘spatialvalid’, ‘summary’, ‘clean’. Default = ‘spatialvalid’. |
verbose |
logical. If TRUE reports the name of the test and the number of records flagged. |
report |
logical or character. If TRUE a report file is written to the working directory, summarizing the cleaning results. If a character, the path to which the file should be written. Default = FALSE. |
agesequal tests for equal minimum and maximum age.
centroids tests a radius around country centroids. The radius is centroids_rad.
countries tests if coordinates are from the country indicated in the country column. Switched off by default.
equal tests for equal absolute longitude and latitude.
gbif tests a one-degree radius around the GBIF headquarters in Copenhagen, Denmark.
institutions tests a radius around known biodiversity institutions from institutions. The radius is inst_rad.
spatiotemp tests for records that are outliers in time and space. See below for details.
temprange tests for records with unexpectedly large temporal ranges, using a quantile-based outlier test.
validity checks if coordinates correspond to a lat/lon coordinate reference system. This test is always on, since all records need to pass for any other test to run.
zeros tests for plain zeros, equal latitude and longitude and a radius around the point 0/0. The radius is zeros_rad.
The outlier detection in ‘spatiotemp’ is based on an interquartile range test. In a first step, a distance matrix of geographic distances among all records is calculated. Subsequently, a similar distance matrix of temporal distances among all records is calculated, based on a single point selected at random between the minimum and maximum age for each record. The mean distance for each point to all neighbours is calculated for both matrices, and spatial and temporal distances are scaled to the same range. The sum of these distances is then tested against the interquartile range, and a record is flagged as an outlier if the sum exceeds a threshold derived from the interquartile range and the multiplier outliers_threshold. The test is replicated ‘replicates’ times (outliers_replicates), to account for temporal uncertainty. Records are flagged as outliers if they are flagged by a fraction of more than ‘flag_thresh’ replicates. Only datasets/taxa comprising more than ‘size.thresh’ records (outliers_size) are tested. Note that geographic distances are calculated as geospheric distances for datasets (or taxa) with fewer than 10,000 records and approximated as Euclidean distances for datasets/taxa with 10,000 to 25,000 records. Datasets/taxa comprising more than 25,000 records are skipped.
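The steps of the ‘spatiotemp’ procedure can be sketched in base R as below. This is a rough illustration under stated assumptions, not the package's implementation: spatiotemp_sketch and its arguments are hypothetical, planar distances stand in for geospheric distances, the outlier threshold is assumed to be the third quartile plus mltpl times the interquartile range, and the result uses TRUE = outlier. The sketch also assumes that mean distances differ between records, so the rescaling does not divide by zero.

```r
# Hypothetical helper for illustration only.
spatiotemp_sketch <- function(lon, lat, min_age, max_age,
                              mltpl = 5, replicates = 5, flag_thresh = 0.5) {
  n <- length(lon)
  rescale <- function(v) (v - min(v)) / (max(v) - min(v))

  # Mean spatial distance of each record to all records (planar assumption).
  # The zero self-distance only rescales the mean, which rescale() removes.
  sp_mean <- rowMeans(as.matrix(dist(cbind(lon, lat))))

  flagged <- replicate(replicates, {
    # One random age per record, drawn between its minimum and maximum age
    age <- runif(n, min = min_age, max = max_age)
    tm_mean <- rowMeans(as.matrix(dist(age)))
    score <- rescale(sp_mean) + rescale(tm_mean)
    score > quantile(score, 0.75) + mltpl * IQR(score)
  })

  # Outlier if flagged in more than flag_thresh of the replicates
  unname(rowMeans(flagged) > flag_thresh)
}

set.seed(42)
minages <- runif(30, min = 0.1, max = 25)
lon <- c(runif(30, min = 4, max = 16), 150)
lat <- c(runif(30, min = -5, max = 5), 60)
min_ma <- c(minages, 500)
max_ma <- c(minages + runif(30, min = 0, max = 5), 510)

out <- spatiotemp_sketch(lon, lat, min_ma, max_ma)
```

The last record is far from the others in both space and time, so it is flagged in every replicate; records with wide age ranges may be flagged in only some replicates, which is what the replicate fraction is meant to absorb.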
Depending on the value argument:
“spatialvalid”: an object of class spatialvalid similar to x with one column added for each test. TRUE = clean coordinate entry, FALSE = potentially problematic coordinate entries. The .summary column is FALSE if any test flagged the respective coordinate.
“summary”: a logical vector with the same order as the input data summarizing the results of all tests. TRUE = clean coordinate, FALSE = potentially problematic (= at least one test failed).
“clean”: a data.frame similar to x with potentially problematic records removed.
Always tests for coordinate validity: non-numeric or missing coordinates and coordinates exceeding the global extent (lon/lat, WGS84).
See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.
Other Wrapper functions: clean_coordinates(), clean_dataset()
minages <- runif(250, 0, 65)
exmpl <- data.frame(accepted_name = sample(letters, size = 250, replace = TRUE),
                    decimalLongitude = runif(250, min = 42, max = 51),
                    decimalLatitude = runif(250, min = -26, max = -11),
                    min_ma = minages,
                    max_ma = minages + runif(250, 0.1, 65))

test <- clean_fossils(x = exmpl)

summary(test)
A data.frame with coordinates of country and province centroids and country capitals as reference for the clean_coordinates, cc_cen and cc_cap functions. Coordinates are based on the Central Intelligence Agency World Factbook https://www.cia.gov/the-world-factbook/, https://thematicmapping.org/downloads/world_borders.php and geolocate https://geo-locate.org.
A data frame with 5,305 observations on 13 variables.
ISO-3 code for each country, in case of provinces also referring to the country.
ISO-2 code for each country, in case of provinces also referring to the country.
adm code for countries and provinces.
a factor; name of the country or province.
identifying if the entry refers to a country or province level.
Longitude of the country centroid.
Latitude of the country centroid.
Name of the country capital, empty for provinces.
Longitude of the country capital.
Latitude of the country capital.
The area of the country or province.
The uncertainty of the country centroid.
The data source. Currently only available for https://geo-locate.org
CENTRAL INTELLIGENCE AGENCY (2014) The World Factbook, Washington, DC.
https://www.cia.gov/the-world-factbook/ https://thematicmapping.org/downloads/world_borders.php https://geo-locate.org
data(countryref)
head(countryref)
A global gazetteer for biodiversity institutions from various sources, including zoos, museums, botanical gardens, GBIF contributors, herbaria, university collections.
A data frame with 12170 observations on 12 variables.
Compiled from various sources:
Global Biodiversity Information Facility https://www.gbif.org/
Wikipedia https://www.wikipedia.org/
Geonames https://www.geonames.org/
The Global Registry of Biodiversity Repositories
Index Herbariorum https://sweetgum.nybg.org/science/ih/
Botanic Gardens Conservation International https://www.bgci.org/
data(institutions)
str(institutions)
Test if its argument is a spatialvalid object
is.spatialvalid(x)
x |
the object to be tested |
Returns TRUE if its argument is a spatialvalid object.
A dataset of 5000 flowering plant fossil occurrences as an example of data from the Paleobiology Database, downloaded using the paleobioDB package as specified in the vignette “Cleaning_PBDB_fossils_with_CoordinateCleaner”.
A data frame with 5000 observations on 36 variables.
The Paleobiology database https://paleobiodb.org/
Sara Varela, Javier Gonzalez Hernandez and Luciano Fabris Sgarbi (2016). paleobioDB: Download and Process Data from the Paleobiology Database. R package version 0.5.0. https://CRAN.R-project.org/package=paleobioDB.
data(institutions)
str(institutions)
A set of plots to explore objects of the class spatialvalid. A plot to visualize the flags from clean_coordinates.
## S3 method for class 'spatialvalid'
plot(
  x,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  bgmap = NULL,
  clean = TRUE,
  details = FALSE,
  pts_size = 1,
  font_size = 10,
  zoom_f = 0.1,
  ...
)
x |
an object of the class |
lon |
character string. The column with the longitude coordinates. Default = “decimalLongitude”. |
lat |
character string. The column with the latitude coordinates. Default = “decimalLatitude”. |
bgmap |
an object of the class |
clean |
logical. If TRUE, non-flagged coordinates are included in the map. |
details |
logical. If TRUE, occurrences are color-coded by the type of flag. |
pts_size |
numeric. The point size for the plot. |
font_size |
numeric. The font size for the legend and axes |
zoom_f |
numeric. The fraction by which to expand the plotting area from the occurrence records. Increase if countries do not show up on the background map. |
... |
arguments to be passed to methods. |
A plot of the records flagged as potentially erroneous by clean_coordinates.
exmpl <- data.frame(species = sample(letters, size = 250, replace = TRUE),
                    decimalLongitude = runif(250, min = 42, max = 51),
                    decimalLatitude = runif(250, min = -26, max = -11))

test <- clean_coordinates(exmpl,
                          species = "species",
                          tests = c("seas", "gbif", "zeros"),
                          verbose = FALSE)

summary(test)
plot(test)
Creates the input necessary to run PyRate, based on a data.frame with fossil ages (as derived e.g. from clean_fossils) and a vector of the extinction status for each sample. Creates files in the working directory!
write_pyrate(
  x,
  status,
  fname,
  taxon = "accepted_name",
  min_age = "min_ma",
  max_age = "max_ma",
  trait = NULL,
  path = getwd(),
  replicates = 1,
  cutoff = NULL,
  random = TRUE
)
x |
data.frame. Containing fossil records with taxon names, ages, and geographic coordinates. |
status |
a vector of character strings of length |
fname |
a character string. The prefix to use for the output files. |
taxon |
character string. The column with the taxon name. Default = “accepted_name”. |
min_age |
character string. The column with the minimum age. Default = “min_ma”. |
max_age |
character string. The column with the maximum age. Default = “max_ma”. |
trait |
a numeric vector of length |
path |
a character string giving the absolute path to write the output files. Default is the working directory. |
replicates |
a numerical. The number of replicates for the randomized age generation. See details. Default = 1. |
cutoff |
a numerical. Specify a threshold to exclude fossil occurrences with a high temporal uncertainty, i.e. with a wide temporal range between min_age and max_age. Examples: cutoff=NULL (default; all occurrences are kept in the data set) cutoff=5 (all occurrences with a temporal range of 5 Myr or higher are excluded from the data set) |
random |
logical. Specify whether to take a random age (between MinT and MaxT) for each occurrence or the midpoint age. Note that this option defaults to TRUE if several replicates are generated (i.e. replicates > 1). Examples: random = TRUE (default) random = FALSE (use midpoint ages) |
The replicates option allows the user to generate several replicates of the data set in a single input file, each time re-drawing the ages of the occurrences at random from uniform distributions with boundaries MinT and MaxT. The replicates can be analysed in different runs (see PyRate command -j), and combining the results of these replicates is a way to account for the uncertainty of the true ages of the fossil occurrences. Examples: replicates=1 (default, generates 1 data set), replicates=10 (generates 10 random replicates of the data set).
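How the cutoff, random and replicates options interact can be sketched in base R as follows. This is only an illustration of the age-drawing logic described in the argument documentation: draw_ages is a hypothetical helper, it writes no files, and it is not taken from write_pyrate.

```r
# Hypothetical helper for illustration only.
draw_ages <- function(min_age, max_age, replicates = 1, cutoff = NULL,
                      random = TRUE) {
  # Drop occurrences whose temporal range is cutoff Myr or wider
  keep <- rep(TRUE, length(min_age))
  if (!is.null(cutoff)) keep <- (max_age - min_age) < cutoff
  lo <- min_age[keep]
  hi <- max_age[keep]

  # Several replicates always use random draws
  if (replicates > 1) random <- TRUE

  ages <- if (random) {
    # Uniform draw between minimum and maximum age, once per replicate
    replicate(replicates, runif(length(lo), min = lo, max = hi))
  } else {
    (lo + hi) / 2  # midpoint ages
  }
  # One row per retained occurrence, one column per replicate
  matrix(ages, nrow = length(lo))
}

min_ma <- c(0, 10, 20)
max_ma <- c(2, 30, 21)

set.seed(7)
reps <- draw_ages(min_ma, max_ma, replicates = 10, cutoff = 5)
mid  <- draw_ages(min_ma, max_ma, random = FALSE)
```

With cutoff = 5 the second occurrence (a 20 Myr range) is dropped, so reps has two rows and ten columns; mid holds the midpoint ages of all three occurrences.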
PyRate input files in the working directory.
See https://github.com/dsilvestro/PyRate/wiki for more details and tutorials on PyRate and PyRate input.
Other fossils: cf_age(), cf_equal(), cf_outl(), cf_range()
minages <- runif(250, 0, 65)
exmpl <- data.frame(accepted_name = sample(letters, size = 250, replace = TRUE),
                    lng = runif(250, min = 42, max = 51),
                    lat = runif(250, min = -26, max = -11),
                    min_ma = minages,
                    max_ma = minages + runif(250, 0.1, 65))

# a vector with the status for each record,
# make sure species are only classified as either extinct or extant,
# otherwise the function will throw an error
status <- sample(c("extinct", "extant"), size = nrow(exmpl), replace = TRUE)

# or from a list of species
status <- sample(c("extinct", "extant"), size = length(letters), replace = TRUE)
names(status) <- letters
status <- status[exmpl$accepted_name]

## Not run:
write_pyrate(x = exmpl, fname = "test", status = status)

## End(Not run)