Appendix 2: NuSEDS Data Processing
Below is a summary of NuSEDS cleaning procedure and attribution of the data to Conservation Units (CUs) as defined in the Pacific Salmon Explorer (PSE). The complete details of the cleaning procedure can be found at 1_nuseds_collation.html and the corresponding Rmd script can be accessed from github/…/spawner-surveys/code/1_nuseds_collation.rmd.
The detailed procedure to match the cleaned NuSEDS data to CUs is available at 2_nuseds_cuid_pse.html and the corresponding Rmd script can be downloaded from github/…/spawner-surveys/code/2_nuseds_cuid_pse.rmd.
The different versions of the dataset can be downloaded from zenodo/…/2_nuseds_cuid_streamid.
Cleaning Procedure
The objective of the procedure is to obtain the yearly counts of each salmon population in their respective stream and associate these populations to their corresponding conservation unit (CU). The NuSEDS data is separated into two datasets. The all_areas_nuseds dataset contains the observed yearly counts (related fields: NATURAL_ADULT_SPAWNERS
, NATURAL_SPAWNERS_TOTAL
, etc.) for each population (related fields: SPECIES
, POP_ID
, POPULATION
) in their respective site (related fields: AREA
, GFE_ID
, WATERBODY
, GAZETTED_NAME
, etc.), along with the associated methods used (related fields: ESTIMATE_METHOD
, ESTIMATE_CLASSIFICATION
, ENUMERATION_METHODS
).
We first define the unique yearly count field MAX_ESTIMATE
for each population as the maximum value of the count-related fields in all_areas_nuseds, i.e., NATURAL_ADULT_SPAWNERS
, NATURAL_JACK_SPAWNERS
, NATURAL_SPAWNERS_TOTAL
, ADULT_BROODSTOCK_REMOVALS
, JACK_BROODSTOCK_REMOVALS
, TOTAL_BROODSTOCK_REMOVALS
, OTHER_REMOVALS
and TOTAL_RETURN_TO_RIVER
. MAX_ESTIMATE
is the only count-related field we use in the rest of the procedure. A population’s MAX_ESTIMATE
data points is referred to as its “time series”.
The second dataset, conservation_unit_system_sites, links each population (related fields: POP_ID
) to their respective CU (related fields: CU_NAME
, FULL_CU_IN
, CU_LONGT
, CU_LAT
, etc.) and site (related fields: GFE_ID
, SYSTEM_SITE
, X_LONGT
, Y_LAT
). Ideally, attributing each time series in all_areas_nuseds its corresponding CU and location’s coordinates in conservation_unit_system_sites would simply consist in merging the two datasets using the population and location identification number POP_ID
and GFE_ID
, respectively. Unfortunately, numerous time series are problematic, which occurs when:
A time series is present in all_areas_nuseds but its
POP_ID
andGFE_ID
association is absent in conservation_unit_system_sites (there are 4428 populations in that case).Multiple time series of the same population (i.e. same
POP_ID
) are observed in multiple locations (i.e. differentGFE_IDs
), which should not occur because aPOP_ID
should be defined for a unique location.Multiple populations (i.e. different
POP_ID
) of a same CU are observed in the same location (i.e. sameGFE_ID
), suggesting that these populations should form one unique population.
The observation of problematic time series revealed inconsistencies such as missing, duplicated, and conflicting data points (i.e. different counts in the same year). In order to rescue as much data as possible, we attempted to solve these issues by either (i) deleting, (ii) combining or (iii) summing a problematic series to an alternative series (i.e., a time series present in all_areas_nuseds that shares either the same POP_ID
or GFE_ID
and species), or (iv) creating a new reference to the time series in conservation_unit_system_sites. In each case, we used all the information available in the different fields and made assumptions based on our professional judgment. Below are more details about the different interventions applied to problematic time series:
We deleted a problematic time series (or a portion of it) from all_areas_nuseds when (1) it was a duplicate of another time series, (2) it had less than four data points and was not spotted during the cleaning process, or (3) the lack of information prevented to combine or merge it to an alternative series or to create a new series in conservation_unit_system_sites (e.g. if run timing was unknown).
We combined a problematic time series with an alternative series when data points were not in conflict (i.e. different value for a same year) and the information available in the other fields was not prescriptive (e.g. the population belong to the same CU, the locations have similar names or spatial coordinates, the runtimings are not different).
We summed a problematic time series to an alternative series having different
POP_ID
but sameGFE_ID
andCU_NAME
and with conflicting data points.We created a new reference of the problematic time series in conservation_unit_system_sites (i.e. a new row with a unique
POP_ID
-GFE_ID
combination) when it could not be deleted, merged or combined to an alternative series.
The diverse and idiosyncratic nature of these problematic cases prevented the establishment of a systematic and simple cleaning process. Many of these cases were resolved individually by observing and comparing time series and accounting for the information available. We proceeded in consecutive steps in which we applied the different types of intervention mentioned above, either after visually inspecting individual time series or by applying a systematic procedure. These steps are detailed below:
Remove the time series (i.e. unique
POP_ID
-GFE_ID
associations) with only NAs values forMAX_ESTIMATE
in all_areas_nuseds (corresponding to 4424 time series, 101,944 rows or 25% of all_areas_nuseds)Assess all the references to time series present in conservation_unit_system_sites but absent in all_areas_nuseds (247 out of 714 time series in all_areas_nuseds are not referenced in conservation_unit_system_sites)
2.a. Cases with presence of alternative series in all_areas_nuseds (n = 3)
- Each case was assessed visually.
2.b. Cases with no alternative time series in all_areas_nuseds (n = 243)
- The rows corresponding to these time series are removed from conservation_unit_system_sites (242 of the 243 references correspond to time series in all_areas_nuseds removed in step 1. above because they contained only NAs).
2.c. Case with multiple
GFE_ID
for a singlePOP_ID
(n = 1)- The problematic time series was simply removed from conservation_unit_system_sites.
- Case of time series with the same
POP_ID
but differentGFE_ID
s in conservation_unit_system_sites and present in all_areas_nuseds (n = 1).
- Two time series had the same
POP_ID
but differentGFE_ID
. Both time series were kept because they are not compatible and have many and recent data points.
- Assess all the time series present in all_areas_nuseds but absent in conservation_unit_system_sites (n = 242)
4.a. Cases of series with alternative
POP_ID
and no alternativeGFE_ID
(n = 126)- Each case was assessed visually.
4.b. Cases of series with alternative
GFE_ID
with or without alternativePOP_ID
(n = 64)- Problematic series with the same
POP_ID
s (e.g. series withPOP_ID
= 47925 andGFE_ID
= 15, 31516 and 31740, respectively) having the same alternative series (e.g. series withPOP_ID
= 47925 andGFE_ID
= 14) are assessed together, resulting in 37 groups of time series to assess separately.
- Problematic series with the same
4.c. Cases with no alternative series (n = 166)
We attempted to associate these time series to a CU using the fields
X_LONGT
andY_LAT
to intersect them with the CUs’ shapefiles used in the PSE.Time series without coordinates (or for which coordinates could not be found) were removed from all_areas_nuseds (n = 5).
The reference of the rest of the time series was added to conservation_unit_system_sites
Merge all_areas_nuseds and conservation_unit_system_sites using the fields
POP_ID
andGFE_ID
.Assess time series with different
POP_ID
but a sameCU_NAME
andGFE_ID
(n = 126).
Case 1: there is only one data point in one of the series; if it is compatible, then merge to the alternative series, otherwise remove it from all_areas_nuseds.
Case 2: One of the series is 100% duplicated with the other one: to remove from all_areas_nuseds.
Case 3: the rest of the series are merged; adding the values when data points are in conflict.
Attribution of the geographic regions to CUs
We matched the cleaned NuSEDS data with the CUs and streams defined in the Pacific Salmon Explorer, which are associated with a given region (PSE). We used the fields POP_ID
, FULL_C_IN
and CU_NAME
to match the CUs and the fields SYSTEM_SITE
, X_LONGT
and Y_LAT
to match the streams. There are 23 CU_NAME
not matching to any CUs in the PSE, corresponding to 87 populations and 2256 data points. These time series are kept in the dataset 2_nuseds_cuid_streamid.csv but do not appear in the PSE.
Estimation of population counts
The observed count (or MAX_ESTIMATE
in the cleaned dataset) for each population is the maximum value of the following fields in all_areas_nuseds: NATURAL_ADULT_SPAWNERS
, NATURAL_JACK_SPAWNERS
, NATURAL_SPAWNERS_TOTAL
, ADULT_BROODSTOCK_REMOVALS
, JACK_BROODSTOCK_REMOVALS
, TOTAL_BROODSTOCK_REMOVALS
, OTHER_REMOVALS
and TOTAL_RETURN_TO_RIVER
.
Coordinates adjustment
There are 41 locations (with a unique GFE_ID) with identical coordinates in conservation_unit_system_sites. This is because the Y_LAT
and X_LONGT
field does not corresponds to the survey locations but to the “location of the mouth of the waterbody if flowing, or the centroid if not”. Consequently, all locations within a water body have the same coordinates. In large rivers (e.g. Thompson river), such imprecision can shift the coordinates 100s of km away from the true assessment location. We consequently manually adjusted the coordiantes of certain locations to prevent duplication and to improve spatial accuracy.