Appendix 2: NuSEDS Data Processing

Below is a summary of NuSEDS cleaning procedure and attribution of the data to Conservation Units (CUs) as defined in the Pacific Salmon Explorer (PSE). The complete details of the cleaning procedure can be found at 1_nuseds_collation.html and the corresponding Rmd script can be accessed from github/…/spawner-surveys/code/1_nuseds_collation.rmd.

The detailed procedure to match the cleaned NuSEDS data to CUs is available at 2_nuseds_cuid_pse.html and the corresponding Rmd script can be downloaded from github/…/spawner-surveys/code/2_nuseds_cuid_pse.rmd.

The different versions of the dataset can be downloaded from zenodo/…/2_nuseds_cuid_streamid.

Cleaning Procedure

The objective of the procedure is to obtain the yearly counts of each salmon population in their respective stream and associate these populations to their corresponding conservation unit (CU). The NuSEDS data is separated into two datasets. The all_areas_nuseds dataset contains the observed yearly counts (related fields: NATURAL_ADULT_SPAWNERS, NATURAL_SPAWNERS_TOTAL, etc.) for each population (related fields: SPECIES, POP_ID, POPULATION) in their respective site (related fields: AREA, GFE_ID, WATERBODY, GAZETTED_NAME, etc.), along with the associated methods used (related fields: ESTIMATE_METHOD, ESTIMATE_CLASSIFICATION, ENUMERATION_METHODS).

We first define the unique yearly count field MAX_ESTIMATE for each population as the maximum value of the count-related fields in all_areas_nuseds, i.e., NATURAL_ADULT_SPAWNERS, NATURAL_JACK_SPAWNERS, NATURAL_SPAWNERS_TOTAL, ADULT_BROODSTOCK_REMOVALS, JACK_BROODSTOCK_REMOVALS, TOTAL_BROODSTOCK_REMOVALS, OTHER_REMOVALS and TOTAL_RETURN_TO_RIVER. MAX_ESTIMATE is the only count-related field we use in the rest of the procedure. A population’s MAX_ESTIMATE data points is referred to as its “time series”.

The second dataset, conservation_unit_system_sites, links each population (related fields: POP_ID) to their respective CU (related fields: CU_NAME, FULL_CU_IN, CU_LONGT, CU_LAT, etc.) and site (related fields: GFE_ID, SYSTEM_SITE, X_LONGT, Y_LAT). Ideally, attributing each time series in all_areas_nuseds its corresponding CU and location’s coordinates in conservation_unit_system_sites would simply consist in merging the two datasets using the population and location identification number POP_ID and GFE_ID, respectively. Unfortunately, numerous time series are problematic, which occurs when:

A time series is present in all_areas_nuseds but its POP_ID and GFE_ID association is absent in conservation_unit_system_sites (there are 4428 populations in that case).
Multiple time series of the same population (i.e. same POP_ID) are observed in multiple locations (i.e. different GFE_IDs), which should not occur because a POP_ID should be defined for a unique location.
Multiple populations (i.e. different POP_ID) of a same CU are observed in the same location (i.e. same GFE_ID), suggesting that these populations should form one unique population.

The observation of problematic time series revealed inconsistencies such as missing, duplicated, and conflicting data points (i.e. different counts in the same year). In order to rescue as much data as possible, we attempted to solve these issues by either (i) deleting, (ii) combining or (iii) summing a problematic series to an alternative series (i.e., a time series present in all_areas_nuseds that shares either the same POP_ID or GFE_ID and species), or (iv) creating a new reference to the time series in conservation_unit_system_sites. In each case, we used all the information available in the different fields and made assumptions based on our professional judgment. Below are more details about the different interventions applied to problematic time series:

We deleted a problematic time series (or a portion of it) from all_areas_nuseds when (1) it was a duplicate of another time series, (2) it had less than four data points and was not spotted during the cleaning process, or (3) the lack of information prevented to combine or merge it to an alternative series or to create a new series in conservation_unit_system_sites (e.g. if run timing was unknown).
We combined a problematic time series with an alternative series when data points were not in conflict (i.e. different value for a same year) and the information available in the other fields was not prescriptive (e.g. the population belong to the same CU, the locations have similar names or spatial coordinates, the runtimings are not different).
We summed a problematic time series to an alternative series having different POP_ID but same GFE_ID and CU_NAME and with conflicting data points.
We created a new reference of the problematic time series in conservation_unit_system_sites (i.e. a new row with a unique POP_ID - GFE_ID combination) when it could not be deleted, merged or combined to an alternative series.

The diverse and idiosyncratic nature of these problematic cases prevented the establishment of a systematic and simple cleaning process. Many of these cases were resolved individually by observing and comparing time series and accounting for the information available. We proceeded in consecutive steps in which we applied the different types of intervention mentioned above, either after visually inspecting individual time series or by applying a systematic procedure. These steps are detailed below:

Remove the time series (i.e. unique POP_ID - GFE_ID associations) with only NAs values for MAX_ESTIMATE in all_areas_nuseds (corresponding to 4424 time series, 101,944 rows or 25% of all_areas_nuseds)
Assess all the references to time series present in conservation_unit_system_sites but absent in all_areas_nuseds (247 out of 714 time series in all_areas_nuseds are not referenced in conservation_unit_system_sites)

2.a. Cases with presence of alternative series in all_areas_nuseds (n = 3)
- Each case was assessed visually.
2.b. Cases with no alternative time series in all_areas_nuseds (n = 243)
- The rows corresponding to these time series are removed from conservation_unit_system_sites (242 of the 243 references correspond to time series in all_areas_nuseds removed in step 1. above because they contained only NAs).
2.c. Case with multiple GFE_ID for a single POP_ID (n = 1)
- The problematic time series was simply removed from conservation_unit_system_sites.

Case of time series with the same POP_ID but different GFE_IDs in conservation_unit_system_sites and present in all_areas_nuseds (n = 1).

Two time series had the same POP_ID but different GFE_ID. Both time series were kept because they are not compatible and have many and recent data points.

Assess all the time series present in all_areas_nuseds but absent in conservation_unit_system_sites (n = 242)

4.a. Cases of series with alternative POP_ID and no alternative GFE_ID (n = 126)
- Each case was assessed visually.
4.b. Cases of series with alternative GFE_ID with or without alternative POP_ID (n = 64)
- Problematic series with the same POP_IDs (e.g. series with POP_ID = 47925 and GFE_ID = 15, 31516 and 31740, respectively) having the same alternative series (e.g. series with POP_ID = 47925 and GFE_ID = 14) are assessed together, resulting in 37 groups of time series to assess separately.
4.c. Cases with no alternative series (n = 166)
- We attempted to associate these time series to a CU using the fields X_LONGT and Y_LAT to intersect them with the CUs’ shapefiles used in the PSE.
- Time series without coordinates (or for which coordinates could not be found) were removed from all_areas_nuseds (n = 5).
- The reference of the rest of the time series was added to conservation_unit_system_sites

Merge all_areas_nuseds and conservation_unit_system_sites using the fields POP_ID and GFE_ID.
Assess time series with different POP_ID but a same CU_NAME and GFE_ID (n = 126).

Case 1: there is only one data point in one of the series; if it is compatible, then merge to the alternative series, otherwise remove it from all_areas_nuseds.
Case 2: One of the series is 100% duplicated with the other one: to remove from all_areas_nuseds.
Case 3: the rest of the series are merged; adding the values when data points are in conflict.

Attribution of the geographic regions to CUs

We matched the cleaned NuSEDS data with the CUs and streams defined in the Pacific Salmon Explorer, which are associated with a given region (PSE). We used the fields POP_ID, FULL_C_IN and CU_NAMEto match the CUs and the fields SYSTEM_SITE, X_LONGT and Y_LAT to match the streams. There are 23 CU_NAME not matching to any CUs in the PSE, corresponding to 87 populations and 2256 data points. These time series are kept in the dataset 2_nuseds_cuid_streamid.csv but do not appear in the PSE.

Estimation of population counts

The observed count (or MAX_ESTIMATE in the cleaned dataset) for each population is the maximum value of the following fields in all_areas_nuseds: NATURAL_ADULT_SPAWNERS, NATURAL_JACK_SPAWNERS, NATURAL_SPAWNERS_TOTAL, ADULT_BROODSTOCK_REMOVALS, JACK_BROODSTOCK_REMOVALS, TOTAL_BROODSTOCK_REMOVALS, OTHER_REMOVALS and TOTAL_RETURN_TO_RIVER.

Coordinates adjustment

There are 41 locations (with a unique GFE_ID) with identical coordinates in conservation_unit_system_sites. This is because the Y_LAT and X_LONGT field does not corresponds to the survey locations but to the “location of the mouth of the waterbody if flowing, or the centroid if not”. Consequently, all locations within a water body have the same coordinates. In large rivers (e.g. Thompson river), such imprecision can shift the coordinates 100s of km away from the true assessment location. We consequently manually adjusted the coordiantes of certain locations to prevent duplication and to improve spatial accuracy.