---
execute:
  echo: false
date: "`r Sys.Date()`"
format:
  html:
    code-fold: true
    code-tools: true
    code-summary: "Show the code"
---

# Data Sourcing
```{r echo = F, message = F}
## initialization
# clear environment
rm(list=ls())
# load packages
library(tidyverse)
library(lubridate)
library(readr)
library(readxl)
library(writexl)
library(hms)
library(plotly)
library(DT)
library(openxlsx)
library(leaflet)
library(ggpubr)
library(plotrix)
library(remotes)
library(magrittr)
library(janitor)
library(xfun)
library(cowplot)
#install_github("USGS-R/EGRET")
select <- dplyr::select
# to do: add explanatory content to this qmd document
# - reference the data absence issue in EPA How's My Waterway (fix in progress)
# - reference the total extent of data available in EPA WQX
# - make sure the outliers.xlsx doc can be downloaded & accessed
# - modify such that outliers are labeled and present in the table downloads, but not shown in the boxplots
# - confirm outlier removal works; organize to prep for EPA WQX sync. WAIT UNTIL ALL DATA SYNCED TO THIS DOCUMENT BEFORE EXCISING OUTLIERS
```
```{r echo = F}
##### Notes on data sourcing
# Data are sourced from the following queries at <https://waterqualitydata.us>, run on Feb 24, 2021:
# CSV download for SAMPLE data: <https://www.waterqualitydata.us/portal/#bBox=-151.322501%2C60.274310%2C-149.216144%2C60.738915&mimeType=csv&dataProfile=narrowResult>
# CSV download for STATION data: <https://www.waterqualitydata.us/portal/#countrycode=US&statecode=US%3A02&countycode=US%3A02%3A122&bBox=-151.322501%2C60.274310%2C-149.216144%2C60.738915&mimeType=csv>
# Note: these CSV files are excluded from the GitHub repository because they are too large to sync. To reproduce the analysis, download and save these files locally instead (see the ReadMe file at data/WQX_downloads in the repository).
# Re-running these same queries in the future will download the most current CSV files, which may include additional data received since the last time this script was run.
# We initially attempted to perform this retrieval with the dataRetrieval package, but were not successful. As of 2025, this functionality has transitioned to EPA's "TADA" (Tools for Automated Data Analysis) package.
#library(dataRetrieval)
# Note: in the future, I hope to automatically retrieve the most current water quality data from the EPA WQX.
```
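
For future automation, a programmatic query could replace the manual portal downloads described above. The chunk below is an untested sketch using the dataRetrieval package; the bounding box and the "narrowResult" profile mirror the manual portal queries, and the object names (`kenai_bbox`, `wqp_samples`, `wqp_sites`) are hypothetical.

```{r eval = F}
# hypothetical sketch: retrieve the same WQP data programmatically with dataRetrieval
library(dataRetrieval)

# bounding box matching the portal queries above (west, south, east, north)
kenai_bbox <- "-151.322501,60.274310,-149.216144,60.738915"

# sample results, using the same "narrowResult" data profile as the manual download
wqp_samples <- readWQPdata(bBox = kenai_bbox, dataProfile = "narrowResult")

# station metadata for the same bounding box
wqp_sites <- whatWQPsites(bBox = kenai_bbox)

# EPA's TADA package (the successor tooling noted above) also wraps WQP queries;
# see TADA::TADA_DataRetrieval() for a maintained interface
```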
```{r echo = F, warning=FALSE, results='hide',message=FALSE}
# prepare and pre-process data that was downloaded from EPA WQX as described above
# designate download date
download_date <- "20230215"
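# added check (a minimal guard, not in the original workflow): confirm the
# locally saved WQX downloads exist before reading, since these large CSVs
# are excluded from the GitHub repository (see notes above)
stopifnot(file.exists("other/input/WQX_downloads/narrowresult/narrowresult.csv"),
          file.exists("other/input/WQX_downloads/station/station.csv"))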
# import sample data
dat <- read_csv("other/input/WQX_downloads/narrowresult/narrowresult.csv") %>%
# retain only KWF data
filter(OrganizationFormalName == "Kenai Watershed Forum(Volunteer)*") %>%
# remove empty columns
select_if(~!(all(is.na(.))))
# remove extraneous text from "Kenai Watershed Forum(Volunteer)*"
dat %<>%
mutate(OrganizationFormalName = gsub("\\s*\\([^\\)]+\\)","",as.character(OrganizationFormalName))) %>%
mutate(OrganizationFormalName = str_remove(OrganizationFormalName,"\\*"))
# import station data, from a separate file
site_dat <- read_csv("other/input/WQX_downloads/station/station.csv") %>%
# retain only Kenai Baseline sites (contains "KBL")
filter(grepl("KBL", MonitoringLocationName)) %>%
# remove empty columns
select_if(~!(all(is.na(.))))
# standardize the organization name (replaces "Kenai Watershed Forum(Volunteer)*" directly)
site_dat %<>%
mutate(OrganizationFormalName = "Kenai Watershed Forum")
# Retain surface water sites only. Exclude well sampling sites.
# create and apply filter
#surface <- c("River/Stream","Lake","River/Stream Perennial","BEACH Program Site-Ocean","BEACH Program Site-River/Stream","Lake, Reservoir, Impoundment","Stream","Spring")
#site_dat %<>%
# filter(OrganizationFormalName == "Kenai Watershed Forum")
#filter(MonitoringLocationTypeName %in% surface)
# join site and sample data
dat <- left_join(dat,site_dat) %>%
# retain only observations with all site data
filter(!is.na(LatitudeMeasure))
# in the future, use the EGRET package (http://usgs-r.github.io/EGRET/articles/EGRET.html) for automated database queries.
# potentially also useful: https://waterdata.usgs.gov/nwis/inventory?search_criteria=lat_long_bounding_box&submitted_form=introduction
# What organizations in the database contain the word "Kenai"?
# filter if OrganizationFormalName has the term "Kenai"
#kenai_orgs <- dat %>%
# select(OrganizationIdentifier,OrganizationFormalName) %>%
# filter(grepl("Kenai",OrganizationFormalName)) %>%
# distinct()
# What kind of sites are present in our data set?
#site_types <- data.frame(unique(dat$MonitoringLocationTypeName))
```
```{r echo = F, warning=FALSE, results='hide',message=FALSE}
### Miscellaneous additional dataframe prep ###
# specify date format
dat %<>%
mutate(ActivityStartDate = date(ActivityStartDate)) %>%
# remove missing observations
filter(!is.na(ResultMeasureValue))
# Remove air temperature observations
dat %<>%
filter(CharacteristicName != "Temperature, air")
# UNITS
# bacteria - fecal coliform should all be "cfu/100ml"
dat <- dat %>%
mutate(`ResultMeasure/MeasureUnitCode` = case_when(
CharacteristicName == "Fecal Coliform" ~ "cfu/100ml",
TRUE ~ `ResultMeasure/MeasureUnitCode`))
# SITE NAME CORRECTIONS
# Create and export table of site names
site_tbl <- dat %>%
group_by(MonitoringLocationName,LatitudeMeasure,LongitudeMeasure) %>%
summarise(min_date = min(ActivityStartDate),
max_date = max(ActivityStartDate)) %>%
rename("latitude" = "LatitudeMeasure",
"longitude" = "LongitudeMeasure")
# export csv of site names as received from EPA WQX
write.csv(site_tbl,"other/output/field_qa_qc_data/2000_2014_sitenames.csv", row.names = F)
# read back in the manually edited table with info on mainstem/trib and river mile
site_tbl <- read_csv("other/input/baseline_sites.csv") %>%
remove_empty() %>%
select(-latitude,-longitude)
# 2/15/2023: having trouble with this join for unknown reasons.
# Instead, we can extract trib/mainstem info and river mile from the existing MonitoringLocationName column (e.g. KBL_t_00.0 = Kenai Baseline, tributary, river mile 0.0)
dat %<>%
separate_wider_delim(MonitoringLocationName, delim = "_", names = c("project","trib_mainstem","river_mile"), cols_remove = F)
# transition column names to "easier" formats
dat %<>% clean_names()
# export a csv with current column names, to provide the column format names for a separate, manually curated outlier .xlsx file located in the same directory
write.csv(dat,"other/input/outliers/baseline_outliers.csv")
# Create column of tributary names (e.g. Beaver Creek) to associate with tributary river miles
# read in tributary names
trib_names <- read.csv("other/input/baseline_sites.csv") %>%
select(MonitoringLocationName, tributary_name, tributary_mainstem) %>%
filter(tributary_mainstem == "tributary") %>%
clean_names() %>%
select(-tributary_mainstem)
# join trib names to overall table
dat %<>% left_join(trib_names, by = "monitoring_location_name")
# create general "site name" column. use tributary name if available, and river mile if not
dat %<>%
mutate(river_mile = as.character(as.numeric(river_mile))) %>%
mutate(site_name = case_when(
!is.na(tributary_name) ~ tributary_name,
is.na(tributary_name) ~ river_mile))
```
```{r echo = F, warning=FALSE, message=FALSE}
#### SEASONS BOUNDARY ####
### Designate samples as "spring" vs "summer" for each observation ("season")
# choose a "boundary date" between spring and summer (see appendix "Sample Event Timing" for rationale)
# boundary date selected is June 4 (julian day 155)
boundary_date <- 155
dat %<>%
mutate(julian_day = yday(activity_start_date)) %>%
mutate(season = case_when(
julian_day > 155 ~ "Summer",
julian_day < 155 ~ "Spring"
))
```
```{r echo = F, warning=FALSE, message=FALSE}
# OUTLIERS
# exclude manually identified outliers with an anti_join
# see "other/input/outliers/baseline_outliers.xlsx"
outliers <- read.xlsx("other/input/outliers/baseline_outliers.xlsx") %>%
select(activity_identifier,characteristic_name)
dat <- anti_join(dat,outliers, by = c("activity_identifier","characteristic_name"))
```
```{r echo = F, warning=FALSE, message=FALSE}
# EXPORT PREPROCESSED DATA TO LOCAL PROJECT DIRECTORY
# write finalized data output from all the steps above, to be used in the scripts that follow
# this export format represents the most granular presentation; nothing has been aggregated yet (the "reg_limits.qmd" script aggregates volatiles, for example)
write.csv(dat,"other/output/analysis_format/baseline_export_format.csv",row.names = F)
```
```{r eval = F}
### Exploratory Data Analysis
### Consider including these numbers in section intro
# How many years of data do we have, including from the waterqualitydata.us database AND the KWF local server?
years_tbl <- dat %>%
filter(organization_formal_name == "Kenai Watershed Forum") %>%
summarise(min_date = min(activity_start_date),
max_date = max(activity_start_date))
# How many different kinds of substances have we measured?
total_parameters <- dat %>%
filter(organization_formal_name == "Kenai Watershed Forum") %>%
distinct(characteristic_name) %>%
count()
# What are the names of all the substances we have measured?
parameter_names <- dat %>%
filter(organization_formal_name == "Kenai Watershed Forum") %>%
distinct(characteristic_name)
```