---
execute:
  echo: false
date: "`r Sys.Date()`"
format:
  html:
    code-fold: true
    code-tools: true
    code-summary: "Show the code"
---

# Data Sourcing

```{r echo = F, message = F}
## initialization

# clear environment
rm(list = ls())

# load packages
library(tidyverse)
library(lubridate)
library(readr)
library(readxl)
library(writexl)
library(hms)
library(plotly)
library(DT)
library(openxlsx)
library(leaflet)
library(ggpubr)
library(plotrix)
library(remotes)
library(magrittr)
library(janitor)
library(xfun)
library(cowplot)
#install_github("USGS-R/EGRET")

select <- dplyr::select

# to do: add explanatory content to this qmd document
# - ref to data absence in EPA How's My Waterway; fix in progress
# - ref to total extent of data in EPA WQX
# - make sure the outliers.xlsx doc can be downloaded & accessed
# - modify such that outliers are labeled and present in the table downloads, but not shown in the boxplots
# - confirm outlier removal works; organize to prep for EPA WQX sync. WAIT UNTIL ALL DATA IS SYNCED TO THIS DOCUMENT BEFORE EXCISING OUTLIERS
```

```{r echo = F}
##### Notes on initial data sourcing

# Data is sourced from the following queries at <https://waterqualitydata.us> on Feb 24, 2021:

# CSV download for SAMPLE data: <https://www.waterqualitydata.us/portal/#bBox=-151.322501%2C60.274310%2C-149.216144%2C60.738915&mimeType=csv&dataProfile=narrowResult>

# CSV download for STATION data: <https://www.waterqualitydata.us/portal/#countrycode=US&statecode=US%3A02&countycode=US%3A02%3A122&bBox=-151.322501%2C60.274310%2C-149.216144%2C60.738915&mimeType=csv>

# Note: these CSV files are excluded from the GitHub repository because they are too large to sync. To reproduce the analysis, download and save these files locally instead. (See the ReadMe file at data/WQX_downloads in the repository.)

# Using these same queries in the future will download the most current CSV files, which may have received additional data since the last time this script was run.

# We initially attempted to perform the task described above with the dataRetrieval package, but were not successful. As of 2025 that tool has transitioned to the EPA's "TADA" (Tools for Automated Data Analysis) package.
#library(dataRetrieval)

# Note: in the future, I hope to automatically retrieve the most current water quality data from the EPA WQX
```

We downloaded all results from the U.S. Water Quality Portal at https://www.waterqualitydata.us/ with the organization ID "Kenai Watershed Forum" on March 5, 2025. Results from this download populate the chapters throughout.

For an example of how data is prepared for submission to the Water Quality Portal, see Appendix A.
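The notes above describe a goal of eventually pulling current results directly from the Water Quality Portal rather than from saved CSV downloads. Below is a minimal sketch of what that could look like with the dataRetrieval package; it assumes `readWQPdata()` and `whatWQPsites()` accept the WQP `organization` query parameter, and it is untested here (the EPA's TADA package may now be the recommended route).

```{r eval = F, echo = T}
# hypothetical sketch (not part of the current workflow): query the Water
# Quality Portal web services directly instead of reading saved CSV files
library(dataRetrieval)

# sample results for the Kenai Watershed Forum organization
wqp_dat <- readWQPdata(organization = "KENAI_WQX")

# station metadata for the same organization
wqp_sites <- whatWQPsites(organization = "KENAI_WQX")
```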
```{r echo = F, message = F}
# inspect the download results queried from the urls above (sample results downloaded 3/5/2025; station data 2/3/2025)
# (note that as of Feb 2025, a note at the above url describes a future transition to a v3.0 of the WQP with "improved features")

# import sample data
dat <- read_csv("other/input/WQX_downloads/wqp_download_20250305/narrowresult.csv") %>%
  # retain only KWF data
  filter(OrganizationFormalName == "Kenai Watershed Forum") %>%
  # remove empty columns
  select_if(~!(all(is.na(.))))

# remove extraneous text from "Kenai Watershed Forum(Volunteer)*"
#dat %<>%
#  mutate(OrganizationFormalName = gsub("\\s*\\([^\\)]+\\)","",as.character(dat$OrganizationFormalName))) %>%
#  mutate(OrganizationFormalName = str_remove(OrganizationFormalName,"\\*"))

# import station data from a separate file
site_dat <- read_csv("other/input/WQX_downloads/wqp_download_20250203/station.csv") %>%
  # retain only Kenai Watershed Forum stations
  filter(OrganizationIdentifier == "KENAI_WQX") %>%
  # remove empty columns
  select_if(~!(all(is.na(.))))

# note: the MonitoringLocationName strings for sites uploaded in 2021 do not match those from 2000 - 2013

# join site and sample data
dat <- left_join(dat, site_dat) %>%
  # retain only observations with all site data
  filter(!is.na(LatitudeMeasure)) %>%
  # retain only Kenai Baseline sites (names contain "KBL")
  filter(grepl("KBL", MonitoringLocationName))

### Miscellaneous additional dataframe prep ######

# specify date format
dat %<>%
  transform(ActivityStartDate = date(ActivityStartDate)) #%>%
  # remove missing observations
#  filter(!is.na(ResultMeasureValue))

# Remove air temperature observations
dat %<>%
  filter(CharacteristicName != "Temperature, air")
```

```{r echo = F, message = F}
# UNITS
# confirm that each parameter has consistent units. View the dataframe here, and if units are not consistent, apply temporary corrections here. Ultimately the corrections need to be applied to the WQP upload itself, so this code should still accommodate the read-in once those corrections are made.

# view dataframe of unique units by parameter
parameter_units <- dat %>%
  select(CharacteristicName, `ResultMeasure/MeasureUnitCode`) %>%
  distinct() %>%
  group_by(CharacteristicName) %>%
  mutate(unit_id = paste0("unit_", row_number())) %>%
  pivot_wider(names_from = unit_id, values_from = `ResultMeasure/MeasureUnitCode`)

# specific conductance
# spring 2013 values have units of mS/cm rather than uS/cm. In the context of the existing data they appear within normal range (not consistently 1000x greater), so apply a manual unit correction for spring 2013 SpCond
dat <- dat %>%
  mutate(`ResultMeasure/MeasureUnitCode` = case_when(
    CharacteristicName == "Specific conductance" & ActivityStartDate == "2013-05-07" ~ "uS/cm",
    TRUE ~ `ResultMeasure/MeasureUnitCode`))

# bacteria - fecal coliform units should all be "cfu/100ml"
dat <- dat %>%
  mutate(`ResultMeasure/MeasureUnitCode` = case_when(
    CharacteristicName == "Fecal Coliform" ~ "cfu/100ml",
    TRUE ~ `ResultMeasure/MeasureUnitCode`))
```
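A quick way to audit unit consistency at any point in this preparation is to count distinct unit codes per parameter. A hedged sketch, not part of the original workflow: any parameter it returns still carries more than one unit code, such as the total metals addressed next.

```{r eval = F, echo = T}
# illustrative check: list parameters that still map to more than one unit code
dat %>%
  distinct(CharacteristicName, `ResultMeasure/MeasureUnitCode`) %>%
  count(CharacteristicName) %>%
  filter(n > 1)
```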
```{r echo = F, message = F}
# TOTAL METALS (Mg, Ca, Fe, others)
# address unit inconsistency introduced by 2021 data: some values are reported in ug/L and some in mg/L
dat <- dat %>%
  # fix unit values
  mutate(`ResultMeasureValue` = as.numeric(`ResultMeasureValue`)) %>%
  mutate(`ResultMeasureValue` = case_when(
    CharacteristicName %in% c("Magnesium","Calcium","Iron") & `ResultMeasure/MeasureUnitCode` == "ug/L" ~ `ResultMeasureValue` / 1000,
    TRUE ~ `ResultMeasureValue`
  )) %>%
  # convert back to character for the rest of this preparation, here?
  # fix unit names
  mutate(`ResultMeasure/MeasureUnitCode` = case_when(
    CharacteristicName %in% c("Magnesium","Calcium","Iron") & `ResultMeasure/MeasureUnitCode` == "ug/L" ~ "mg/L",
    TRUE ~ `ResultMeasure/MeasureUnitCode`
  ))

# SAMPLE FRACTION
# address characteristic form (total, dissolved, etc.). How many unique forms per parameter?
parameter_fraction <- dat %>%
  select(CharacteristicName, ResultSampleFractionText) %>%
  distinct() %>%
  group_by(CharacteristicName) %>%
  mutate(unit_id = paste0("unit_", row_number())) %>%
  pivot_wider(names_from = unit_id, values_from = ResultSampleFractionText)

# CHARACTERISTIC NAME CORRECTIONS
# Nitrogen
dat <- dat %>%
  mutate(CharacteristicName = str_replace_all(
    CharacteristicName,
    c(
      "Inorganic nitrogen \\(nitrate and nitrite\\) \\*\\*\\*retired\\*\\*\\*use Nitrate \\+ Nitrite" = "Nitrate + Nitrite",
      "Inorganic nitrogen \\(nitrate and nitrite\\)" = "Nitrate + Nitrite"
    )
  ))

# SITE NAME CORRECTIONS
# Create and export table of site names
site_tbl <- dat %>%
  group_by(MonitoringLocationName, LatitudeMeasure, LongitudeMeasure) %>%
  summarise(min_date = min(ActivityStartDate),
            max_date = max(ActivityStartDate)) %>%
  rename("latitude" = "LatitudeMeasure",
         "longitude" = "LongitudeMeasure")

# export csv of site names as received from EPA WQX
write.csv(site_tbl, "other/output/field_qa_qc_data/2000_2021_sitenames.csv", row.names = F)

# read back in the manually edited table with info on mainstem/trib and river mile
site_tbl <- read_csv("other/input/baseline_sites.csv") %>%
  remove_empty() %>%
  select(-latitude, -longitude)

# 2/15/2023: having some trouble with this join for unknown reasons.
# So, instead, extract trib/mainstem info and river mile from the existing MonitoringLocationName column (KBL_t_00.0 = Kenai baseline, tributary, river mile 0)
dat %<>%
  separate_wider_delim(MonitoringLocationName, delim = "_",
                       names = c("project", "trib_mainstem", "river_mile"),
                       cols_remove = F)

# transition column names to "easier" formats
dat %<>%
  clean_names()

# export a csv with current column names, to provide the column name format for a separate, manually curated outlier .xlsx file located in the same directory
write.csv(dat, "other/input/outliers/baseline_outliers.csv")

# Create column of tributary names (e.g. Beaver Creek) to associate with tributary river miles
# read in tributary names
trib_names <- read.csv("other/input/baseline_sites.csv") %>%
  select(MonitoringLocationName, tributary_name, tributary_mainstem) %>%
  filter(tributary_mainstem == "tributary") %>%
  clean_names() %>%
  select(-tributary_mainstem)

# join trib names to overall table
dat %<>%
  left_join(trib_names, by = "monitoring_location_name")
```
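To make the site-ID convention concrete, here is a small illustrative example of the `separate_wider_delim()` step above, using the one real pattern documented in the comments ("KBL_t_00.0") plus a mainstem-style ID invented purely for illustration:

```{r eval = F, echo = T}
# illustrative only: how "KBL_t_00.0" (Kenai baseline, tributary, river mile 0)
# splits into project / trib_mainstem / river_mile columns.
# "KBL_m_06.5" is a hypothetical ID, not an actual monitoring site.
tibble(MonitoringLocationName = c("KBL_t_00.0", "KBL_m_06.5")) %>%
  separate_wider_delim(MonitoringLocationName, delim = "_",
                       names = c("project", "trib_mainstem", "river_mile"),
                       cols_remove = FALSE)
```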
```{r echo = F, message = F}
# create general "site name" column: use tributary name if available, and river mile if not
dat %<>%
  transform(river_mile = as.character(as.numeric(river_mile))) %>%
  mutate(site_name = case_when(
    !is.na(tributary_name) ~ tributary_name,
    is.na(tributary_name) ~ river_mile))

#### SEASONS BOUNDARY ######

# Designate samples as "spring" vs "summer" for each observation ("season")
# choose a "boundary date" between spring and summer (see appendix "Sample Event Timing" for rationale)
# boundary date selected is June 4 (julian day 155)
boundary_date <- 155

dat %<>%
  mutate(julian_day = yday(activity_start_date)) %>%
  mutate(season = case_when(
    julian_day >= boundary_date ~ "Summer",
    julian_day < boundary_date ~ "Spring"
  ))

# OUTLIERS
# exclude outliers with an anti_join
# see manually identified outliers at "other/input/outliers/baseline_outliers.xlsx"
outliers <- read.xlsx("other/input/outliers/baseline_outliers.xlsx") %>%
  select(activity_identifier, characteristic_name)

dat <- anti_join(dat, outliers)

# EXPORT PREPROCESSED DATA TO LOCAL PROJECT DIRECTORY, FOR PLOTS
# write finalized data output from all the steps above, to be used in the scripts that follow
# this export format is the most granular presentation; nothing has been aggregated yet (the "reg_limits.qmd" script aggregates volatiles, for example)
write.csv(dat, "other/output/analysis_format/baseline_export_format.csv", row.names = F)

# to do: ensure the final result includes "ResultStatusIdentifier". Keep rejected results in the table downloads, but do not display them in plots
```

```{r}
# stop evaluation here
knitr::knit_exit()
```