contact: rpauloo@ucdavis.edu
Twitter
Website

Preferred Citation

If you wish to use this work in any way, please cite it as:

Pauloo, Rich (2018, April 30). An Exploratory Data Analysis of California’s Well Completion Reports. Retrieved from https://richpauloo.github.io/oswcr_1.html

If you wish to obtain clean well completion report data from a region of interest, please visit this application.


Introduction

In California, whenever a well is drilled, a log of what the driller encounters is recorded and submitted to the state. Until late 2017, all well logs in California were privately held data. With the passage of groundwater sustainability legislation in California, this data was released to the public. This data gives us insight into the mysterious groundwater aquifers of California, where surprisingly little data exists for this vital public resource that provides anywhere from 40-60% of the water consumed annually in the state.

The California DWR Online Well Completion Report Database contains ~1,000,000 digitized well logs. What information can be mined from these logs, and what would it tell us about aquifers in California?

A water modeler in a basin might be interested in what hydrogeologic data is available, and how they might use that data to parameterize a groundwater flow or contaminant transport model. Or imagine that a state or federal agency with limited funds to disburse needed to predict a community’s drought vulnerability, or characterize at-risk communities, to mitigate future water shortages. What are the spatial and socioeconomic patterns in well completion, and are there disparities in who is affected by drought? Lastly, consider a local agency that wants to digitize a portion of well completion reports in their subbasin. Which well reports in their basin offer high quality data, and which should they select?

The purpose of this analysis (mirrored in the table of contents) is to:

  • Assess data quality
    • identify missing data
    • clean data relevant to this analysis
      • lat/lon
      • group wells into Bulletin 118 subbasins
      • well type lemmatization
    • identify areas requiring further attention
  • Answer some interesting questions
    • When do Californians drill wells?
    • What types of wells have been completed?
    • Where were the domestic and agricultural wells in the 1977 and 1991 droughts drilled?
    • Are domestic wells drilled in socioeconomically disadvantaged areas?
    • Where have monitoring and injection wells been completed?
    • Does well construction follow a seasonal trend?
    • Have trends in well construction changed over time?
    • Who drills wells in California?
    • What subbasins are the most “data dense”?
  • Calculate and visualize regional-scale aquifer parameters important in hydrologic modeling
    • Aquifer Transmissivity
    • Perforated Interval Thickness, Top, Bottom for Domestic and Agricultural Wells

Assess Data Quality

To get started, first load up some packages for the analysis and visualization.

library(sp) # spatial tools
library(rgdal) # spatial tools
library(rgeos) # spatial tools
library(raster) # spatial tools
library(scales) # for plotting with geom_sf
library(tidyverse) # general purpose data science toolkit
library(sf) # spatial tools, and plotting spatial data
library(here) # convenience functions to set machine and system agnostic working directory
library(maptools) # more spatial tools
library(viridis) # color scales for visualization
library(leaflet) # interactive maps
library(ggplot2) # for visualization, especially geom_sf
library(ggthemes) # for plotting themes
library(stringr) # for dealing with strings
library(knitr) # for tables
library(colormap) # for color palettes

theme_set(theme_bw(base_size = 20)) # set global theme for plots

Bring in the raw data and inspect each column. We have close to 1,000,000 observations of well logs, and 46 variables per well log.

# dat <- read_csv("OSWCR_201801.csv") # save as Rds for efficient loading
# saveRDS(dat, "OSWCR_201801.Rds")
dat <- readRDS("OSWCR_201801.Rds")
glimpse(dat)
## Observations: 943,469
## Variables: 46
## $ WCRNumber                    <chr> "WCR2016-003980", "WCR2016-004287...
## $ LegacyLogNumber              <chr> NA, "e0306175", "e0303164", "e030...
## $ RegionOffice                 <chr> "DWR Northern Region Office", "DW...
## $ CountyName                   <chr> "Colusa", "Colusa", "Colusa", "Co...
## $ LocalPermitAgency            <chr> "Colusa County Environmental Heal...
## $ PermitDate                   <chr> "04/12/2016", "1/29/2016", NA, "3...
## $ PermitNumber                 <chr> "WP 896", "WP860", NA, "WO 088", ...
## $ OwnerAssignedWellNumber      <chr> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ WellLocation                 <chr> NA, "Ohm RD", "Dry Slough RD", "S...
## $ City                         <chr> NA, "Arbuckle", "Grimes", "Arbuck...
## $ PlannedUseFormerUse          <chr> "Water Supply Irrigation - Agricu...
## $ DrillerName                  <chr> "SULLIVAN DRILLING INC", "SULLIVA...
## $ DrillerLicenseNumber         <chr> "656504", "656504", "656504", "65...
## $ RecordType                   <chr> "WellCompletion/New/Production or...
## $ DecimalLatitude              <dbl> 39.06429, 39.06410, 39.10716, 39....
## $ DecimalLongitude             <dbl> -122.0507, -122.0321, -121.9406, ...
## $ MethodofDeterminationLL      <chr> "Derived from TRS", "Derived from...
## $ LLAccuracy                   <chr> "Centroid of Section", "Centroid ...
## $ HorizontalDatum              <chr> "WGS84", "WGS84", "WGS84", "WGS84...
## $ GroundSurfaceElevation       <chr> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ ElevationAccuracy            <chr> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ ElevationDeterminationMethod <chr> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ VerticalDatum                <chr> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Township                     <chr> "14N", "14N", "15N", "14N", "14N"...
## $ Range                        <chr> "02W", "02W", "01W", "02W", "02W"...
## $ Section                      <chr> "14", "13", "35", "22", "34", "31...
## $ BaselineMeridian             <chr> "Mount Diablo", "Mount Diablo", "...
## $ APN                          <chr> "018.180.052", "018-180-035", "01...
## $ DateWorkEnded                <chr> "5/24/2016 0:00:00", "4/7/2016 0:...
## $ WorkflowStatus               <chr> "Completeness Review - Complete -...
## $ ReceivedDate                 <chr> "6/16/2016 0:00:00", "5/12/2016 0...
## $ TotalDrillDepth              <int> 1000, 1000, 820, 500, 910, 1200, ...
## $ TotalCompletedDepth          <dbl> 860, 960, 620, 295, 700, 1160, 18...
## $ TopOfPerforatedInterval      <dbl> NA, 280, 220, 255, 240, 420, 80, ...
## $ BottomofPerforatedInterval   <dbl> NA, 940, 620, 295, 680, 1130, 160...
## $ CasingDiameter               <dbl> NA, 16.000, 18.000, 8.000, 17.400...
## $ DrillingMethod               <chr> "Reverse Circulation", "Reverse C...
## $ Fluid                        <chr> "Bentonite", "Bentonite", "Benton...
## $ StaticWaterLevel             <dbl> 77, 48, 24, 86, 120, 229, 4, NA, ...
## $ TotalDrawDown                <int> NA, NA, NA, NA, NA, 135, 53, NA, ...
## $ TestType                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ PumpTestLength               <int> NA, NA, NA, NA, NA, 8, 4, 2, NA, ...
## $ WellYield                    <dbl> NA, NA, NA, NA, NA, NA, 60.0, NA,...
## $ WellYieldUnitofMeasure       <chr> NA, NA, NA, NA, NA, NA, "GPM", NA...
## $ OtherObservations            <chr> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ MTRS                         <chr> "M14N02W14", "M14N02W13", "M15N01...

Identify Missing Data

There’s a wealth of data here, but how much of it is missing? The plot below shows data that is present in purple, and missing data in yellow. Each horizontal bar represents one variable in the dataset.

# return the fraction of missing and present values for column `var` of `dat`
get_na <- function(var){
  temp <- vector(length = 2)
  temp[1] <- dat %>% filter(is.na(get(var))) %>% nrow() / nrow(dat) # fraction missing
  temp[2] <- 1 - temp[1]                                            # fraction present
  return(temp)
}

lapply(colnames(dat), get_na) -> temp

na_plot <- data.frame(variable = rep(colnames(dat), each = 2), 
                      state = factor(c("missing", "present"), ordered = F, 
                                     levels =  c("present","missing")),
                      val = do.call(c, temp),
                      order = do.call(rbind, temp)[,2] %>% rep(each=2))

library(forcats) # for reordering factors
ggplot(na_plot) + 
  geom_col(aes(fct_reorder(variable,order), val, fill = state), alpha = 0.85) + 
  coord_flip() + labs(y = "% present", x = "") + 
  theme(legend.title=element_blank()) + 
  scale_fill_manual(values=colormap(colormaps$viridis, nshades=2)) +
  scale_y_reverse(breaks = c(1,.75,.5,.25,0), labels = c(0,25,50,75,100)) + theme_bw(base_size = 18)

To summarize a few key points:

  • About 95% of well coordinates are known, although many are not exact and are instead derived by some method (e.g., from the township, range, and section)
  • specs regarding well construction and use are known to a lesser degree
  • About 26% of well types (PlannedUseFormerUse) are missing. These values might be computationally imputed based on existing variables, and on variables that could be engineered, such as:
    • depth of the well and thickness of the screened interval
    • distance to the closest city (public v domestic v agricultural wells)
    • the driller’s permit number (some drillers might specialize in one or a few types of well)
    • whether or not the well’s lat/lon shows it overlapping with a residential neighborhood (domestic well)
    • the NDVI at the lat/lon (agriculture)
    • the type of nearby wells (well types may cluster together)
  • There’s no information at all on the ground surface elevation at each well, but this is easily obtained by joining the coordinates to a Digital Elevation Model.
  • The biggest gap in the data is a variable that isn’t even here: geology.
    • What rock types are encountered in each well log, and at what depth intervals?
    • Groundwater flow and transport models depend on this data
    • Identifying sites to maximize groundwater recharge rates also depends on this data
    • This information can only be obtained from the digitized individual well completion reports, a painstaking task that requires interpretation and geologic understanding. This may be a task well suited for the right combination of geologic insight and image-processing neural networks.

The plot above shows that about half of the values in our dataset are missing. However, the glass is half full. The data that is present will be used to answer some interesting questions.


Clean Data Relevant to this Analysis

I’m going to start by renaming some variables that will be used a lot in this analysis to make typing easier.

# first make a copy of the column names
dat_colnames <- colnames(dat) 

dat <- dat %>% rename(lat = DecimalLatitude, lon = DecimalLongitude, 
                      type = PlannedUseFormerUse, driller_name = DrillerName,
                      township = Township, range = Range, section = Section,
                      top = TopOfPerforatedInterval, bot = BottomofPerforatedInterval)

Latitude and Longitude

About 5% of wells are missing a latitude or longitude.

dat %>% 
  filter(is.na(lat) | is.na(lon)) %>% nrow() / nrow(dat)
## [1] 0.05356509

We may be tempted to throw these observations out because they’re a small fraction of the overall data, but let’s see if we can sensibly impute where wells “should” be from their Township, Range and Section.

For all but 1 of our missing lat/lon, we have a township, range, and section! This is good news, but we’ll see later that Township, Range and Section are pretty messy data.

dat %>% filter(is.na(lat) | is.na(lon)) %>% nrow() - 
  dat %>% filter((is.na(lat) | is.na(lon)) & !is.na(township) & !is.na(range) & !is.na(section)) %>% nrow()
## [1] 1

Before we impute lat/lon, let’s clean the rest of our lat/lon data, so we only need to do one round of lat/lon imputation.

We’re in California, so all of our latitudes and longitudes should fall within a bounding box:

-124.4096 < Longitude < -114.1308
  32.5343 < Latitude  <  42.0095

Does our data fall within this bounding box? 100% of our latitudes line up with reality, and more than 99% of our longitudes do so as well. That’s good.

lat_min <- 32.5343 - 1
lat_max <- 42.0095 + 1
lon_min <- -124.4096 - 1
lon_max <- -114.1308 + 1

# Longitude
dat %>% 
  filter(!is.na(lon)) %>% 
  filter(lon < lon_max  & lon > lon_min) %>% 
  nrow() / nrow(dat %>% filter(!is.na(lon))) * 100
## [1] 99.98253
# Latitude
dat %>% 
  filter(!is.na(lat)) %>% 
  filter(lat < lat_max & lat > lat_min) %>% 
  nrow() / nrow(dat %>% filter(!is.na(lat))) * 100
## [1] 100

Of the < 1% of longitudes that fall outside of California, 149 are recorded as positive. We will multiply these values by -1.

# filter for positive longitudes
dat %>% 
  filter(lon > 0) %>%
  select(lat, lon)
## # A tibble: 149 x 2
##      lat   lon
##    <dbl> <dbl>
##  1  40.5  124.
##  2  40.6  124.
##  3  40.6  124.
##  4  40.6  124.
##  5  35.3  119.
##  6  36.9  120.
##  7  38.1  121.
##  8  38.1  121.
##  9  36.0  121.
## 10  33.6  117.
## # ... with 139 more rows
# fix incorrect longitudes that should be positive
clean_dat <- dat %>% 
  mutate(lon = ifelse(lon > 0, lon * -1, lon)) 

There are 7 very large negative longitudes. These can be fixed by cross-referencing the township, range and section of these points and identifying the most sensible longitude. This error likely occurred when someone manually entering data typed an extra digit. Moreover, there’s one anomalously large latitude that is clearly a misplaced decimal. It’s only one record, so it didn’t show up earlier in the rounded percentage of latitudes within our bounding box. We’ll correct that too.

clean_dat %>% 
  filter(lon < lon_min) %>% 
  select(lat, lon)
## # A tibble: 7 x 2
##     lat    lon
##   <dbl>  <dbl>
## 1  38.0 -1233.
## 2  37.9 -1322.
## 3  39.8 -1234.
## 4  38.2 -3122.
## 5  38.9 -1321.
## 6  37.3 -1211.
## 7  38.8 -1523.
clean_dat %>% 
  filter(lat > lat_max) %>% 
  select(lat, lon)
## # A tibble: 1 x 2
##     lat   lon
##   <dbl> <dbl>
## 1  388. -122.
# get index of big latitude
big_lat <- clean_dat %>% rownames_to_column() %>% filter(lat > lat_max) %>% select(rowname) %>% `[[`("rowname") %>% as.numeric()

clean_dat$lat[big_lat] <- 38.777 # fix the incorrect value

To fix the 7 big longitudes, first grab the Township, Range and Section for observations with big longitudes.

big_lon <- clean_dat %>% 
  rownames_to_column() %>%
  filter(lon < lon_min) %>% 
  select(rowname, lat, lon, township, range, section) 

big_lon_index <- big_lon %>% `[[`("rowname") %>% as.numeric() # get indices of large longitudes

There are some errors in the township, range, and section data (which are outside the scope of this analysis), but the very large negative longitudes belong to townships, ranges and sections that are correct. Comparing them against the median longitude of all wells in the same township, range and section, we can see where input errors took place and sensibly impute what these longitudes should be.

dat %>% 
  filter(lon > -125) %>% 
  filter(township %in% big_lon$township & 
         range %in% big_lon$range & 
         section %in% big_lon$section) %>% 
  group_by(township, range, section) %>% 
  summarise(median_lon = median(lon)) 
## # A tibble: 197 x 4
## # Groups:   township, range [?]
##    township range section median_lon
##    <chr>    <chr> <chr>        <dbl>
##  1 01S      01E   03           -122.
##  2 01S      01E   04           -122.
##  3 01S      01E   20           -122.
##  4 01S      01E   21           -117.
##  5 01S      04W   03           -122.
##  6 01S      04W   04           -117.
##  7 01S      04W   11           -117.
##  8 01S      04W   19           -117.
##  9 01S      04W   20           -117.
## 10 01S      04W   21           -117.
## # ... with 187 more rows
# overwrite with correct values
clean_dat$lon[big_lon_index] <- c(-122.506, -122.491, -123.551, -122.256, -121.330, -121.865, -123.001) 

How were these coordinates determined, and how accurate are those determinations? Nearly all coordinates in this database were derived from the township, range and section, and are accurate only to the centroid of a section. Sections in the Public Land Survey System (Rectangular Survey System) are one-square-mile squares. That means the great majority of wells in this dataset are plotted at section centroids, and assuming they are all encoded to the correct section, the true location of each well should lie within roughly 0.50 to 0.71 miles of the reported location.

We will assume that these spatial coordinates are close enough to the true values that we can aggregate points into larger subbasins and perform spatial analysis at that aggregated scale; we will not do any point-pattern analysis at the level of individual points.
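As a quick sanity check on that accuracy figure, here is a small back-of-the-envelope calculation (my own sketch, not part of the original workflow) converting the distance from a section centroid to its nearest edge and to its corner into meters:

# distance from the centroid of a 1 mi x 1 mi section to its nearest edge and corner
half_width_mi    <- 0.5          # centroid to the middle of an edge
half_diagonal_mi <- sqrt(2) / 2  # centroid to a corner (~0.707 mi)
mi_to_m          <- 1609.344     # meters per mile

c(edge_m = half_width_mi * mi_to_m, corner_m = half_diagonal_mi * mi_to_m)
# roughly 805 m to the nearest edge and 1138 m to a corner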

clean_dat %>% 
  count(MethodofDeterminationLL) %>% 
  #filter(MethodofDeterminationLL %in% c(NA, "Derived from TRS", "Derived from Address", "Unknown")) %>% 
  ggplot(aes(reorder(MethodofDeterminationLL, n), n, fill = n)) + 
  geom_col() + 
  coord_flip() +
  scale_fill_viridis_c(guide = FALSE) +
  labs(title = "Count of Method of Determination for Lat/Lon",
       subtitle = "All Wells in the OSWCR Database",
       x = "Method",
       y = "Count")  +
  theme_bw(base_size = 20)

# accuracy
clean_dat %>% 
  count(LLAccuracy) %>% 
  #filter(MethodofDeterminationLL %in% c(NA, "Derived from TRS", "Derived from Address", "Unknown")) %>% 
  ggplot(aes(reorder(LLAccuracy, n), n, fill = n)) + 
  geom_col() + 
  coord_flip() +
  scale_fill_viridis_c(guide = FALSE) +
  labs(title = "Count of Accuracy for Lat/Lon",
       subtitle = "All Wells in the OSWCR Database",
       x = "Accuracy",
       y = "Count")  +
  theme_bw(base_size = 20)

Latitude and Longitude: Summary

After cleaning, all of our data nicely plots in California, which is what we expect, and we didn’t have to throw any data out. Some points plot on islands off the California coast; these will be omitted when the analysis is focused on groundwater basins.

# california basemap
states <- sf::st_as_sf(maps::map("state", plot = FALSE, fill = TRUE)) # state level polygon data
cabm <- states %>% filter(ID == "california") # subset to include only california
#geom_sf(data = cabm, alpha = 0, color = "black", size = .5) + # ca basemap
ggca <- map_data("state") %>% filter(region == "california")
# geom_polygon(data = ggca, 
#                aes(long, lat, group = group), 
#                alpha = 0, color = "black", size = .5)

# convert spatial pts to sf object
pts <- st_as_sf(clean_dat %>% filter(!is.na(lon) & !is.na(lat)),
                coords = c("lon","lat"), # for point data
                remove = F, # don't remove these lat/lon cols from df
                crs = 4326)
# define projection (crs)
pts %>% sf::st_transform(4326) -> pts_trans

pts %>% ggplot() + 
  geom_point(aes(lon, lat), size = 0.001, alpha = 0.2) + 
  geom_polygon(data = ggca, 
               aes(long, lat, group = group), 
               alpha = 0, color = "black", size = .5) +
  coord_fixed(1.3) + 
  labs(x = "Longitude", y = "Latitude") + 
  theme_bw(base_size = 20) 

  • 100% of non-missing latitude and longitudes now fall within California.
  • 5.3% of the data are missing lat/long values.
  • It’s possible to impute approximate latitude and longitude via other variables (see the sketch after this list).
  • Variables that should relate to lat/lon include: RegionOffice, CountyName, LocalPermitAgency, Township, Range, Section, City, PermitOffice, type, and TotalCompletedDepth.
  • Statistical imputation requires non-missing data to build an imputation model.
  • Positional accuracy is to the centroid of one-square-mile sections, i.e., within roughly 0.71 miles (~1138 meters) of the true location.
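A minimal sketch of such an imputation, using the median coordinates of wells that share a township, range and section (the object names trs_medians and imputed_dat are hypothetical and not used elsewhere in this analysis), might look like:

# median coordinates of located wells within each township/range/section
trs_medians <- clean_dat %>% 
  filter(!is.na(lat), !is.na(lon)) %>% 
  group_by(township, range, section) %>% 
  summarise(med_lat = median(lat), med_lon = median(lon)) %>% 
  ungroup()

# fill missing coordinates with the group medians, where a match exists
imputed_dat <- clean_dat %>% 
  left_join(trs_medians, by = c("township", "range", "section")) %>% 
  mutate(lat = ifelse(is.na(lat), med_lat, lat),
         lon = ifelse(is.na(lon), med_lon, lon)) %>% 
  select(-med_lat, -med_lon)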

Group Wells into Bulletin 118 Subbasins

We’re interested in groundwater management, so using our clean coordinates we will group wells into subbasins and focus all further analysis on this subset of the data.

# Bulletin 118 GW subbasins
s <- shapefile(here('shapefiles','I08_B118_CA_GroundwaterBasins.shp')) 
# convert to sf object
gwb <- st_as_sf(s,
       coords = c("lon", "lat"), 
       remove = F, # don't remove coords
       crs = 4326) 
# define projection
gwb %>% sf::st_transform(4326) -> gwb_trans

gwb_trans %>% ggplot() + 
  geom_sf() + 
  geom_sf(data = cabm, alpha = 0, color = "black", size = .5) + # ca basemap
  labs(title = "Bulletin 118 Groundwater Basins",
       x = "Longitude",
       y = "Latitude") + theme_bw(base_size = 20)

To aggregate our well points into Bulletin 118 polygons, we’ll perform an intersection. For the math to work, we’ve made sure that our polygons and points have been transformed into the same projection, or coordinate reference system (crs). About 300,000 wells fall outside of the Bulletin 118 subbasins and will be omitted from this analysis.

# perform intersection: join points to polygons, and filter points that fall outside of polygons
#sf::st_intersection(pts_trans,gwb_trans) -> gwb_pts

# remove polygon geometries and write pts to a csv because this operation takes a while
#gwb_pts %>% as.data.frame() %>% select(-geometry) %>% write_csv("R_intersection_gwb_pts.csv")
#gwb_pts %>% as.data.frame() %>% select(-geometry) %>% save(., file = "R_intersection_gwb_pts.Rdata")

Summary: Group Wells in Bulletin 118 Subbasins

Looks like our intersection worked. Compared to the entire well database, wells are now isolated to the Bulletin 118 groundwater subregions. In the process we trimmed our dataset of wells by about 1/3.

# gwb_pts <- read_csv("R_intersection_gwb_pts.csv")
# saveRDS(gwb_pts, "R_intersection_gwb_pts.Rds") # save as RDS for efficient loading
gwb_pts <- readRDS("R_intersection_gwb_pts.Rds") 
gwb_pts <- st_as_sf(gwb_pts,
                    coords = c("lon", "lat"), # for point data
                    remove = F, # don't remove these lat/lon cols from df
                    crs = 4326) 
# define projection
gwb_pts <- gwb_pts %>% sf::st_transform(4326) 

gwb_pts %>% 
  ggplot() + 
  geom_polygon(data = ggca, 
               aes(long, lat, group = group), 
               alpha = 0, color = "black", size = .5) +
  geom_point(aes(lon,lat), size = 0.001, alpha = 0.2) + 
  coord_fixed(1.3) + 
  labs(x = "Longitude", y = "Latitude") + 
  theme_bw(base_size = 20)

# save our bulletin 118 wells as a dataframe
gwb_df <- gwb_pts %>% as.data.frame()

Well Type Lemmatization

For each variable, duplicate entries that are encoded differently should be aggregated. A lemmatization of all well types will increase the utility of the data for statistical models and a deeper understanding of the system.

Here we will clean the variable PlannedUseFormerUse, which I’ve renamed type, for all data, including data outside of Bulletin 118 subbasins. This variable has 245 unique levels, but many of these have the same meaning. Take for instance the first 10 unique entries:

clean_dat$type %>% unique() %>% head(10)
##  [1] "Water Supply Irrigation - Agriculture" 
##  [2] "Water Supply Industrial"               
##  [3] "Water Supply Domestic"                 
##  [4] "Other  Unknown"                        
##  [5] "Other Unknown"                         
##  [6] NA                                      
##  [7] "Water Supply Irrigation - Agricultural"
##  [8] "Cathodic Protection"                   
##  [9] "Unknown"                               
## [10] "Other Not Specified"
  • 2 are agricultural
  • 5 are unknown
  • 1 is industrial
  • 1 is domestic
  • 1 is a cathodic protection well

This list can be grouped into many fewer categories than 245.

The most common well types provide insight into how to begin grouping wells. Here are the top 10 most common well types.

# view the ranked well types
clean_dat %>% count(type) %>% mutate(type = tolower(type)) %>% arrange(-n) %>% head(10) 
## # A tibble: 10 x 2
##    type                                        n
##    <chr>                                   <int>
##  1 water supply domestic                  355502
##  2 <NA>                                   200341
##  3 monitoring                             126173
##  4 other  unused                           61280
##  5 water supply irrigation - agriculture   36023
##  6 water supply irrigation - agricultural  31247
##  7 other unknown                           21365
##  8 water supply public                     14722
##  9 water supply irrigation agricultural    14631
## 10 test well                               12011

We can use words indicative of a group to index entries containing those words, and programmatically rename them. All categories containing more than 1,000 wells (~0.1% of the data) were preserved as unique categories.

# cleaning function
replace_words <- function(vec, detect_this, replace_with){
  if(!is.vector(vec)){
    stop("`vec` must be a vector of words to search within")
  }
  if(!is.vector(detect_this)){
    stop("`detect_this` must be a vector of words to search for")
  }
  
  my_list = list() # temporary list to fill
  
  # fill out lists of indices with words to detect
  for(i in 1:length(detect_this)){
    my_list[[i]] = str_which(vec, detect_this[i])
  }
  do.call(c, my_list) -> temp # combine lists into one vector (with duplicates, which is okay)
  vec[temp] <- replace_with # replace indices containing detected words with the replacement word

  # return vector with replaced values
  return(vec)
}

# get vector of well types
well_types <- clean_dat %>% .[["type"]] %>% tolower() 


# agriculture 
ag_words <- c("agriculture","agricultural","irrigation","iriigation","irrigation - landscape")
well_types <- replace_words(well_types, ag_words, "agriculture")

# domestic
dom_words <- "domestic"
well_types <- replace_words(well_types, dom_words, "domestic")

# Sparging - Vapor Extraction
remed_words <- c("vapor","sparging","sparge","oxygen","multi-phase","remediation")
well_types <- replace_words(well_types, remed_words, "remediation")

# public
public_words <- c("public","municipal","municipial","city") # humans spell things wrong 
well_types <- replace_words(well_types, public_words, "public")

# monitoring 
mon_words <- c("monitoring","observation","piezometer","oberservation") # humans spell things wrong 
well_types <- replace_words(well_types, mon_words, "monitoring")

# NA
na_words <- c("not specified","unknown","other na","other n/a")
well_types <- replace_words(well_types, na_words, "missing")
well_types <- replace_na(well_types, "missing")

# animal/stock
animal_words <- c("animal","stock","stockwater","dairy")
well_types <- replace_words(well_types, animal_words, "stock")

# unused
unused_words <- c("unused","destroyed","destruction")
well_types <- replace_words(well_types, unused_words, "unused")

# injection
injection_words <- "injection"
well_types <- replace_words(well_types, injection_words, "injection")

# industrial
ind_words <- c("industrial")
well_types <- replace_words(well_types, ind_words, "industrial")

# test well
test_words <- c("test")
well_types <- replace_words(well_types, test_words, "test")

# cathodic
cat_words <- c("cathodic")
well_types <- replace_words(well_types, cat_words, "cathodic")

# geothermal
geo_words <- c("geothermal")
well_types <- replace_words(well_types, geo_words, "geothermal")

At this point, after covering the top 13 well types, less than 1% of wells remain classified as some form of “other”. Let’s lump them into one category called “other”.

# less than 1% of remaining wells fall into miscellaneous 'other' categories
1 - data.frame(type = well_types) %>% count(type) %>% arrange(-n) %>% head(13) %>% `[[`("n") %>% sum() / nrow(clean_dat)
## [1] 0.003260308
# other
oth_words <- c("other", "dewatering", "water supply")
well_types <- replace_words(well_types, oth_words, "other")

# save these changes to our clean data frame
clean_dat <- clean_dat %>% mutate(type = well_types)

Knowing the date the well was completed also gives us valuable information. Let’s create those columns, and save the cleaned data.

# Permit date has too many missing values to be useful, and the WCR number which embeds a date within only covers 65% of wells. The Date work ended offers the highest number of observations covering ~80% of wells.
clean_dat %>% filter(!is.na(DateWorkEnded)) %>% nrow() / nrow(clean_dat)
## [1] 0.8051181
# separate DateWorkEnded into month, day and year
clean_dat <- separate(clean_dat, DateWorkEnded, into = c("month","day","year"))
#save(clean_dat, file = here("clean_dat.RData"))
#load(here("clean_dat.RData"))

Summary: Well Type Lemmatization

After grouping well types, we can look at wells in Bulletin 118 subbasins and see that most wells are categorized.

# replace gwb_pts, gwb_df well types with updated ones to avoid another intersection!

# df
gwb_df <- left_join(gwb_df, clean_dat %>% select(type, WCRNumber), by = "WCRNumber") %>% 
  select(-c(type.x, geometry)) %>% rename(type = type.y)

# sf pts
gwb_pts <- left_join(gwb_pts, clean_dat %>% select(type, WCRNumber), by = "WCRNumber") %>% 
  select(-c(type.x)) %>% rename(type = type.y)

#save
#gwb_pts %>% st_write("gwb_pts.csv", layer_options = "GEOMETRY=AS_WKT")
#x <- st_read("gwb_pts.csv", options = "GEOM_POSSIBLE_NAMES=WKT")

# plot from the df
gwb_df %>% filter(!is.na(bot)) %>% 
  count(type) %>% 
  ggplot(aes(x=fct_reorder(factor(type),n), y=n)) + 
  geom_col(fill = c(rep("grey50",9), "darkred", rep("grey50",3))) +
  coord_flip() + 
  labs(title = "Well Type Counts in Bulletin 118 Subbasins",
       y = "", 
       x = "number of wells") +
  theme_bw(base_size = 20)

Now we can save this clean data frame for future analysis.

# all of the clean data
clean_dat %>% write_rds(., here("clean_data","clean_dat_all.Rds"))

# bulletin 118 data
gwb_df %>% write_rds(., here("clean_data","clean_dat_118.Rds"))

A future analysis might impute missing well types (a rough sketch follows the lists below) based on:

  • permit agency
  • completed depth
  • casing diameter
  • drilling method
  • well yield

Outside datasets that would help in imputing these values include:

  • land cover
  • proximity to a city
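A rough sketch of what that might look like (hypothetical; the choice of a classification tree via rpart and this particular subset of predictors are mine, not part of the analysis above) is to train on wells with a known type and predict the wells whose type is missing:

library(rpart) # classification and regression trees

# train on wells with a known type; predict for wells whose type is "missing"
known   <- clean_dat %>% filter(type != "missing")
unknown <- clean_dat %>% filter(type == "missing")

type_tree <- rpart(factor(type) ~ TotalCompletedDepth + CasingDiameter + 
                     WellYield + DrillingMethod,
                   data = known, method = "class")

# most likely well type for each record with a missing type
imputed_type <- predict(type_tree, newdata = unknown, type = "class")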

It’s important to remember that the wells in this dataset only represent reported wells. The number of unreported wells absent from this database is plausibly in the tens to hundreds of thousands.


Areas Requiring Further Attention

Ground Surface Elevation

A simple join with a digital elevation model will add much value to this dataset for relatively little work, and will go some way toward resolving where exactly the top and bottom of the perforated interval rest. Any measurement in the z direction is subject to sources of uncertainty, including:

  • the effects of land subsidence on land surface elevation over time
  • differences in how well completion reports define their datum, e.g.:
    • from mean sea level
    • from ground surface elevation
    • from the top of the well head

For these reasons, in this report I’ll focus on the thickness of the perforated interval, which is less subject to these uncertainties because it’s simply the difference between the top and bottom of the screened interval.
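Returning to the DEM join mentioned above, a minimal sketch (assuming a DEM covering California in the same WGS84 lon/lat coordinate system; ca_dem.tif is a hypothetical file name) could be:

# sample a DEM at each well location to obtain ground surface elevation
dem <- raster("ca_dem.tif") # hypothetical DEM file; the raster package is loaded above

well_xy <- clean_dat %>% 
  filter(!is.na(lon), !is.na(lat)) %>% 
  select(lon, lat)

# elevation (in the DEM's units) at each well's coordinates
elev <- raster::extract(dem, as.matrix(well_xy))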

Imputation

Where missing values are present (e.g., PlannedUseFormerUse, well casing diameter, well yield, total drawdown), imputation, i.e., completing missing data via sensible value selection or statistical models, may offer a way to improve data coverage in areas that lack adequate information.
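For example, a “sensible value” imputation might fill a missing casing diameter with the median casing diameter of wells of the same type in the same subbasin (a sketch only; gwb_imputed is a hypothetical object not used elsewhere in this report):

# fill missing casing diameters with the subbasin- and type-specific median
gwb_imputed <- gwb_df %>% 
  group_by(Basin_Subb, type) %>% 
  mutate(CasingDiameter = ifelse(is.na(CasingDiameter),
                                 median(CasingDiameter, na.rm = TRUE),
                                 CasingDiameter)) %>% 
  ungroup()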


Answer Some Interesting Questions

Now that we’ve done some data cleaning, let’s ask some interesting questions of the Bulletin 118 wells from 1900 onward, and see what the data might tell us.

When do Californians Drill Wells?

We would expect that the number of wells constructed has grown over time to meet the water demands of a growing Californian population. It would also make sense that more wells are drilled during or after droughts (to see what this report considers a drought, please refer to the Appendix). In fact, the three driest years on California record occurred in 1977, 1991, and 2015, and the data show an increase in wells drilled during those years.

# California droughts defined by the USGS (see Appendix)
ca_droughts <- data.frame(xmin = c(1928, 1976, 1987, 2001, 2007, 2012),
                          xmax = c(1934, 1977, 1992, 2002, 2009, 2016),
                          ymin = 0, ymax = 20000, drought = "drought")

# summarise the data for plotting
wells_over_time <- gwb_df %>% 
  filter(!is.na(type)) %>% 
  count(year = as.numeric(year)) %>%
  mutate(year = as.numeric(year)) %>% 
  filter(year > 1900 & year < 2015) 

missing <- gwb_df %>% 
  filter(type == "missing") %>% 
  count(year = as.numeric(year)) %>% 
  mutate(year = as.numeric(year)) %>% 
  filter(year > 1857 & year < 2015) %>% mutate(color = "missing")
  
# plot
pts <- data.frame(x = c(1977,1991,2015), y = c(13500,16000,15000))
seg <- data.frame(x = c(1977,1991,2015), xend = c(1977,1991,2015), y = c(0,0,0), yend = c(13500,16000,15000))

ggplot(wells_over_time) +
  geom_line(mapping = aes(x = year, y = n), size = 1) +
  geom_rect(data = ca_droughts, 
            aes(xmin=xmin,xmax=xmax+1,ymin=ymin,
                ymax=ymax, fill = drought), alpha=0.2) +
  geom_segment(data = seg, mapping = aes(x= x, xend = xend, y=y, yend = yend), color = "red", linetype= "dashed") + 
  geom_point(data = pts, mapping = aes(x,y), color = "red", size = 3) +
  annotate("text", x = 1977, y = 14500, label = "1977", size = 7) +
  annotate("text", x = 1991, y = 17000, label = "1991", size = 7) +
  annotate("text", x = 2015, y = 16000, label = "2015", size = 7) +
  annotate("text", x = 1930, y = 14000, label = "The three driest years on California", size = 5) +
  annotate("text", x = 1930, y = 13000, label = "record are shown as red lines.", size = 5) +
  labs(y = "Count of Wells Completed", x = "Year", "Wells Completed in Bulletin 118 Subbasins") +
  ggthemes::theme_few(base_size = 20) + 
  theme(legend.title=element_blank()) + scale_color_manual(values = "#984ea3") -> p_1

p_1  

In 1977 and 1991, two of the three driest years on California record, we see spikes in the number of wells drilled. Following the 2001-2002 drought, there’s another spike in well construction from 2003-2005. Interestingly, the most recent 2012-2016 drought doesn’t show as marked a spike in well construction as previous events did.

Diffenbaugh, Swain, and Touma (2015) show that the simulated probability of dry years in California has been increasing, beginning around ~1980. Interestingly, the number of wells drilled per year hasn’t fallen below the levels seen around 1980.


What types of wells have been completed?

What are the major well types that aren’t missing? Defined by counts alone, those are domestic, agricultural, monitoring and unused wells.

  • Domestic and agricultural well completion reports spike in notably dry years.
  • More domestic wells are drilled around each dry year than agricultural wells.
  • A wave of monitoring wells was completed in the late 80s and early 90s, corresponding with the USGS NAWQA program and similar initiatives by the California Department of Public Health.
  • The number of unused wells is rising. These wells, if left unsealed, may provide vertical pathways for contaminant transport.
# consider only the top 4, plus missing values
gwb_df %>% count(type) %>% arrange(-n) %>% `[[`("type") %>% head(5) -> top_5
top_4 <- top_5[-1]

library(scales)
library(colormap)

pts <- data.frame(x = c(1977,1991,2015), y = c(8000,7000,6000))
seg <- data.frame(x = c(1977,1991,2015), xend = c(1977,1991,2015), y = c(0,0,0), yend = c(8000,7000,6000))


gwb_df %>% 
  mutate(year = as.numeric(year)) %>% 
  filter(year > 1900 & year < 2015 &
         type %in% top_4) %>% 
  group_by(type, year) %>% 
  summarise(count = n()) %>% ungroup() %>% 
  ggplot() +
  geom_rect(data = ca_droughts, 
            aes(xmin=xmin,xmax=xmax+1,ymin=ymin,
                ymax=ymax-11000, fill = drought), alpha=0.2) +
  geom_line(aes(x = year, y = count, color = type), size = 1) +
  #geom_line(data = wells_over_time, mapping = aes(x = as.numeric(year), y= n), size = 1, inherit.aes = FALSE) +
  geom_segment(data = seg, mapping = aes(x= x, xend = xend, y=y, yend = yend), color = "red", linetype = "dashed") + 
  geom_point(data = pts, mapping = aes(x,y), color = "red", size = 3) +
  annotate("text", x = 1977, y = 8500, label = "1977", size = 7) +
  annotate("text", x = 1991, y = 7500, label = "1991", size = 7) +
  annotate("text", x = 2015, y = 6500, label = "2015", size = 7) +
  labs(y = "Count of Wells Completed", x = "Year",title = "Major Wells Completed in Bulletin 118 Subbasins") +
  ggthemes::theme_few(base_size = 20) +
  theme(legend.title=element_text("well type")) + 
  scale_fill_manual(values = "#F8766D", name = "") +
  scale_color_manual(breaks = c("domestic","agriculture","monitoring","unused"), 
                     values = c("#4daf4a","#377eb8","#984ea3","#ff7f00"),
                     name = "Well Type") -> p_2

p_2  

What about minor well types? Defined again by count, these include public, injection, and remediation wells.

  • Public wells tend to be completed at deeper depths and with longer screened intervals. The consumptive demand of the public is also far less than that of agriculture, so a few deep wells take care of this demand.
  • The relatively new technology of injection wells has picked up, but the data doesn’t suggest that it’s taking off.
  • Remediation wells have steadily increased in number, suggesting an increase of documented groundwater quality contamination issues and attempts to remediate such sites.
# consider the minor well types: public, injection, and remediation
other <- c("public","injection", "remediation")

library(scales)
library(colormap)

pts <- data.frame(x = c(1977,1991,2015), y = c(950,950,950))
seg <- data.frame(x = c(1977,1991,2015), xend = c(1977,1991,2015), y = c(0,0,0), yend = c(950,950,950))


gwb_df %>% 
  mutate(year = as.numeric(year)) %>% 
  filter(year > 1857 & year < 2015 &
         type %in% other) %>% 
  group_by(type, year) %>% 
  summarise(count = n()) %>% ungroup() %>% 
  ggplot() +
  geom_rect(data = ca_droughts, 
            aes(xmin=xmin,xmax=xmax+1,ymin=ymin,
                ymax=ymax-19000, fill = drought), alpha=0.2) +
  geom_line(aes(x = year, y = count, color = factor(type)), size = 1) +
  geom_segment(data = seg, mapping = aes(x= x, xend = xend, y=y, yend = yend), color = "red", linetype = "dashed") + 
  geom_point(data = pts, mapping = aes(x,y), color = "red", size = 3) +
  annotate("text", x = 1977, y = 1000, label = "1977", size = 7) +
  annotate("text", x = 1991, y = 1000, label = "1991", size = 7) +
  annotate("text", x = 2015, y = 1000, label = "2015", size = 7) +
  labs(y = "Count of Wells Completed", x = "Year", title = "Minor Wells Completed in Bulletin 118 Subbasins") +
  ggthemes::theme_few(base_size = 20) +
  scale_fill_manual(values = "#F8766D", name = "") +
  scale_color_manual(breaks = c("public","injection","remediation"), 
                     values = c("#4daf4a","#377eb8","#984ea3"),
                     name = "Well Type") -> p_5

p_5  


Where were the domestic and agricultural wells in the 1977 and 1991 droughts drilled?

Let’s first take a look at agricultural wells. Almost all of these subbasins are in the southern Central Valley’s Tulare Basin.

# get top 15 subbasins, in terms of number of domestic and agricultural wells drilled in both 1977 and 1991
gwb_df %>% filter(year %in% c(1977,1991) & type %in% c("agriculture","domestic")) %>% 
  group_by(Basin_Subb, year, type) %>% 
  summarise(sum_type = n()) %>% 
  group_by(Basin_Subb) %>% 
  summarise(wt = sum(sum_type)) %>% 
  arrange(-wt) %>%  top_n(15, wt = wt) -> top_15
  
# filter all contending subbasins by those in the top 15
gwb_df %>% filter(year %in% c(1977,1991) & type == "agriculture") %>% 
  group_by(Basin_Subb, year, type) %>% 
  summarise(sum_type = n()) %>% 
  filter(Basin_Subb %in% top_15$Basin_Subb) %>% 
  ggplot() +
  geom_col(aes(x = fct_reorder(factor(Basin_Subb),sum_type), y = sum_type, fill = Basin_Subb)) +
  facet_grid(~ as.numeric(year)) + coord_flip() +
  labs(y = "", x = "Count of Wells Completed", 
       title = "Agricultural wells completed in 1977 and 1991", 
       subtitle = "The subbasins with highest well completion reporting") +
  scale_fill_manual(values = c(rep("grey50", 11),"darkred", rep("grey50",3))) + 
  theme(legend.position = "none") -> p_6

p_6  

And now we look at domestic wells. Again we see that, with the exception of one subbasin, more wells are completed in southern Central Valley subbasins than in any other region.

gwb_df %>% filter(year %in% c(1977,1991) & type == "domestic") %>% 
  group_by(Basin_Subb, year, type) %>% 
  summarise(sum_type = n()) %>% 
  filter(Basin_Subb %in% top_15$Basin_Subb) %>% 
  ggplot() +
  geom_col(aes(x = fct_reorder(factor(Basin_Subb),sum_type), y = sum_type, fill = Basin_Subb)) +
  facet_grid(~ as.numeric(year)) + coord_flip(ylim = c(0,1150)) +
  labs(y = "", x = "Count Wells Completed", 
       title = "Domestic wells completed in 1977 and 1991", 
       subtitle = "The Subbasins with Highest Well Completion Reporting") +
  scale_fill_manual(values = c(rep("grey50", 11),"darkred", rep("grey50",3))) + 
  theme(legend.position = "none") -> p_7

p_7  + theme_bw(base_size = 20)

Taking a look at the map, we see that most of these wells were drilled in California’s Central Valley, with a very large number drilled in the southern Tulare Basin.

# join Bulletin 118 polygons with the top_15 basins (by wells drilled 1977 & 1991)
left_join(gwb_trans, 
          top_15) %>% 
  mutate(area_sq_km = st_area(gwb_trans)*1E-6,
         dens = as.numeric(wt/area_sq_km)) %>% # numeric overwrites the "units" class
  ggplot() + 
  geom_sf(alpha = 0.3) + 
  geom_sf(aes(fill = dens), lwd = 0.2) +
  geom_sf(data = cabm, alpha = 0, color = "black", size = .5) + # ca basemap
  scale_fill_continuous(low="thistle2", high="darkred", 
                        guide="colorbar",na.value="white",
                        breaks=c(seq(0,1,.25)), # discrete scale
                        labels=c(seq(0,1,.25)),
                        limits=c(0,1)) + 
  labs(title = "Subbasins with the Most Wells Completed", 
       subtitle = "During the 1977 and 1991 droughts",
       fill = expression(Density~(wells~'/'~km^{2})),
       x = "Longitude",
       y = "Latitude") -> p_8

p_8  


Are domestic wells drilled in socioeconomically disadvantaged areas?

More domestic wells are completed than any other type of well. Domestic wells tend to be screened closer to land surface, and run the risk of drying up before deeper agricultural and public supply wells do, as we saw in the most recent drought.

QUESTION: Is the household income (from the US Census) in the areas with the highest domestic well completion rates lower than California’s median household income?

We will use 2016 tract-level US Census data to explore this relationship, and calculate the percent difference from California’s median household income for each tract. From census.gov:

Census tracts are small, relatively permanent statistical subdivisions of a county or county equivalent and generally have a population size between 1,200 and 8,000 people, with an optimum size of 4,000 people.

Percent difference is calculated as:

\[PD_{i} = \frac{a_{i} - b}{\frac{1}{2}(a_{i} + b)} \cdot 100\]

Where \(a_{i}\) is the household income of tract \(i\), and \(b\) is the median household income for California. The median household income is used as the benchmark for percent difference because the distribution of incomes is right-skewed by a small handful of high-income zones, shown on the first tab.

Lastly, we will calculate the weighted percent difference per tract, summed over all census tracts, to answer the question above: is the household income (from the US Census) in the areas with the highest domestic well completion rates lower than California’s median household income? We use tracts rather than counties or another spatial discretization because, by definition, tracts have similar population sizes.

Tracts are weighted by the number of wells completed within them. The weighted sum over all \(n\) tracts is:

\[\sum_{i=1}^{n}{w_{i} \space p_{i}}\]

where \(w_{i}\) is the number of wells in tract \(i\), and \(p_{i}\) is the percent difference from the CA median household income of that tract. The more wells in a tract, the greater its weight. If this number is very negative, that will suggest that domestic wells are primarily completed in areas with lower median annual income, but it won’t tell us much about the relative difference in the areas affected.
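Expressed in code, the two quantities above look roughly like this (a sketch on a hypothetical per-tract data frame tracts with columns hh_income and n_wells; the actual calculation in this report is carried out on preprocessed objects shown later):

b <- 57792 # median CA household income (2016 ACS), the benchmark

tracts_pd <- tracts %>% 
  mutate(perc_diff  = (hh_income - b) / (0.5 * (hh_income + b)) * 100, # PD_i
         weighted_p = n_wells * perc_diff)                             # w_i * p_i

# weighted percent difference, summed over all tracts
sum(tracts_pd$weighted_p, na.rm = TRUE)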

Load packages, download data, and do some light cleaning.

library(tidycensus)
library(leaflet)
library(sf)
library(viridis) # for colorblind friendly color palettes
library(tidyverse)
library(feather) # for fast reading/writing of dataframes
library(scales) # for dollar scale

# load preprocessed data for faster knitting speed
gwb_domestic_tracts <- read_feather("gwb_domestic_tracts.feather")
perc_diff_lab <- readRDS("perc_diff_lab.rds")
med_hhi_lab <- readRDS("med_hhi_lab.rds")

# input census api key and store it for later use
#census_api_key("", install = TRUE, overwrite = TRUE)
readRenviron("~/.Renviron")
options(tigris_use_cache = TRUE)

# get county level household income for CA
ca1 <- get_acs(geography = "county", 
               variables = c(hhincome = "B19013_001"), 
               state = "CA", 
               year = 2016, 
               geometry = TRUE) %>%
  st_transform(4326)

# tract level household income for CA
ca2 <- get_acs(geography = "tract", 
               variables = c(hhincome = "B19013_001"), 
               state = "CA", 
               year = 2016, 
               geometry = TRUE) %>%
  st_transform(4326)

# remove unnecessary columns and rename variables
ca2 %>% 
  select(NAME, estimate, moe, geometry) %>% 
  rename(name = NAME, hh_income = estimate) -> ca_2

Let’s look at all US Census tracts with domestic well completion reports.

Colors correspond to the difference from the median Californian household income ($57,792): red is lower, and blue is higher.

x <- median(gwb_domestic_tracts$hh_income, na.rm = T)

gwb_domestic_tracts %>% filter(hh_income <= x) -> low_ses

gwb_domestic_tracts %>% 
  ggplot(aes(x = hh_income)) + 
  stat_density() -> p

p <- ggplot_build(p)

p$data %>% as.data.frame() %>% 
  select(x, density) %>% 
  mutate(ses = ifelse(x <= 57792, "low SES", "high SES")) -> p

p %>% 
  ggplot() +
  geom_line(aes(x, density), color= "black", size = 1) + 
  geom_area(aes(x, density, fill = ses)) + 
  geom_segment(aes(x = 57792, xend = 57792, 
                   y = 0, yend = 1.95e-05), lty = "dashed", color = "black", size = 1) + 
  geom_point(aes(x = 57792, y = 2.0e-05), size = 3) + 
  annotate("text", x = 110000, y = 2.1e-05, label = "median CA household income", size = 4) + 
  labs(title = "Household income in California", 
       subtitle = "2016 US Census: tract-level data",
       y = "Density", x = "Household Income (USD)",
       fill = "") + 
  scale_x_continuous(labels = dollar) + 
  scale_fill_manual(breaks = c("low SES", "high SES"),
                    labels = c("low SES", "high SES"),
                    values = c("#92c5de","#f4a582")) + 
  theme(plot.title = element_text(hjust = 0), 
        legend.position = c(.75, .55)) +
  theme_bw(base_size = 20)

This interactive map has two layers. The first shows only tracts with domestic well completion reports; colors correspond to the difference from the median Californian household income ($57,792), where red is lower and blue is higher. Centers of high income (blue) tend to cluster in urban and suburban zones. The second layer shows the median household income for each tract; lighter colors correspond to more income and align with urban centers.

# make into leaflet 
# `temp_cut` holds the tract polygons joined with well counts, household income,
# and percent difference from the CA median (assembled in preprocessing, not shown)

# diverging red-blue palette; assumed here to match the legend colors used below
co <- c("#b2182b", "#d6604d", "#f4a582", "#fddbc7", "#d1e5f0", "#92c5de", "#4393c3", "#2166ac")

# palettes
pal_1 = colorBin(palette = co,
                 domain = temp_cut$perc_diff, bins = c(-100,-75,-50,-25,0,25,50,75,100),
                 na.color = "white")

pal_2 <- colorBin("viridis", 
                  temp_cut$hh_income.x, 
                  bins = c(0,25000,50000,75000,100000,250001))

library(htmltools)
library(htmlwidgets)

temp_cut %>% 
  leaflet(width = "100%") %>% 
  addTiles() %>% 
  addPolygons(stroke = FALSE, smoothFactor = 0.2, 
              color = ~pal_1(perc_diff), 
              #label = ~as.character(paste0(round(perc_diff,2)," %")), 
              label = lapply(perc_diff_lab, htmltools::HTML),
              fillOpacity = 0.8,
              group = "% Diff from Med_HH Income") %>% 
  addPolygons(stroke = FALSE, smoothFactor = 0.2, 
              color = ~pal_2(hh_income.x), 
              #label = ~as.character(paste0("$ ",hh_income.x)), 
              label = lapply(med_hhi_lab, htmltools::HTML),
              fillOpacity = 0.8,
              group = "Med_HH Income") %>% 
  addMarkers(data = gwb_domestic_tracts,
             lng = ~lon, lat = ~lat, 
             popup = paste("Well ID:", gwb_domestic_tracts$WCRNumber, "<br>",
                           "Longitude:", gwb_domestic_tracts$lon, "<br>",
                           "Latitude:", gwb_domestic_tracts$lat, "<br>",
                           # "Perforated Interval Thickness:", gwb_domestic_tracts$b, "ft.", "<br>",
                           "Top of Perforated Interval:", gwb_domestic_tracts$top, "ft.", "<br>",
                           "Bottom of Perforated Interval:", gwb_domestic_tracts$bot, "ft."),
             group = "wells", clusterOptions = markerClusterOptions()) %>% 
  addProviderTiles(providers$CartoDB.Positron) %>% 
  addLayersControl(overlayGroups = c("% Diff from Med_HH Income", "Med_HH Income", "wells"), 
                   position = "topleft") %>%
  addLegend("topright", 
            colors = rev(c("#b2182b", "#d6604d", "#f4a582", "#fddbc7", "#d1e5f0", "#92c5de", "#4393c3", "#2166ac")),
            labels = c("75 to 100%", "50 to 75%", "25 to 50%", "0 to 25%", "0 to -25%", "-25 to -50%", "-50 to -75%", "-75 to -100%"),
            title = "% diff from median hh_income", # legend displays time frame and depth class as title
            labFormat = labelFormat(suffix = " %"),
            opacity = 1) %>% 
  hideGroup("Med_HH Income") %>% 
  addMiniMap(position = "bottomleft") %>% 
  addEasyButton(easyButton(
    icon="fa-globe", title="Zoom to CA",
    onClick=JS("function(btn, map){ map.setZoom(6); }"))) %>%
  addEasyButton(easyButton(
    icon="fa-crosshairs", title="Locate Me",
    onClick=JS("function(btn, map){ map.locate({setView: true}); }"))) -> socio_leaflet
  
# htmlwidgets::saveWidget(socio_leaflet, file = "socio_leaflet.html", selfcontained = FALSE, title = "Socioeconomic disparity in California domestic well completion")

socio_leaflet

Finally, we calculate the weighted percent difference from CA median household income, summed over all tracts, which results in a positive number.

\[\sum_{i=1}^{n}{w_{i} \space p_{i}}\]

This means that more domestic wells are drilled in census tracts above the CA median household income than in tracts below it. But how big is the difference? We can compute a weighted percent difference sum for tracts above and below the median, normalize them, and compare.

# `test` holds per-tract well counts and percent difference from the CA median
# (assembled in preprocessing, not shown)
test %>% 
  mutate(weighted_p = count*perc_diff) %>% 
  pull(weighted_p) %>% 
  sum(na.rm=TRUE)
## [1] 184243.7
test %>% 
  mutate(weighted_p = count*perc_diff,
         pos = ifelse(weighted_p > 0, weighted_p, NA ),
         neg = ifelse(weighted_p < 0, weighted_p, NA )) -> pos_neg

# plot
data.frame(type = c("above","below"),
           # normalize the weighted sums by the greater value so it scales to 1
           value = c(pos_neg %>% pull(pos) %>% sum(., na.rm = TRUE) / pos_neg %>% pull(pos) %>% sum(., na.rm = TRUE), 
                     pos_neg %>% pull(neg) %>% sum(., na.rm = TRUE) %>% abs()/ pos_neg %>% pull(pos) %>% sum(., na.rm = TRUE))) %>% 
  ggplot(aes(type, value)) + 
  geom_col(fill = c("#92c5de","#f4a582")) +
  #coord_cartesian(ylim = c(0,1.15))
  scale_y_continuous(limits = c(0,1), 
                     breaks=c(0,.2, .4, .6, .8, 1)) +
  labs(title = "Normalized Weighted Sum of Percent Difference",
       subtitle = "comparing census tracts above and below CA median household income",
       x = "",
       y = expression(Sigma~(w[i]~p[i])))  

That the relative weighted sums of percent difference are within 10% of each other suggests that, while slightly more domestic wells are drilled in wealthier tracts, tracts above and below the California median household income are almost equally affected by the need to drill wells.

Summary

What can we learn from exploring this map?

  • Droughts and water shortage affect both high-SES and low-SES tracts.

  • However, low-SES tracts will be less financially equipped to deal with groundwater shortages and the costs of completing a new well.

  • Low-SES tracts tend to be larger in area and geographically isolated from urban centers. When water shortages occur, these tracts are less likely to be within range of a public water system.

  • Clusters of wells indicate that the lat/lon method of determination groups some wells in the center of counties.
    • This makes it difficult to know with confidence the discrepancy in well completion counts per area of interest.
  • Not all census zones have well completion reports.


Where have monitoring wells been completed?

A quick examination of the map reveals large clusters of monitoring wells in the urban zones of San Diego, Los Angeles, the Bay Area, Sacramento, and to a lesser extent, the Central Valley.

gwb_pts %>% filter(type == "monitoring") %>% 
  ggplot() + 
  geom_point(aes(lon,lat), alpha = 0.2, size = .01) + 
  geom_polygon(data = ggca, 
               aes(long, lat, group = group), 
               alpha = 0, color = "black", size = .5) +
  coord_fixed(1.3) + 
  labs(title = "Monitoring Wells Completed in California",
       subtitle = "Period of Record ~1900-2015",
       x = "Longitude",
       y = "Latitude") + 
  theme_bw(base_size = 20)


Where have Injection Wells been completed?

Injection wells are a relatively new technology and have not achieved broad use. From a map of counts, it appears that most injection well completion has taken place in Los Angeles and the South Bay, with a few clusters in the northern Sacramento region. Injection wells are both expensive and require surplus water to bank, which is probably why some adoption has occurred in the water-rich northern Sacramento region.

gwb_trans %>% 
  ggplot() + geom_sf() +  
  geom_sf(data = cabm, alpha = 0, color = "black", size = .5) + # ca basemap
  geom_point(data = gwb_pts %>% filter(type == "injection"), 
             aes(lon,lat), alpha = 0.2, size = .2, color = "red") +  
  labs(title = "Injection Wells Completed in California",
       subtitle = "Period of Record ~1900-2015",
       x = "Longitude",
       y = "Latitude") +
  theme_bw(base_size = 20)


Is there a seasonality to well drilling?

Are more wells drilled in the summer months when the weather favors working outside, when the sunlight hours are longer, and when groundwater is in greater demand?

# note that months indicate the date work ENDED. Assuming that most projects finish within the month they begin, this graph doesn't need to change.
clean_dat %>% mutate(month_abv = lubridate::month(as.numeric(month), label = TRUE)) %>% 
  count(month_abv) %>% 
  filter(!is.na(month_abv)) %>% 
  ggplot(aes(month_abv, n)) + 
  geom_col(fill = c(rep("grey50",5), rep("darkred", 4), rep("grey50",3))) +
  labs(title = "Wells Drilled in California by Month",
       subtitle = "Period of Record ~1900-2015",
       y = "Count of Wells Drilled", x = "")  

In fact, compared to winter months, there’s around a 25-50% increase in drilling during the summer months. This isn’t a surprising result, but it’s reassuring to see that the data support the hypothesis.


What companies capture the largest market share of well completion jobs?

This is a rather straightforward question, and easy to answer given that almost all companies are listed.

# some cleaning of duplicate entries is needed to visualize the top 20 drillers in the state
clean_dat$driller_name[clean_dat$driller_name == "Cascade Drilling Cascade Drilling"] <- "CASCADE"
clean_dat$driller_name[clean_dat$driller_name == "CASCADE CASCADE"] <- "CASCADE"
clean_dat$driller_name[clean_dat$driller_name == "HEW DRILLING COMPANY HEW DRILLING COMPANY"] <- "HEW DRILLING COMPANY"
clean_dat$driller_name[clean_dat$driller_name == "WEEKS DRILLING AND PUMP C WEEKS DRILLING AND PUMP CO."] <- "WEEKS DRILLING AND PUMP CO"
clean_dat$driller_name[clean_dat$driller_name == "JOHNSON DRILLING CO JOHNSON DRILLING CO"] <- "JOHNSON DRILLING CO"
clean_dat$driller_name[clean_dat$driller_name == "BC2 Environmental Corpora BC2 Environmental Corporation"] <- "BC2 ENVIRONMENTAL"
clean_dat$driller_name[clean_dat$driller_name == "West Hazmat Drilling West Hazmat Drilling"] <- "WEST HAZMAT DRILLING CORP."   
clean_dat$driller_name[clean_dat$driller_name == "WATER DEVELOPMENT CORP. WATER DEVELOPMENT CORP."] <- "WATER DEVELOPMENT CORP." 
clean_dat$driller_name[clean_dat$driller_name == "ALL TERRAIN EXPLORATION D ALL TERRAIN EXPLORATION DRILL"] <- "ALL TERRAIN EXPLORATION DRILLING"
clean_dat$driller_name[clean_dat$driller_name == "EXPLORATION GEOSERVICES,  EXPLORATION GEOSERVICES, INC."] <- "EXPLORATION GEOSERVICES, INC."
clean_dat$driller_name[clean_dat$driller_name == "DIAMOND WELL DRILLING COM DIAMOND WELL DRILLING COMPANY"] <- "DIAMOND WELL DRILLING COMPANY"
clean_dat$driller_name[clean_dat$driller_name == "FISCH BROTHERS DRILLING,  FISCH BROTHERS DRILLING, INC."] <- "FISCH BROTHERS DRILLING, INC."
clean_dat$driller_name[clean_dat$driller_name == "PETERS' DRILLING & PUMP C PETERS' DRILLING & PUMP C0.INC"] <- "PETERS' DRILLING & PUMP CO."
clean_dat$driller_name[clean_dat$driller_name == "GREGG DRILLING & TESTING, GREGG DRILLING & TESTING, INC."] <- "GREGG DRILLING & TESTING INC."
clean_dat$driller_name[clean_dat$driller_name == "BC2 Environmental BC2 Environmental"] <- "BC2 ENVIRONMENTAL"
clean_dat$driller_name[clean_dat$driller_name == "WEST HAZMAT DRILLING CORP WEST HAZMAT DRILLING CORP."] <- "WEST HAZMAT DRILLING CORP."
clean_dat$driller_name[clean_dat$driller_name == "TANKO, GARY TANKO, GARY"] <- "TANKO, GARY"
clean_dat$driller_name[clean_dat$driller_name == "WOODWARD DRILLING COMPANY WOODWARD DRILLING COMPANY"] <- "WOODWARD DRILLING COMPANY"
clean_dat$driller_name[clean_dat$driller_name == "EATON DRILLING COMPANY EATON DRILLING COMPANY"] <- "EATON DRILLING COMPANY"
clean_dat$driller_name[clean_dat$driller_name == "WDC EXPLORATION & WELLS WDC EXPLORATION & WELLS"] <- "WDC EXPLORATION & WELLS"
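
The block above works, but it is verbose. A more compact sketch of the same cleanup uses a named lookup vector with dplyr::recode(); only the first few duplicate-to-canonical pairs are repeated here, and the rest would follow the same pattern:

# sketch: the same cleanup with a named lookup vector instead of repeated assignments
driller_lookup <- c(
  "Cascade Drilling Cascade Drilling" = "CASCADE",
  "CASCADE CASCADE" = "CASCADE",
  "HEW DRILLING COMPANY HEW DRILLING COMPANY" = "HEW DRILLING COMPANY"
  # ... remaining duplicate-to-canonical pairs as above
)

clean_dat <- clean_dat %>% 
  mutate(driller_name = recode(driller_name, !!!driller_lookup))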


# plot
clean_dat %>% 
  filter(!is.na(driller_name)) %>% 
  count(driller_name) %>% 
  top_n(20, wt = n) %>% 
  ggplot() + 
  geom_col(aes(fct_reorder(driller_name,n), n, fill = n)) + coord_flip() + 
  labs(title ="Count of Wells Completed by Driller",
       subtitle = "period of record ~1900-2015",
       x = "", y= "Number of Wells Completed") +
  scale_fill_viridis_c(guide = FALSE) + theme_bw(base_size = 18)


Which Subbasins are most Data Dense?

Data density is defined here as the number of well completion reports per unit surface area of a subbasin. This metric doesn't discriminate between high and low quality data; it's simply a measure of abundance. We define the data density (\(\rho_{i}\)) of subbasin \(i\) as the number of well completion reports (\(n_{i}\)) per square kilometer of basin area (\(A_{i}\)). In other words,

\[\rho_{i} = \frac{n_{i}}{A_{i}}\]
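
For example, a 1,000 km\(^2\) subbasin containing 5,000 well completion reports would have a data density of \(\rho = 5\) wells/km\(^2\).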

We see that most subbasins have fewer than 10 well completion reports per square kilometer. A few small subbasins scattered throughout the state have no well completion reports, and therefore a data density of 0. It seems highly unlikely that these subbasins truly contain no wells, so these zeros may be due to errors in latitude and longitude determination.

# calculate the area of each Bulletin 118 subbasin
areas <- data.frame(Basin_Subb = gwb$Basin_Subb,
                    area = st_area(gwb))

# get counts of wells per subbasin
cts <- gwb_df %>% count(Basin_Subb)

# assemble df
data_density <- 
  left_join(areas, cts, by = "Basin_Subb") %>% # combine counts with areas df
  mutate(density = n/(as.numeric(area) / 1000 / 1000)) # compute data density (wells/km^2)

# make into sf dataframe 
data_density <- left_join(gwb_trans,
                          data_density, 
                          by = "Basin_Subb")

# cut continuous density into categories
data_density$density_bin <- cut(data_density$density, c(0,10,20,30,40,50),
                                labels = c("1-10","11-20","21-30","31-40","41-50"))

# plot
data_density %>% 
  mutate(density_bin = ifelse(is.na(density_bin), "< 1", as.character(density_bin))) %>% # subbasins with no reports (NA density) get the "< 1" bin
  ggplot() + 
  geom_sf(alpha = 0.3, color = NA) + 
  geom_sf(aes(fill = density_bin), alpha = 0.8, color = NA) +
  geom_sf(data = cabm, alpha = 0, color = "black", size = .5) + # ca basemap
  scale_fill_viridis_d() + # discrete scale
  labs(title = "Data Density per Subbasin", 
       subtitle = "period of record ~1900-2015",
       x = "Longitude",
       y = "Latitude",
       fill = expression(rho~(wells~'/'~km^{2}))) +
  theme_bw(base_size = 20)


What can this data tell us about hydrogeologic parameters of interest?

In particular, this data can tell us about:

  • aquifer transmissivity
  • perforated interval thickness
  • top and bottom of the screened interval

Aquifer Transmissivity

The last three parameters are relatively straightforward, as they are included as columns in the data. Aquifer transmissivity can be predicted from empirical relationships between known aquifer transmissivities and specific capacity. Specific capacity can be calculated from pumping test parameters; specifically, it is defined as the ratio of outflow to drawdown, or

\[S_{c} = \frac{Q}{s}\]

Razack and Huntley (1991) studied the relationship between transmissivity (\(T\)) and specific capacity in a large heterogeneous alluvial groundwater basin in Morocco. The study included 215 wells for which \(S_{c}\) and \(T\) were known, and defined an empirical relationship between these parameters, with a correlation coefficient of 0.63. The general form of the equation is given as

\[T = K(S_c)^{0.67}\]

where the constant \(K = 106\) for \(S_{c}\) in units of [\(\frac{gpm}{ft}\)] and \(T\) in units of [\(\frac{m^2}{d}\)].
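
For example, under this relationship a well yielding 500 gpm with 50 ft of drawdown has \(S_{c} = 10\) gpm/ft, giving \(T = 106 \times 10^{0.67} \approx 496\) \(\frac{m^2}{d}\).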

As we saw in the first figure, not all wells have pump test data. We will estimate \(S_{c}\) and \(T\) only for the wells where pumping test data is present.

trans_df <- gwb_pts %>% # start with the sf points in Bulletin 118 subbasins
  filter(!is.na(WellYield) & # get records with well yield
         !is.na(TotalDrawDown) & # get records with drawdown
         TotalDrawDown > 2 & # keep reasonable drawdowns
         TotalDrawDown < 1000) %>% # remove the few high-drawdown outliers that are clearly entry errors
  # pull(WellYieldUnitofMeasure) %>% unique() # all GPM. miners inches drop out
  mutate(specific_capacity = WellYield / TotalDrawDown, # calculate specific capacity
         transmissivity = 106 * (specific_capacity)^0.67 ) # calculate transmissivity
  
# plot
trans_df %>% 
  ggplot(aes(transmissivity)) + 
  labs(title = "Aquifer Transmissivity in Bulletin 118 Subbains",
       subtitle = "Emperically Derived from Pumping Test Data",
       y = "Count",
       x= expression(Transmissivity~(m^{2}~'/'~day))) -> p_transmissivity

p_transmissivity + 
  geom_histogram(bins = 300, color = "white", fill = colormap(colormaps$viridis, nshades = 300)) +
  coord_cartesian(xlim = c(0,2000)) + # zoom into the bulk of the data
  theme_bw(base_size = 20) 

Transmissivities appear to follow a log-normal distribution, which is not uncommon for environmental data. After log-transforming the x axis, the transmissivities take on a roughly normal to bimodal shape. This could be because the empirical relationship used to compute transmissivities is based on data from an alluvial aquifer, and may not hold for fractured volcanic rock systems, which are abundant in northern Californian subbasins. We could also be seeing a shallow and a deep aquifer in the data.

p_transmissivity + 
  geom_histogram(bins = 100, color = "white", fill = colormap(colormaps$viridis, nshades = 100)) +
  scale_x_log10() + # plot on a log 10 scale
  labs(title = "Log Aquifer Transmissivity in Bulletin 118 Subbains",
       x = expression(log (Transmissivity) ~(m^{2}~'/'~day))) +
  theme_bw(base_size = 20)

How are these data spatially distributed? It appears that well completion reports with pumping test data have decent spatial coverage in the northern Central Valley, isolated northern fractured rock subbasins, the Delta, the North Bay basins, the Santa Barbara region, and the eastern side of the Tulare Basin.

# map
gwb_trans %>% 
  ggplot() + geom_sf(lwd = 0.01) + 
  geom_point(data = trans_df, 
             aes(lon,lat), alpha = 0.2, size = .1, color = "red") + 
  geom_polygon(data = ggca, 
               aes(long, lat, group = group), 
               alpha = 0, color = "black", size = .5) +
  labs(title = "Wells with Pumping Test Data ",
       subtitle = "Period of Record ~1900-2015",
       x = "Longitude",
       y = "Latitude") +
  theme_bw(base_size = 20)

A future analysis should join these data to a map of aquifer geology and apply an appropriate empirical formula dependent on the aquifer rock type. This calculation is outside of the scope of this analysis, but it should be noted that Mace (1997) defined an empirical relationship (with a correlation coefficient of 0.891) for determining the transmissivity of a fractured rock aquifer, albeit across a set of karstic aquifers in Texas, as

\[T = 0.76(S_c)^{1.08}\]

where \(T\) and \(S_{c}\) are both in units of [\(\frac{m^2}{d}\)].
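
If the subbasin geology were known, the appropriate relationship could be chosen per well. A minimal sketch under that assumption: is_fractured_rock is a hypothetical flag (it would come from joining trans_df to a geologic map, and is not in the data as-is), and specific capacity is converted from gpm/ft to m²/d (1 gpm/ft ≈ 17.9 m²/d) before applying the Mace (1997) equation:

# sketch: choose the empirical T ~ Sc relationship by aquifer type
# is_fractured_rock is a hypothetical flag from a geologic map join
gpm_ft_to_m2_d <- 17.9 # 1 gpm/ft of specific capacity ≈ 17.9 m^2/d

trans_df_geo <- trans_df %>% 
  mutate(transmissivity = ifelse(
    is_fractured_rock,
    0.76 * (specific_capacity * gpm_ft_to_m2_d)^1.08, # Mace (1997), Sc in m^2/d
    106  * (specific_capacity)^0.67                   # Razack & Huntley (1991), Sc in gpm/ft
  ))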

Aquifer property data like transmissivity can be of great value to developing groundwater sustainability agencies tasked with parameterizing groundwater flow and transport models, as well as to regulatory bodies assessing the reasonableness of such models. It is important to remember that these transmissivities assume alluvial material, that there is unexplained variance in the model used by Razack and Huntley (1991), and that there is sample bias in that only one aquifer was sampled. The study by Mace (1997) sampled many aquifers, which partially explains its higher correlation coefficient compared to Razack and Huntley (1991). Nonetheless, to view big-picture trends, we can aggregate points at the subbasin level and plot the median transmissivity per subbasin. A few outliers with exceptionally high \(T\) distort the scale, and they are not plotted so that the main trends in transmissivity can be seen.

# calculate median transmissivity per subbasin
trans_df %>% group_by(Basin_Subb) %>% summarise(med_t = median(transmissivity, na.rm = TRUE)) %>% st_set_geometry(NULL) -> gwb_med_t 

#gwb_med_t %>% ggplot(aes(med_t)) + geom_histogram() + coord_cartesian(ylim = c(0,10))

# join back into gwb
temp <- left_join(gwb, gwb_med_t, by="Basin_Subb")

# map
temp %>% 
  filter(med_t <= 1500 | is.na(med_t)) %>% # remove a few outliers that warp the scale, but keep NA values
  ggplot() +
  geom_sf(alpha = 0.3, color = NA) + 
  geom_sf(aes(fill = med_t), alpha = 0.7, color = NA) +
  geom_sf(data = cabm, alpha = 0, color = "black", size = .5) + # ca basemap
  scale_fill_viridis_c(breaks=c(500,1000,1500), # continuous scale with fixed breaks
                       labels=c(500,1000,1500),
                       limits=c(0,1500)) + 
  labs(title = "Aquifer Transmissivity in Bulletin 118 Subbasins", 
       subtitle = "Empirically Derived from Pumping Test Data",
       x = "Longitude",
       y = "Latitude",
       fill = expression(T~(m^{2}~'/'~day))) +
  theme_bw(base_size = 20)

A next step might be to use the ranges of perforated interval thickness over which these transmissivities were derived to compute upscaled hydraulic conductivities via the relationship:

\[K = \frac{T}{b}\]
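
A minimal sketch of that step, assuming the top and bot columns used later in this analysis are also carried on trans_df so that the perforated interval thickness can serve as \(b\):

# sketch: upscaled hydraulic conductivity K = T / b, with b the perforated
# interval thickness (ft) converted to meters to match T in m^2/day
k_df <- trans_df %>% 
  filter(!is.na(top), !is.na(bot), bot > top) %>% # keep wells with a valid screened interval
  mutate(b_m   = (bot - top) * 0.3048,            # thickness in meters
         K_m_d = transmissivity / b_m)            # hydraulic conductivity in m/day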

Moreover, it might be useful to create a basin-by-basin report on the available aquifer properties to assist modelers in creating groundwater flow models.


Total Drill Depth

How deep are different well types drilled?

# get a df of median values to join with gwb_pts, so factor levels can be reordered by median
gwb_pts %>% 
  filter(!is.na(TotalDrillDepth) &
         TotalDrillDepth < 1500) %>% 
  group_by(type) %>% 
  summarise(median = median(TotalDrillDepth)) %>% 
  arrange(-median) -> type_df

# remove geometry for join
st_geometry(type_df) <- NULL

# set the minimum height at which to cut off tails
mh = 0.005

# rearrange to fix plotting order, and resolve median value ties
med_vec <- c("agriculture","public","industrial","cathodic","test well","domestic","missing","stock","injection","remediation","unused","other","monitoring")
  
# define color palette
cols <- colormap(colormap = colormaps$viridis, nshades = 13)
cols <- structure(cols, names = med_vec)

# create the plot
#library(ggridges)
gwb_pts %>% 
  as.data.frame() %>% 
  filter(!is.na(TotalDrillDepth) & # remove NA
         TotalDrillDepth < 1500) %>% # remove outliers so the bulk of the data can be seen
  left_join(type_df, by ="type") %>% # join to get median values
  ggplot() +
  geom_boxplot(
    aes(#x = type,
      x = fct_reorder(
        type, TotalDrillDepth), 
      y = TotalDrillDepth, 
      fill = type), 
    outlier.colour=NA, 
    alpha = .7
  ) + 
  #stat_density_ridges(geom = "density_ridges_gradient", 
                      #calc_ecdf = TRUE,
                      #quantiles = 4,
                      #quantile_lines = TRUE,
                      #rel_min_height = mh,
                      #scale = .9) + #, alpha = 0, scale = .9, rel_min_height = mh) +
  #scale_fill_viridis(discrete = TRUE, name = "Well Type") + 
  scale_fill_manual(values = cols, 
                    breaks = med_vec,
                    name = "Well Type") +
  #coord_cartesian(xlim = c(0,1500)) +
  labs(title = "Total Drill Depth by Well Type",
       subtitle = "B118 Subbasins, Period of Record ~1900-2015",
       y = "Total Drill Depth (ft.)",
       x = "Well Type") +
  theme_bw(base_size = 19) + 
  coord_flip() +
  guides(fill = FALSE)

Perforated Interval: Thickness, Top and Bottom

How much of the aquifer is screened? Where is it screened relative to the land surface? Production wells are designed to be screened in aquifer materials: the highly conductive sands and gravels that facilitate fluid flow. In contrast, monitoring wells may be screened across the entire casing length. Agricultural wells, which have less stringent water quality requirements, may be screened over greater lengths than public supply wells, which are more tightly regulated. Knowing how much of an aquifer is screened gives us an idea of how heavily the aquifer is being utilized.

# reasonable values for top and bottom of perforated interval
gwb_df %>% filter(!is.na(top) & !is.na(bot)) %>% 
  filter(top > 0 & bot > 0) %>% 
  mutate(b = bot - top) %>% filter(b > 0 & b < 2500) -> gwb_df_r

# plot
# set up breaks and fill
gwb_df_r %>% group_by(type) %>% summarise(med = median(b)) %>% arrange(-med) %>% .[["type"]] -> med_vec

# rearrange to fix plotting order, and resolve median value ties
med_vec <- c("cathodic","public","agriculture","industrial","stock","domestic","missing","other","test well","unused","monitoring","remediation","injection")
  
# define color palette
cols <- colormap(colormap = colormaps$viridis, nshades = 13)
cols <- structure(cols, names = med_vec)

gwb_df_r %>% 
  ggplot() +
  geom_boxplot(aes(x=fct_reorder(type, b, .fun = median), y = b, fill = type), 
               outlier.colour=NA, alpha = .7) + # hide outliers
  #scale_y_log10() +
  coord_flip() + 
  scale_fill_manual(values = cols, 
                    breaks = med_vec,
                    name = "Well Type") + 
  labs(title = "Perforated Interval Thickness by Well Type",
       subtitle = "B118 Subbasins, Period of Record ~1900-2015",
       x = "", 
       y = "Perforated Interval Thickness (ft.)") + 
  theme(axis.text.y=element_blank(),
        axis.ticks.y=element_blank()) -> p_4

# Compute boxplot statistics and plot
sts <- boxplot.stats(gwb_df_r$b)$stats
p_4 + coord_flip(ylim = c(0, 300)) + theme_bw(base_size = 20) + guides(fill = FALSE)

Domestic wells are also generally shallower than public and agricultural wells. This makes them more susceptible to droughts, to top-down pollutants like LNAPLs and DNAPLs, and to non-point source pollutants like salts and nitrates.

# plot
# set up breaks and fill
gwb_df_r %>% group_by(type) %>% summarise(med = median(top)) %>% arrange(-med) %>% .[["type"]] -> med_vec

# define color palette
cols <- colormap(colormap = colormaps$viridis, nshades = 13)
cols <- structure(cols, names = med_vec)

gwb_df_r %>% 
  ggplot() +
  geom_boxplot(aes(x=fct_reorder(type, top, .fun = median), y=top, fill = type), 
               outlier.colour=NA, alpha = .7) + # hide outliers
  #scale_y_log10() +
  coord_flip() + 
  scale_fill_manual(values = cols, 
                    breaks = med_vec,
                    name = "Well Type") + 
  labs(title = "Depth to Top of Perforated Interval by Well Type",
       subtitle = "B118 Subbasins, Period of record ~1900-2015",
       x = "", 
       y = "Depth to Top of Perforated Interval (ft.)") + 
  theme(axis.text.y=element_blank(),
        axis.ticks.y=element_blank()) -> p_3

# Compute boxplot statistics and plot
sts <- boxplot.stats(gwb_df_r$top)$stats
p_3 + coord_flip(ylim = c(0, 400)) + theme_bw(base_size = 20) + guides(fill = FALSE) #max(sts)*1.05)) 

Agricultural Wells: Perforated Interval Thickness

Here we calculate the mean and median perforated interval thickness, as well as the top and bottom of the perforated interval, for domestic and agricultural wells in Bulletin 118 subbasins. These data can also be visualized interactively at this blog post.

#  Group by basin and subbasin (Basin_Subb), calculate mean, median, sample size.  

# subset agricultural wells (perforated interval thickness b was computed above in gwb_df_r)
gwb_pts2 <- gwb_df_r %>% 
  filter(type == "agriculture")

# thickness of perforated interval mean and median
gwb_pts2 %>% group_by(Basin_Subb) %>% summarise(mean_pit = mean(b, na.rm = TRUE)) -> gwb_mean_pit
gwb_pts2 %>% group_by(Basin_Subb) %>% summarise(median_pit = median(b, na.rm = TRUE)) -> gwb_median_pit
# top of perforated interval mean and median
gwb_pts2 %>% group_by(Basin_Subb) %>% summarise(mean_top = mean(top, na.rm = TRUE)) -> gwb_mean_top
gwb_pts2 %>% group_by(Basin_Subb) %>% summarise(median_top = median(top, na.rm = TRUE)) -> gwb_median_top
# bottom of perforated interval mean and median
gwb_pts2 %>% group_by(Basin_Subb) %>% summarise(mean_bot = mean(bot, na.rm = TRUE)) -> gwb_mean_bot
gwb_pts2 %>% group_by(Basin_Subb) %>% summarise(median_bot = median(bot, na.rm = TRUE)) -> gwb_median_bot


# sample size: number of wells
gwb_pts2 %>% count(Basin_Subb) -> gwb_n_wells

# join back into polygons df
gwb_pts2 <- gwb %>% 
  left_join(gwb_mean_pit, by="Basin_Subb") %>%
  left_join(gwb_median_pit, by = "Basin_Subb") %>% 
  left_join(gwb_mean_top, by="Basin_Subb") %>%
  left_join(gwb_median_top, by = "Basin_Subb") %>%
  left_join(gwb_mean_bot, by="Basin_Subb") %>%
  left_join(gwb_median_bot, by = "Basin_Subb") %>%
  left_join(gwb_n_wells, by = "Basin_Subb")

# One last transformation. put into right crs
gwb_pts2 %>% st_transform(4326) -> gwb_pts2

# create a complete subbasin name
gwb_pts2$Subbasin_N[is.na(gwb_pts2$Subbasin_N)] <- "" # convert NA to "" for pasting
gwb_pts2$complete_subbasin_name <- paste(gwb_pts2$Basin_Name, gwb_pts2$Subbasin_N) # complete basin-subbasin names
gwb_pts2 %>% 
  ggplot() +
  geom_sf(alpha = 0.3, color = NA) + 
  geom_sf(aes(fill = median_pit), alpha = 0.7, color = NA) +
  geom_sf(data = cabm, alpha = 0, color = "black", size = .5) + # ca basemap
  scale_fill_viridis_c(breaks=c(0,300,600), # continuous scale with fixed breaks
                       labels=c(0,300,600),
                       limits=c(0,600),
                       option = "magma") + # specific color palette
  labs(title = "Agricultural Wells: Median Perforated Interval Thickness", 
       subtitle = "Derived from Well Completion Reports",
       x = "Longitude",
       y = "Latitude",
       fill = "thickness (ft.)") + theme_bw(base_size = 20)

Agricultural Wells: Top of Perforated Interval

gwb_pts2 %>% 
  ggplot() +
  geom_sf(alpha = 0.3, color = NA) + 
  geom_sf(aes(fill = median_top), alpha = 0.7, color = NA) +
  geom_sf(data = cabm, alpha = 0, color = "black", size = .5) + # ca basemap
  scale_fill_viridis_c(breaks=c(0,500,1000), # continuous scale with fixed breaks
                       labels=c(0,500,1000),
                       limits=c(0,1000)) + 
  labs(title = "Agricultural Wells: Median Top of Perforated Interval", 
       subtitle = "Derived from Well Completion Reports",
       x = "Longitude",
       y = "Latitude",
       fill = "depth (ft.)") + theme_bw(base_size = 20)

Agricultural Wells: Bottom of Perforated Interval

gwb_pts2 %>% 
  ggplot() +
  geom_sf(alpha = 0.3, color = NA) + 
  geom_sf(aes(fill = median_bot), alpha = 0.7, color = NA) +
  geom_sf(data = cabm, alpha = 0, color = "black", size = .5) + # ca basemap
  scale_fill_viridis_c(breaks=c(0,600,1200), # continuous scale with fixed breaks
                       labels=c(0,600,1200),
                       limits=c(0,1200), option = "inferno") + 
  labs(title = "Agricultural Wells: Median Bottom of Perforated Interval", 
       subtitle = "Derived from Well Completion Reports",
       x = "Longitude",
       y = "Latitude",
       fill = "depth (ft.)") + theme_bw(base_size = 20)


Domestic Wells: Perforated Interval Thickness

#  Group by basin and subbasin (Basin_Subb), calculate mean, median, sample size.  

# subset domestic wells (perforated interval thickness b was computed above in gwb_df_r)
gwb_pts3 <- gwb_df_r %>% 
  filter(type == "domestic")

# thickness of perforated interval mean and median
gwb_pts3 %>% group_by(Basin_Subb) %>% summarise(mean_pit = mean(b, na.rm = TRUE)) -> gwb_mean_pit
gwb_pts3 %>% group_by(Basin_Subb) %>% summarise(median_pit = median(b, na.rm = TRUE)) -> gwb_median_pit
# top of perforated interval mean and median
gwb_pts3 %>% group_by(Basin_Subb) %>% summarise(mean_top = mean(top, na.rm = TRUE)) -> gwb_mean_top
gwb_pts3 %>% group_by(Basin_Subb) %>% summarise(median_top = median(top, na.rm = TRUE)) -> gwb_median_top
# bottom of perforated interval mean and median
gwb_pts3 %>% group_by(Basin_Subb) %>% summarise(mean_bot = mean(bot, na.rm = TRUE)) -> gwb_mean_bot
gwb_pts3 %>% group_by(Basin_Subb) %>% summarise(median_bot = median(bot, na.rm = TRUE)) -> gwb_median_bot


# sample size: number of wells
gwb_pts3 %>% count(Basin_Subb) -> gwb_n_wells

# join back into polygons df
gwb_pts3 <- gwb %>% 
  left_join(gwb_mean_pit, by="Basin_Subb") %>%
  left_join(gwb_median_pit, by = "Basin_Subb") %>% 
  left_join(gwb_mean_top, by="Basin_Subb") %>%
  left_join(gwb_median_top, by = "Basin_Subb") %>%
  left_join(gwb_mean_bot, by="Basin_Subb") %>%
  left_join(gwb_median_bot, by = "Basin_Subb") %>%
  left_join(gwb_n_wells, by = "Basin_Subb")

# One last transformation. put into right crs
gwb_pts3 %>% st_transform(4326) -> gwb_pts3

# create a complete subbasin name
gwb_pts3$Subbasin_N[is.na(gwb_pts3$Subbasin_N)] <- "" # convert NA to "" for pasting
gwb_pts3$complete_subbasin_name <- paste(gwb_pts3$Basin_Name, gwb_pts3$Subbasin_N) # complete basin-subbasin names
gwb_pts3 %>% 
  ggplot() +
  geom_sf(alpha = 0.3, color = NA) + 
  geom_sf(aes(fill = median_pit), alpha = 0.7, color = NA) +
  geom_sf(data = cabm, alpha = 0, color = "black", size = .5) + # ca basemap
  scale_fill_viridis_c(breaks=c(0,200,400), 
                       labels=c(0,200,400),
                       limits=c(0,400),
                       option = "magma") + # specific color palette
  labs(title = "Domestic Wells: Median Perforated Interval Thickness", 
       subtitle = "Derived from Pumping Test Data",
       x = "Longitude",
       y = "Latitude",
       fill = "thickness (ft.)") + theme_bw(base_size = 20)

Domestic Wells: Top of Perforated Interval

gwb_pts3 %>% 
  ggplot() +
  geom_sf(alpha = 0.3, color = NA) + 
  geom_sf(aes(fill = median_top), alpha = 0.7, color = NA) +
  geom_sf(data = cabm, alpha = 0, color = "black", size = .5) + # ca basemap
  scale_fill_viridis_c(breaks=c(0,300,600), 
                       labels=c(0,300,600),
                       limits=c(0,600)) + 
  labs(title = "Domestic Wells: Median Top of Perforated Interval", 
       subtitle = "Derived from Pumping Test Data",
       x = "Longitude",
       y = "Latitude",
       fill = "depth (ft.)") + theme_bw(base_size = 20)

Domestic Wells: Bottom of Perforated Interval

gwb_pts3 %>% 
  ggplot() +
  geom_sf(alpha = 0.3, color = NA) + 
  geom_sf(aes(fill = median_bot), alpha = 0.7, color = NA) +
  geom_sf(data = cabm, alpha = 0, color = "black", size = .5) + # ca basemap
  scale_fill_viridis_c(breaks=c(0,400,800), 
                       labels=c(0,400,800),
                       limits=c(0,800), option = "inferno") + 
  labs(title = "Domestic Wells: Median Bottom of Perforated Interval", 
       subtitle = "Derived from Pumping Test Data",
       x = "Longitude",
       y = "Latitude",
       fill = "depth (ft.)") + theme_bw(base_size = 20)



Static Water Level

Lastly, we can use the static water level in domestic, agricultural, and public supply wells to visualize the change from the first recorded measurements. There is more of a negative change in domestic (shallow) wells compared to agricultural and public (deep) wells, which is not surprising. It also appears that in some groundwater basins, static water level has risen. This occurs primarily in small basins with little data to begin with, and may reflect a nuance in the data caused by a sensitivity to low starting values.
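
The map itself is not reproduced here, but a minimal sketch of the underlying calculation, assuming hypothetical StaticWaterLevel and DateWorkEnded columns on gwb_df (the column names in the raw data may differ):

# sketch: change in static water level from the first to the most recent record,
# per subbasin and well type (the sign convention depends on how SWL is reported)
swl_change <- gwb_df %>% 
  filter(!is.na(StaticWaterLevel), !is.na(DateWorkEnded),
         type %in% c("domestic", "agriculture", "public")) %>% 
  arrange(DateWorkEnded) %>% # oldest records first
  group_by(Basin_Subb, type) %>% 
  summarise(swl_change = last(StaticWaterLevel) - first(StaticWaterLevel))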


Appendix

How are “droughts” defined?

From the United States Geological Survey:

…droughts in California can be classified in four ways:

  • Meteorological drought is a period of one, or more water years, of below-normal precipitation;
  • Hydrological drought is a period of one, or more water years, in which there is below-normal availability of surface water and groundwater;
  • Agricultural drought is a period of one, or more water years, in which water available for agricultural production is curtailed by 25% or more; and
  • Ecological drought is a period of one, or more water years, during which deficits in natural water availability create multiple stressors across ecosystems.

Since 1895, there have been six prolonged dry periods lasting two years or longer, which qualify as droughts under all of the above drought classifications. They are: water years (WY) 1928-34, WY 1976-77, WY 1987-92, WY 2001-02, WY 2007-09 and WY 2012-16.

References

California Online State Well Completion Reports

Diffenbaugh, Noah S., Daniel L. Swain, and Danielle Touma. 2015. “Anthropogenic warming has increased drought risk in California.” Proceedings of the National Academy of Sciences 112 (13): 3931–6. doi:10.1073/pnas.1422385112.

Mace, Robert. 1997. “Determination of Transmissivity from Specific Capacity Tests in a Karst Aquifer.” Groundwater 35 (5).

Razack, M, and David Huntley. 1991. “Assessing Transmissivity from Specific Capacity in a Large and Heterogeneous Alluvial Aquifer.” Groundwater 29 (6).