Data Patching

Data Patching and Substitution for the Regional Haze Rule                                   Scott Copeland – 06/20/2020

Introduction

Two principal approaches are used to fill in missing data in the IMPROVE database.  The first follows the algorithm in the U.S. Environmental Protection Agency’s 2003 Guidance for Tracking Progress Under the Regional Haze Rule [1].  This technique is routinely applied to all sites’ data in the IMPROVE network.  It uses a statistical approach, described in detail below, to analyze historical data from a site to fill in gaps at that site under certain limited circumstances.  The guidance refers to this process as “substitution”, but it is routinely called “patching”.

The second approach uses either collocated, similar measurements (e.g., hydrogen mass measured by PESA to infer OM) or scaled data from a nearby site after a suitable correlation has been demonstrated.  These two variants are collectively referred to as “substitution” and are used to replace large amounts of missing data for certain sites and certain years.  The substitution analysis itself is done on an ad hoc basis, most often by multi-jurisdictional organizations (MJOs), and is not part of the routine processing performed by IMPROVE.  Substituted data, when available, fill in values that are still missing after the patching process.

Steps for Patching Data

  1. Gather all mass data for the seven aerosol species in the revised (RHR2) IMPROVE light extinction algorithm for a site. Include data for the target year being considered and the four previous years (or as many as are available if fewer than four).
  2. Set negative values to “0” and make sure units are consistent.  For data collected prior to 2011, set XRF or PIXE values below the MDL to MDL/2.
  3. Calculate the median concentration of each aerosol species for each quarter for the target year and previous four years.
    1. Only consider quarters with at least 50% valid observations and fewer than 10 consecutive invalid samples. This rule is applied to each species separately; e.g., a given quarter could be usable for sulfate but not soil.
  4. For each species and each calendar quarter, calculate the mean of the quarterly medians (up to five) from the target year and four previous years.
    1. These means for each quarter become the candidate patch values to be tested.
  5. For up to five years being considered, select all days with all valid species. Calculate reconstructed light extinction, including Rayleigh scattering, using the revised IMPROVE light extinction algorithm.  Replace a single species’ actual mass measurement with its candidate patch value and recalculate the day’s reconstructed light extinction using the patch value.
  6. Determine the number of times the difference between the original light extinction and the recalculated extinction based on the candidate patch value is less than 10% of the original reconstructed extinction.
  7. If the difference calculated for a species in step 6 is less than 10% for 90% or more of the tested sample days, then patching is allowed for that species for all four quarters of the target year.
    1. The candidate patch values are applied quarterly.
  8. Patch values replace any missing occurrence of that species, not just sample days with all valid species. When patching is allowed, missing values of a species are replaced with the allowed quarterly patch value.
    1. After patching, set the “_subbed” flag for that species to “1” (“0” means a valid measurement, and “2” means substitution, not patching, was performed).
  9. The data can now be used to determine final data completeness and haziest day RHR metrics as well as impairment-based RHR metrics.
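
The core of steps 2 through 7 can be sketched in Python as follows.  This is a minimal illustration under stated assumptions, not the code used in IMPROVE processing: all function names and data layouts are hypothetical, missing samples are represented by None, and the revised IMPROVE light extinction algorithm is not reproduced here.  Instead, the extinction calculation is passed in as a function, and toy_extinction is a made-up placeholder used only to exercise the test.

```python
from statistics import mean, median


def clean(value):
    """Step 2: clamp negative masses to 0; None marks a missing sample."""
    return None if value is None else max(value, 0.0)


def quarter_is_usable(values):
    """Step 3.1: at least 50% valid and fewer than 10 consecutive invalid samples."""
    if not values:
        return False
    longest_gap = gap = 0
    for v in values:
        gap = gap + 1 if v is None else 0
        longest_gap = max(longest_gap, gap)
    valid_fraction = sum(v is not None for v in values) / len(values)
    return valid_fraction >= 0.5 and longest_gap < 10


def candidate_patch_value(quarter_by_year):
    """Steps 3-4: mean of the usable quarterly medians for one species.

    quarter_by_year holds one list of samples per year (target year plus up
    to four previous years), all for the same calendar quarter.
    """
    medians = []
    for samples in quarter_by_year:
        samples = [clean(v) for v in samples]
        if quarter_is_usable(samples):
            medians.append(median(v for v in samples if v is not None))
    return mean(medians) if medians else None


def patching_allowed(complete_days, species, patch_by_quarter, extinction):
    """Steps 5-7: test one species' candidate patch values.

    complete_days is a list of (quarter, {species: mass}) pairs for days on
    which every species is valid; extinction maps such a dict to a
    reconstructed light extinction value.
    """
    tested = passed = 0
    for quarter, masses in complete_days:
        patch = patch_by_quarter.get(quarter)
        if patch is None:
            # Assumption: days in a quarter without a candidate value are skipped.
            continue
        original = extinction(masses)
        recalculated = extinction({**masses, species: patch})
        tested += 1
        if abs(recalculated - original) < 0.10 * original:
            passed += 1
    return tested > 0 and passed / tested >= 0.90


def toy_extinction(masses):
    """Placeholder only: made-up coefficients, NOT the IMPROVE algorithm."""
    return 10.0 + 3.0 * masses["sulfate"] + 1.0 * masses["soil"]
```

Passing the extinction calculation in as an argument keeps the sketch independent of any particular implementation of the light extinction algorithm; a real implementation would supply the revised IMPROVE algorithm, including Rayleigh scattering, in place of toy_extinction.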

Extent of Patching Changed in Late 2019

Prior to 12/2019, patching was applied to a maximum of one missing species for any sample day.  Data set versions generated beginning in 12/2019 allow up to two missing species per day to be patched.  This potentially changes the data at every site for the entire data record, including natural conditions, 2064 endpoints, and RHR2 and impairment metrics.  The changes are generally very small, though there are cases where whole sample years that previously failed the RHR completeness requirement now meet it.  All versions of the data files are date stamped.  More details are available in this PowerPoint.
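
The per-day limit can be illustrated with a short Python sketch.  The function name and data layout are hypothetical, and the sketch assumes one particular reading of the rule: a day with more missing species than the limit is left entirely unpatched.  The text above does not spell out that case, so treat it as an assumption rather than the documented behavior.

```python
def patch_day(masses, patch_values, max_patched_species=2):
    """Fill missing species on one sample day, subject to a per-day limit.

    masses: {species: mass or None}, where None marks a missing value.
    patch_values: {species: allowed quarterly patch value} for the day's quarter.
    Returns the patched masses and a {"<species>_subbed": flag} dict, where
    0 = valid measurement, 1 = patched, None = still missing.
    """
    missing = [s for s, v in masses.items() if v is None]
    patched = dict(masses)
    flags = {f"{s}_subbed": (None if v is None else 0) for s, v in masses.items()}
    # Assumption: days with more missing species than the limit are not patched.
    if len(missing) <= max_patched_species:
        for s in missing:
            if s in patch_values:  # patching must be allowed for this species
                patched[s] = patch_values[s]
                flags[f"{s}_subbed"] = 1
    return patched, flags
```

With max_patched_species=1 the function mimics the behavior before 12/2019, and with the default of 2 it mimics the later data set versions, which is why a day missing two species can become usable only under the newer rule.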

Results

Patching and substitution are routinely performed on IMPROVE data that are used to generate regional haze metrics reported through FED (http://views.cira.colostate.edu/fed/) and IMPROVE (https://vista.cira.colostate.edu/Improve/).  Although patching affects a fairly small subset of all observations, it allows some site-years to meet completeness requirements that they otherwise would not.  Patched values can be identified by a “1” in the “_subbed” flag column, and substituted values by a “2”.  It is recommended that patched and substituted values be removed before comparisons with model output or before applying analytical techniques such as PMF, which could be biased by them.
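
Removing patched and substituted values before such an analysis might look like the following Python sketch.  The row layout and the column naming pattern (species name plus “_subbed”) are assumptions for illustration, as is the use of integer flag values; the actual data files may store the flags differently.

```python
def measured_only(rows, species):
    """Return one value per row, keeping only genuine measurements.

    rows: list of dicts with a mass column named after the species and a
    flag column named "<species>_subbed" (0 = measured, 1 = patched,
    2 = substituted). Patched and substituted values become None.
    """
    flag_column = f"{species}_subbed"
    return [row[species] if row[flag_column] == 0 else None for row in rows]
```

For example, three rows with sulfate flags of 0, 1, and 2 would yield the first value followed by two None entries, so downstream statistics see only measured data.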

[1] U.S. Environmental Protection Agency. Guidance for Tracking Progress Under the Regional Haze Rule. EPA-454/B-03-004, Washington, DC, September 2003.