Data Blending and Interpolation

The workflow allows multiple datasets to be combined and then interpolated over a grid, which can then be plotted and also output as GIS files or GMT-compatible data files. The entry point for the workflow is moho_workflow, which takes a JSON config file as its only argument.

Configuration

The workflow is controlled by a JSON config file. An example config is shown below:

  {
      "methods":
      [
          {
              "name": "ccp_1",
              "data": "sandbox/ccp_conversion/ccp_sample1.csv",
              "val_name": "Depth",
              "sw_name": "Weight",
              "weight": 1.0,
              "scale_length": 0.2
          },
          {
              "name": "ccp_2",
              "data": "sandbox/ccp_conversion/ccp_sample2.csv",
              "val_name": "Depth",
              "sw_name": "Weight",
              "weight": 1.0,
              "scale_length": 0.5
          }
      ],
      "plotting":
      {
          "output_plot": true,
          "plot_parameters":
          {
              "scale": [10.0, 60.0],
              "format": "png",
              "show": true,
              "title": "Moho depth from blended data",
              "cb_label": "Moho depth (km)"
          },
          "output_gmt": true,
          "output_gis": true
      },
      "bounds": [130.0, -20.0, 140.0, -17.0],
      "grid_interval": 0.25,
      "output_directory":  "./output"
  }
  • methods: A list of inversion/stacking/etc. methods, defining the data and parameters for blending and gridding. Each method is stored as a dictionary.
    • name: The name of the method. Required, and must be unique.
    • data: The data file. See the 'File Format' section below for more information.
    • val_name: Name of the column containing the value to be blended and plotted.
    • sw_name: Name of the column to be used as per-sample weighting. Optional; if not provided, all sample weights default to 1.
    • weight: The overall weight of the method. This is multiplied by each sample weight to produce a total relative weight for each sample. See [1] for more information about weighting.
    • scale_length: The scale length variable for the spatial spread algorithm, in decimal degrees. See [1] for more information about the spatial spread algorithm. A conceptual sketch of how weight and scale_length interact follows this list.
  • plotting: A dictionary containing plotting parameters. Optional; if not included, no plots or map data will be produced.
    • output_plot: A boolean flag. If true, a plot of the grid and vector map will be produced and saved as plot.{format} in the output directory.
    • plot_parameters: Parameters used for generating the plot. All of these parameters are optional.
      • scale: The minimum and maximum values defining the scale of the colormap, used when plotting the grid. If not provided, the values are derived from the minimum and maximum of the grid data.
      • format: Output format of the plot; must be compatible with matplotlib. Defaults to 'png'.
      • show: A boolean flag. Whether or not to display the plot as part of the workflow. If true, the plot is displayed after generation, which is useful for verification and debugging.
      • title: The title of the plot. Defaults to 'Moho depth from blended data'.
      • cb_label: Label for the grid colorbar. Defaults to 'Moho depth (km)'.
    • output_gmt: A boolean flag. Whether or not to produce GMT-compatible data files. See the 'GMT Data' section below for more information.
    • output_gis: A boolean flag. Whether or not to produce GIS data. See the 'GIS Data' section below for more information.
  • bounds: A bounding box of format [min lon, min lat, max lon, max lat]. This defines the extent of the grid to be interpolated and the extent of plots and maps produced. If not provided, the extent is derived from the minimum and maximum bounds of the aggregate datasets.
  • grid_interval: Required. The grid interval in decimal degrees; e.g. 0.25 interpolates the data to a grid with cell size 0.25 x 0.25 degrees.
  • output_directory: Directory to contain the output. If not provided, the current working directory is used.
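To make the interaction of weight, sw_name sample weights and scale_length concrete, below is a rough conceptual sketch of Gaussian-weighted gridding in Python. It is illustrative only, not the workflow's actual implementation; the function name, sample layout and choice of Gaussian kernel are assumptions for this example.

import numpy as np

def blend_to_grid(samples, grid_lons, grid_lats):
    # samples: (lon, lat, value, sample_weight, method_weight, scale_length)
    # tuples. Conceptual sketch only -- not the workflow's implementation.
    grid = np.full((len(grid_lats), len(grid_lons)), np.nan)
    for i, glat in enumerate(grid_lats):
        for j, glon in enumerate(grid_lons):
            num = den = 0.0
            for lon, lat, val, sw, mw, sl in samples:
                # Distance in decimal degrees, the same units as scale_length
                dist2 = (lon - glon) ** 2 + (lat - glat) ** 2
                # Total relative weight = method weight * sample weight,
                # attenuated with distance from the grid node
                w = mw * sw * np.exp(-dist2 / (2.0 * sl ** 2))
                num += w * val
                den += w
            if den > 0.0:
                grid[i, j] = num / den
    return grid

# Hypothetical usage over the example bounds with grid_interval 0.25
samples = [(133.036, -19.473, 45.5, 1.0, 1.0, 0.2),
           (134.909, -19.572, 47.9, 1.0, 1.0, 0.2)]
depths = blend_to_grid(samples,
                       np.arange(130.0, 140.0 + 0.25, 0.25),
                       np.arange(-20.0, -17.0 + 0.25, 0.25))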

python moho_workflow.py config_moho_workflow_example.json will run the workflow with the configured parameters.

File Format

The method data files must be CSV and in a particular format. See example:

# Comments can be placed anywhere in the file, so long as they are 
# preceded by '#' and don't occur in the first two rows after the 
# 'START' flag (these rows are reserved for the optional timestamp, and
# the data header)

# We can also place miscellaneous values before the 'START' flag
# survey_location, lat, lon
GA,-35.343,149.158

# For 'TIME' below, we can use Unix epoch (e.g. 1600131276) 
# or ISO (e.g. 2020-09-15T00:54:36)

# START
# TIME 1600131276
# Net,Sta,Lon,Lat,Depth,Weight,Additional_Column,Another_Column
OA,BS24,133.036,-19.4734,45.5,1.0,this_is_a_comment,another_comment
# Comment rows can be added within the data
# These rows can be used to denote the start of profile lines,
# and other changes in the data.
OA,BW20,134.909,-19.572,47.9,1.0,this_is_a_comment,another_comment

The format is intended to be flexible and able to contain comments and additional values useful to human readers, while also containing the structured data required for point blending.
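For illustration, the format can be parsed along the following lines with pandas. This is a sketch only, not the workflow's own reader, and it assumes the header is the first or second comment row after '# START' (depending on whether a timestamp is present):

from io import StringIO
import pandas as pd

def read_method_csv(path):
    # Illustrative reader for the CSV format described above; a sketch
    # only -- the workflow's own parser may differ in details.
    with open(path) as f:
        lines = f.readlines()
    # Everything before '# START' is free-form comments and misc. values
    start = next(i for i, line in enumerate(lines)
                 if line.strip() == "# START")
    header = start + 1
    # The optional '# TIME <timestamp>' row sits between START and the header
    if lines[header].lstrip("# ").startswith("TIME"):
        header += 1
    columns = [c.strip() for c in lines[header].lstrip("# ").split(",")]
    # Comment rows interleaved with the data are simply skipped
    data = [line for line in lines[header + 1:]
            if line.strip() and not line.lstrip().startswith("#")]
    return pd.read_csv(StringIO("".join(data)), names=columns)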

Comments

Additional comments can be placed anywhere in the file, so long as they are not placed within the first two rows after the START flag (reserved for optional timestamp and data header).

Miscellaneous data

Data values that are not part of the structured samples can be added, so long as they are placed in the rows before the START flag.

START flag and optional timestamp

This is a flag used to separate unstructured user data from the structured sample data. Data following this flag will be blended and plotted.

The following row can be an optional timestamp, formatted as # TIME <timestamp>. The time itself can be given as a Unix timestamp (e.g. 1600131276) or in ISO format (e.g. 2020-09-15T00:54:36). This timestamp, if provided, can be used by quality selection algorithms (TODO: add link to section once complete).
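Either timestamp form can be normalised in a few lines of Python; a minimal sketch (the function name is invented for illustration):

from datetime import datetime, timezone

def parse_time(value):
    # Parse a '# TIME' value given as Unix epoch or ISO 8601 (sketch only)
    try:
        # Unix epoch, e.g. '1600131276'
        return datetime.fromtimestamp(int(value), tz=timezone.utc)
    except ValueError:
        # ISO format, e.g. '2020-09-15T00:54:36'
        return datetime.fromisoformat(value)

# Both example forms above refer to the same instant
assert parse_time("1600131276") == \
    parse_time("2020-09-15T00:54:36").replace(tzinfo=timezone.utc)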

The data header can be placed after the START flag, or after the timestamp if provided.

Data

The data header contains the column names. These columns do not need to be in any specific order. As many columns as you want can be provided. Only some columns are utilised by the data blending workflow:

  • Lon: Longitude of the data sample. If a valid network/station code pair and inventory file aren't provided, then it's necessary to provide sample locations as lat/lon. Other acceptable forms: lon, LON, longitude, Longitude, LONGITUDE
  • Lat: Latitude of the data sample. If a valid network/station code pair and inventory file aren't provided, then it's necessary to provide sample locations as lat/lon. Other acceptable forms: lat, LAT, latitude, Latitude, LATITUDE
  • Net: Network code for the data sample. If provided, it is used as a label when outputting map data. It can also be used to derive location from an inventory file (see 'Deriving location' below). Other acceptable forms: net, NET, network, Network, NETWORK
  • Sta: Station code for the data sample. If provided, it is used as a label when outputting map data. It can also be used to derive location from an inventory file (see 'Deriving location' below). Other acceptable forms: sta, STA, station, Station, STATION

In addition, a column must be provided as the value to be blended. In this example, the Depth column would be provided as the val_name parameter in the config.

An optional sample weight column can be provided as the sw_name parameter in the config.

Deriving location

The point blending workflow requires the location of each sample point.

You can provide per-sample coordinates as latitude and longitude. Alternatively, you can provide a network & station code for each sample and an inventory file.

To provide the inventory file, you must add it as the inventory_file parameter in the config:

...
"methods":
  [
      {
          "name": "ccp_1",
          "data": "sandbox/ccp_conversion/ccp_sample1.csv",
          "val_name": "Depth",
          "sw_name": "Weight",
          "weight": 1.0,
          "scale_length": 0.2,
          "inventory_file": "ccp_inventory.xml"
      }
...

The network and station codes will be joined as 'NET.STA' and used to look up the station location in the inventory, so it's important that they are spelt and formatted exactly as they appear in the inventory file.
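For reference, a station's coordinates can be looked up from a StationXML inventory with ObsPy along these lines (a sketch only; the example codes and file name come from the snippets above):

from obspy import read_inventory

inv = read_inventory("ccp_inventory.xml")  # inventory_file from the config
net_code, sta_code = "OA", "BS24"          # example sample codes

# Select by the codes exactly as they appear in the inventory
selection = inv.select(network=net_code, station=sta_code)
station = selection.networks[0].stations[0]  # raises IndexError if no match
print(station.latitude, station.longitude)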

Generating this file format using Excel

Excel can be used to generate this file by saving your spreadsheet as CSV. For example, a spreadsheet laid out with the rows below results in the following CSV file:

# Comment,,,,
# Misc. values,,,,
# A,B,C,,
1,2,3,,
,,,,
# START,,,,
# TIME 1600131276,,,,
# Net,Sta,Lon,Lat,Depth
OA,BS24,133.036,-19.463,45.5
OA,BW20,134.909,-19.572,47.9

(The trailing commas are due to Excel padding the file according to the empty cells of the spreadsheet; these are ignored by the workflow when read.)

GIS Data

If enabled, GIS data will be produced and stored in the output directory under gis_data. A singleband geotiff of the grid is produced. A multiband geotiff of the vector map is produced, with the bands containing the U and V components respectively. This can be rendered as a vector field using ArcGIS' vector field renderer or the equivalent function in your GIS software. Each method used will also produce a shapefile containing the locations of the samples used, along with other information such as station name and weight.
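The vector-map geotiff can also be inspected outside of a GIS package, for example with rasterio and matplotlib. A minimal sketch; the file name used here is a hypothetical example, so check gis_data for the actual output names:

import matplotlib.pyplot as plt
import rasterio

# Hypothetical file name -- check gis_data/ for the actual outputs
with rasterio.open("output/gis_data/gradient.tif") as src:
    u = src.read(1)  # band 1: U component
    v = src.read(2)  # band 2: V component

plt.quiver(u, v)
plt.title("Moho gradient vector field")
plt.show()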

GMT Data

If enabled, GMT-compatible data files will be produced and stored in the output directory under gmt_data. A grid.txt file of format LON LAT DEPTH is produced, which can be converted to a NetCDF grid using gmt xyz2grd and plotted using gmt grdimage. A gradient.txt file of format LON LAT ANGLE MAGNITUDE can be used to plot the vector field using gmt psxy -Sv. For each method used, a text file of format LON LAT TOTAL_WEIGHT is produced, which can be plotted using gmt psxy with the weight used as symbol size or another differentiator.
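For a quick check outside of GMT, grid.txt can also be loaded directly. A minimal sketch with numpy and matplotlib, assuming the output directory from the example config:

import numpy as np
import matplotlib.pyplot as plt

# LON LAT DEPTH triplets; path assumes output_directory = "./output"
lon, lat, depth = np.loadtxt("output/gmt_data/grid.txt", unpack=True)

# Rebuild the regular grid from the unique coordinates
lons, lats = np.unique(lon), np.unique(lat)
grid = np.full((lats.size, lons.size), np.nan)
grid[np.searchsorted(lats, lat), np.searchsorted(lons, lon)] = depth

plt.pcolormesh(lons, lats, grid, shading="nearest")
plt.colorbar(label="Moho depth (km)")
plt.show()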

Gridding other data

This workflow was developed around Moho depth, but any data can be gridded, so long as the data files follow the format described above, e.g. with a header of: # Sta,Lon,Lat,Value,Weight

Corrections

The workflow supports applying corrections and other preprocessing to the data used. Currently, CCP correction by H-k values is supported. Config example:

 {
      "data_preperation":
      [
          {
              "data": "/home/bren/data_passive/MOHO/hk_ccp_correction/ccp_weights.csv",
              "correction_data": "/home/bren/data_passive/MOHO/hk_ccp_correction/hk_weights.csv",
              "correction_func": "ccp_correction",
              "output_file": "/home/bren/data_passive/MOHO/hk_ccp_correction/ccp_corrected.csv"
          }
      ],
      ...

Adding this block will correct the CCP readings by H-k readings: for each station, the difference between the H-k median value and the CCP median value is applied to the CCP readings.

The corrected data (specified by output_file) can then be blended and plotted by providing it as the data for a method.
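In outline, the correction amounts to a per-station median shift. Below is a hypothetical pandas sketch of that logic; the column names are taken from the file format above, and the workflow's actual implementation may differ:

import pandas as pd

def ccp_correction(ccp: pd.DataFrame, hk: pd.DataFrame) -> pd.DataFrame:
    # Shift CCP depths by (H-k median - CCP median) per station.
    # A sketch of the logic described above; assumes 'Sta' and 'Depth'
    # columns in both frames. Stations absent from the H-k data get NaN.
    ccp_med = ccp.groupby("Sta")["Depth"].median()
    hk_med = hk.groupby("Sta")["Depth"].median()
    corrected = ccp.copy()
    corrected["Depth"] += corrected["Sta"].map(hk_med - ccp_med)
    return corrected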

Developer notes

The moho_config module contains constants for the configuration keys and handles config validation. If the config schema is modified, it's recommended to give the new key a constant and add it to the relevant SUPPORTED_KEYS list so it can pass validation.

If further correction or preprocessing functions are added, a mapping of the config key to the correction function must be added to CORR_FUNC_MAP.
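As a sketch of that registration pattern (the key and function names below are hypothetical, and the module layout may differ):

# In moho_config -- illustrative sketch only
MY_CORRECTION = 'my_correction'        # hypothetical new config key

# Add the key to the relevant list so config validation accepts it
SUPPORTED_KEYS.append(MY_CORRECTION)

# Register the function implementing the new preprocessing step
CORR_FUNC_MAP[MY_CORRECTION] = my_correction_func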