Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attribute aggregation and transformation #31

Open
wants to merge 39 commits into
base: main
Choose a base branch
from

Conversation

glitt13
Copy link
Collaborator

@glitt13 glitt13 commented Nov 15, 2024

An approach to aggregate and transform existing attribute data to create new attribute data.

Additions

  • xssa_attrs_tform.yaml: The example configuration file describing how variables are aggregated and transformed.
  • fs_tfrm_attrs.py : this is the main processing script that calls functions inside tfrm_attr.py
  • tfrm_attr.py : contains new functions for the fs_proc package
  • test_tfrm_attr.py: unit tests
  • fs_attrs_miss.R: Reads in missing comid-attributes file sometimes generated during fs_tfrm_attrs.py and attempts to find missing attribute data. In case missing attribute data exist, this Rscript is called inside the fs_tfrm_attrs.py to attempt to acquire data and use it for attribute transformation.

Removals

Changes

Testing

  1. Unit testing with test_tfrm_attr.py has been challenging to implement under a normal unittest package approach owing to a mysterious dask.dataframe as dd error. Implemented a work-around that partially tests this package by nixing most instances of using classes.

Screenshots

Notes

Todos

Checklist

  • PR has an informative and human-readable title
  • Changes are limited to a single goal (no scope creep)
  • Code can be automatically merged (no conflicts)
  • Code follows project standards (link if applicable)
  • Passes all existing automated tests
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future todos are captured in comments
  • Visually tested in supported browsers and devices (see checklist below 👇)
  • Project documentation has been updated (including the "Unreleased" section of the CHANGELOG)
  • Reviewers requested with the Reviewers tool ➡️

Testing checklist

Target Environment support

  • Windows
  • Linux
  • Browser

Accessibility

  • Keyboard friendly
  • Screen reader friendly

Other

  • Is useable without CSS
  • Is useable without JS
  • Flexible from small to large screens
  • No linting errors or warnings
  • JavaScript tests are passing

Copy link
Collaborator

@Ben-Choat Ben-Choat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left some comments, mostly (or maybe all) related to formatting and commenting. I did not execute this code or run any tests. I've approved, but recommend reviewing my comments. Let me know if you have any questions.

There are several dozen functions available in RaFTS. Getting those documented in a common location will be important for enabling users to pick it up and use it without scouring through code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend refactoring long commands like "dir_std_base = list([x for x in self.attr_config['file_io'] if 'dir_std_base' in x][0].values())[0].format(dir_base=dir_base, home_dir=home_dir)". Breaking it down into multiple lines/commands would be helpful in terms of readability.

Also, if we are attempting to go with PEP8 compliance, PEP8 typically recommends lines less than 79 characters. The line above is 150 characters, which typically indicates it should be broken into multiple lines not only for formatting but also for interpretability.

mock_rglob.assert_called_once_with("*67890*")
mock_read_parquet.assert_not_called()


Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend remving the commented code, or if we are leaving it for potential use later, adding a comment to indicate that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing function docstrings with several functions. Additional commenting that cleary explains what each function does, and why each test is perfored would be helpful to lead the user/coder/reader/reviewer through the code.

raise ValueError("Expecting path to file containing comids to be csv or parquet file")
return df


Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Likely" is a bit confusing in the comments in this function. Does it mean you are not sure what is actually happening?

# Determine which column identifies the comids in a given metadata file
loc_id_col = [x for x in loc_id_cols if x in df_meta.columns]
if len(loc_id_col) != 1:
raise ValueError("Could not find any of the location ID " +
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PIP8 compliance would have us include the '+' at the begining of the following line oppoosed to the end of the current line. Also, since each of these strings is within ValueError() paranthesis, the "+" are not needed.

@@ -15,6 +15,90 @@ library(data.table)
library(pkgcond)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably want to check if packages are installed, and install them if not.

Retr_Params <- base::list(paths = base::list(
# Note that if a path is provided, ensure the
# name includes 'path'. Same for directory having variable name with 'dir'
dir_db_hydfab=dir_db_hydfab,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consistency in formatting ... name = value, or name=value. I think most R standards suggest spaces around "=".

}

path_missing_attrs <- std_miss_path(Retr_Params$paths$dir_db_attrs)
df_miss <- utils::read.csv(path_missing_attrs,)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the comma is needed in utils::read.csv(xxxx,). Not sure if it will matter in terms of functioning.


path_missing_attrs <- std_miss_path(Retr_Params$paths$dir_db_attrs)
df_miss <- utils::read.csv(path_missing_attrs,)
if(nrow(df_miss)>0){
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest adding spaces around operators like > and = in the following chunk.

@@ -28,32 +40,40 @@ main <- function(){
lapply( function(x) gsub(pattern = "Gage_", replacement = "",x=x)) |>
unlist()

utils::write.table(gage_ids,glue::glue('{home_dir}/noaa/camels/gagesII_wood/camels_ii_gage_ids.txt'),row.names = FALSE,col.names = FALSE)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in general, arguments to a function should be separated by a space after each ','... e.g., write.table(xxxx, row...., col...)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants