Output to a standardized data format #83

talmo · 2024-11-25T22:43:57Z

Currently, we are outputting the results of stac-mjx to Pickle files by default:

Lines 192 to 211 in f3980e4

    
    def save(fit_data, save_path: Text):
        """Save data.

        Save data as .p or .h5 file.

        Args:
            fit_data (numpy array): Data to write out.
            save_path (Text): Path to save data. Defaults to None.
        """
        if os.path.dirname(save_path) != "":
            os.makedirs(os.path.dirname(save_path), exist_ok=True)
        _, file_extension = os.path.splitext(save_path)
        if file_extension == ".p":
            with open(save_path, "wb") as output_file:
                pickle.dump(fit_data, output_file, protocol=2)
        elif file_extension == ".h5":
            ioh5.save(save_path, fit_data)
        else:
            with open(save_path + ".p", "wb") as output_file:
                pickle.dump(fit_data, output_file, protocol=2)

It also supports saving to HDF5, but through a very general-purpose recursive method that trades documentation of the file format for ease of implementation.

What we would like is to explicitly list the main fields and associated metadata that we need to serialize. This would also help document the specifics of the file format (shapes, dtypes, names), which makes it more straightforward to establish a contract with downstream applications that consume the data this tool produces.

For example, an organization of the HDF5 file could look like:

/config: str [vlen]
/mjcf_xml: str [vlen]
/qpos: float32 [n_frames, ?]

It would probably be more portable and self-describing to break up qpos into its constituent elements, e.g.:

/root_xyz: float32 [n_frames, 3]
/root_quaternion: float32 [n_frames, 4]
/joint_angles: group
/joint_angles/spine1: float32 [n_frames, 3]  # 3 DOF joint
/joint_angles/elbowL: float32 [n_frames, 1]   # 1 DOF joint
...
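
As a rough sketch of the split, assuming the first 7 columns of qpos are the root position (3) and root quaternion (4), followed by the joints in a known order. The helper `split_qpos` and the `joint_dofs` mapping (joint name to DOF count) are hypothetical; in practice the joint names and sizes would come from the model.

```python
import numpy as np


def split_qpos(qpos, joint_dofs):
    """Split flat qpos [n_frames, 7 + sum(dofs)] into named arrays.

    Assumes columns 0:3 are the root position, 3:7 the root quaternion, and
    the remaining columns are joints in the order given by `joint_dofs`.
    """
    qpos = np.asarray(qpos, dtype=np.float32)
    expected = 7 + sum(joint_dofs.values())
    if qpos.ndim != 2 or qpos.shape[1] != expected:
        raise ValueError(f"expected shape [n_frames, {expected}], got {qpos.shape}")

    parts = {"root_xyz": qpos[:, :3], "root_quaternion": qpos[:, 3:7]}
    start = 7
    for name, dof in joint_dofs.items():
        # Keys are written as HDF5-style paths under the joint_angles group.
        parts[f"joint_angles/{name}"] = qpos[:, start:start + dof]
        start += dof
    return parts


parts = split_qpos(np.zeros((100, 11)), {"spine1": 3, "elbowL": 1})
```

Each key can be passed directly to h5py's `create_dataset`, which creates the intermediate `joint_angles` group as needed.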

But this format trades generalizability for self-descriptiveness.

Whether we keep qpos in its flattened representation (useful for pipelining) or break it up into better-described sub-elements (useful for portability and use outside of our pipelines) is a key decision point, though the two options are not mutually exclusive.

No matter what, we should also have a version key in the HDF5 file that can be used to route logic if this format evolves.
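
One way the routing could work, sketched under the assumption that the version is stored as a root-level attribute named `format_version` holding a "MAJOR.MINOR" string. The attribute name, the version scheme, and the `load` and `parse_major` helpers are all hypothetical.

```python
def parse_major(version):
    """Return the major component of a 'MAJOR.MINOR' version string."""
    if isinstance(version, bytes):  # some HDF5 string attributes read back as bytes
        version = version.decode("utf-8")
    return int(str(version).split(".")[0])


def load(handle, readers):
    """Dispatch to the reader registered for the file's major version.

    `handle` is anything exposing a dict-like `.attrs`, such as an open
    h5py.File. `readers` maps a major version number to a callable.
    """
    version = handle.attrs.get("format_version")
    if version is None:
        raise ValueError("file has no format_version attribute")
    major = parse_major(version)
    if major not in readers:
        raise ValueError(f"unsupported format_version {version!r}")
    return readers[major](handle)
```

Dispatching on the major version only would let minor revisions add optional fields without breaking existing readers.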

As a separate concern, we should also consider embedding the more useful values captured in this data structure, which we currently compute on the fly via forward kinematics after loading downstream (see this module):

@struct.dataclass
class ReferenceClip:
    """This dataclass is used to store the trajectory in the env."""

    # qpos
    position: jp.ndarray = None
    quaternion: jp.ndarray = None
    joints: jp.ndarray = None

    # xpos
    body_positions: jp.ndarray = None

    # velocity (inferred)
    velocity: jp.ndarray = None
    joints_velocity: jp.ndarray = None
    angular_velocity: jp.ndarray = None

    # xquat
    body_quaternions: jp.ndarray = None
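
The issue does not say how the "inferred" velocities are computed. As one plausible illustration only, a linear velocity could be obtained by finite differences over the root position and then stored alongside the other fields. The helper `finite_difference_velocity` and the `dt` parameter are assumptions.

```python
import numpy as np


def finite_difference_velocity(position, dt):
    """Estimate velocity [n_frames, 3] from positions [n_frames, 3].

    Uses central differences in the interior and one-sided differences at
    the first and last frames. `dt` is the time between frames in seconds.
    """
    position = np.asarray(position, dtype=np.float32)
    if position.shape[0] < 2:
        raise ValueError("need at least two frames to estimate velocity")
    return np.gradient(position, dt, axis=0)
```

Precomputing such fields at save time would make them part of the documented format instead of something every downstream consumer has to recompute.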
