Installation/Usage

To install msm_we:

cd </path/to/msm_we>

conda env create -f environment.yml
conda activate hamsm_env
pip install .

To use msm_we in a project:

from msm_we import msm_we

The basic workflow of this module is as follows.

Model building and preparation

  1. Run a weighted ensemble simulation, and produce a west.h5 file.

  2. Augment west.h5 with the full coordinate information in each <iterXXX/auxdata/coord>.

    msm_we/scripts/collectCoordinates/collectCoordinates.py has an example of this. This functionality will likely eventually be rolled into the main package.
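
    As a rough sketch of what this augmentation looks like (assuming the standard WESTPA HDF5 layout, and a hypothetical load_segment_coords() helper that retrieves the full coordinates for one segment of one iteration):

    import h5py
    import numpy as np

    # Hypothetical sketch; collectCoordinates.py does this for you in practice.
    with h5py.File("west.h5", "a") as h5file:
        for iter_name, iter_group in h5file["iterations"].items():
            n_segments = iter_group["seg_index"].shape[0]

            # load_segment_coords() is a placeholder for your own coordinate loader
            coords = np.array(
                [load_segment_coords(iter_name, seg) for seg in range(n_segments)]
            )

            iter_group.require_group("auxdata").create_dataset("coord", data=coords)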

  3. Define the processCoordinates(self, coords) function and monkey-patch it in.

    This function is responsible for featurization. It should take in an array of all coordinates, and produce an array of feature-coordinates.

    # A trivial example processCoordinates
    def processCoordinates(self, coords):
        # Do some stuff in here
        # This is NOT a complete example!
        return coords
    
    # Monkey-patch, i.e. replace the placeholder processCoordinates
    #   in msm_we.modelWE with your function
    msm_we.modelWE.processCoordinates = processCoordinates
    

    It’s important to do this monkey-patching at the module level, i.e. on msm_we.modelWE, rather than on an instance of a msm_we.modelWE object.
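
    As a slightly more concrete (but still hypothetical) sketch, a featurization that simply flattens each frame's Cartesian coordinates into a feature vector, assuming coords has shape (n_frames, n_atoms, 3), might look like:

    import numpy as np

    # Hypothetical featurization: flatten (n_frames, n_atoms, 3) coordinates
    #   into (n_frames, n_atoms * 3) feature vectors
    def processCoordinates(self, coords):
        coords = np.asarray(coords)
        return coords.reshape(coords.shape[0], -1)

    msm_we.modelWE.processCoordinates = processCoordinates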

  4. Create the model object.

    model = msm_we.modelWE()
    
  5. Initialize the model.

    model.initialize(file_paths, reference_structure_file, model_name,
                    basis_pcoord_bounds, target_pcoord_bounds,
                    dim_reduce_method, tau, pcoord_ndim)
    

    file_paths is a list of paths to your WESTPA h5 files.

    reference_structure_file is a file containing a topology describing your system.

    model_name is a descriptive name for the model, and is used to label various output files.

    basis_pcoord_bounds is a list of [[pcoord0 lower bound, pcoord0 upper bound], [pcoord1 lower bound, pcoord1 upper bound], …] in pcoord-space for the basis state

    target_pcoord_bounds is a list of [[pcoord0 lower bound, pcoord0 upper bound], [pcoord1 lower bound, pcoord1 upper bound], …] in pcoord-space for the target state

    dim_reduce_method is the dimensionality reduction method (“pca”, “vamp”, or “none”)

    tau is the resampling time, or the length of one WE iteration in physical units.

    pcoord_ndim is the dimensionality of the progress coordinate.
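
    For concreteness, an illustrative call with hypothetical values (a single 1-dimensional progress coordinate, PCA dimensionality reduction, and a 100 ps resampling time) might look like:

    # Illustrative call with hypothetical files and bounds -- substitute your own
    model.initialize(
        ["west.h5"],                  # file_paths
        "reference.pdb",              # reference_structure_file
        "my_model",                   # model_name
        [[0.0, 1.0]],                 # basis_pcoord_bounds (1D pcoord)
        [[9.0, 10.0]],                # target_pcoord_bounds (1D pcoord)
        dim_reduce_method="pca",
        tau=100e-12,                  # one WE iteration, e.g. 100 ps if working in seconds
        pcoord_ndim=1,
    )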

  6. Load all coords and pcoords up to the last iteration you want to use for analysis with

    model.get_iterations()
    model.get_coordSet(last_iter, streaming)
    

    where last_iter is the last iteration to load data from (typically the total number of iterations you have), and streaming enables streaming data processing, which allows datasets too large to fit in memory to be processed at the cost of a (nominally) small performance hit.
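
    For example, with hypothetical values, loading everything up through iteration 100 while streaming:

    model.get_iterations()
    model.get_coordSet(last_iter=100, streaming=True)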

  7. Prepare dimensionality reduction transformer by running

    model.dimReduce()
    
  8. Do clustering with

    model.cluster_coordinates(n_clusters, streaming,
        first_cluster_iter, use_ray, stratified,
        **_cluster_args)
    

    n_clusters is the total number of clusters if stratified=False, or the number of clusters per bin if stratified=True.

    streaming is whether or not to stream over the data.

    first_cluster_iter is the first iteration used for building the cluster centers (useful for excluding initial burn-in iterations).

    use_ray enables parallelization with the Ray work manager. If enabled, a Ray cluster must be initialized by the user before calling this.

    stratified enables stratified clustering instead of aggregate clustering. In stratified clustering, clustering is done independently within each WE bin. This is strongly recommended to ensure your clusters provide a good fine-grained description of your system.

    Note: At time of writing, stratified clustering implies streaming=True and use_ray=True, and will enable these with a warning if they are not set.

    Any additional keyword arguments will be provided directly to the clustering function through **_cluster_args.
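
    As an illustration, a stratified clustering call with hypothetical values might look like:

    # Hypothetical values: 25 clusters per WE bin, skipping the first 10
    #   iterations when building cluster centers, parallelized through Ray
    model.cluster_coordinates(
        n_clusters=25,
        streaming=True,
        first_cluster_iter=10,
        use_ray=True,
        stratified=True,
    )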

  9. Create the flux matrix with

    model.get_fluxMatrix(lag, first_iter, last_iter, use_ray)
    

    lag is the lag-time used for model-building. Currently, only 0 is supported, which corresponds to looking at transitions from each parent segment directly to its child.

    first_iter and last_iter are the first and last iterations to use when computing the flux matrix. Note that excluding many iterations may result in limited connectivity of the flux matrix, since early iterations may have provided critical transitions between WE bins that are not otherwise sampled.

    use_ray enables parallelization with the Ray work manager. If enabled, a Ray cluster must be initialized by the user before calling this.
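
    For example, with hypothetical iteration numbers and Ray parallelization:

    # lag must currently be 0 (parent-to-child transitions)
    model.get_fluxMatrix(0, first_iter=1, last_iter=100, use_ray=True)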

  10. Clean disconnected states and sort the flux matrix with

    model.organize_fluxMatrix(use_ray)
    

    use_ray enables parallelization with the Ray work manager. If enabled, a Ray cluster must be initialized by the user before calling this.

Analysis

  1. Normalize the flux matrix to produce a transition matrix with

    model.get_Tmatrix()
    
  2. Obtain steady-state distribution with

    model.get_steady_state()
    

    Note: This may fail or encounter difficulties for datasets where no target flux has been obtained. This can happen with either incomplete sampling to your target state, or with equilibrium data. The reason is that this method uses the flux estimate as a convergence criterion; if the flux is 0, convergence to 0 is not meaningful, so it will simply run for the maximum number of iterations. You can specify max_iters=1 to avoid unnecessary iteration, or you can use model.get_steady_state_algebraic.
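
    For example, for equilibrium data where no target flux is expected, the workarounds described above look like:

    # Avoid iterating on a zero-flux convergence criterion
    model.get_steady_state(max_iters=1)

    # ...or use the algebraic solver instead
    model.get_steady_state_algebraic()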

  3. Update cluster structures

    model.update_cluster_structures()
    
  4. Obtain steady-state target flux with

    model.get_steady_state_target_flux()
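
Putting these steps together, a minimal sketch of the analysis pass is shown below. The JtargetSS attribute used to read out the result is an assumption about where the model stores the target flux; check the API reference for your installed version.

model.get_Tmatrix()
model.get_steady_state()
model.update_cluster_structures()
model.get_steady_state_target_flux()

# Assumption: the computed steady-state target flux is stored on the model,
#   e.g. as model.JtargetSS -- verify against your version's API
print("Steady-state target flux:", model.JtargetSS)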
    

Streaming

msm_we supports streaming dimensionality reduction and clustering when dimensionality reduction is done through PCA, or when no dimensionality reduction is used.

Streaming dimensionality reduction is automatically done for PCA.

To use streaming clustering, pass streaming=True to cluster_coordinates().

Streaming is not supported for VAMP, because I don’t know of a streaming implementation of VAMP dimensionality reduction.

Parallelism

msm_we supports parallelizing many of the “slow” parts of model-building, namely clustering, discretization, and flux matrix calculations. This is done through the Ray work manager.

Before invoking any function with use_ray=True, a Ray work manager must be initialized on the machine running the analysis. In the simplest case, this can just be

import ray
ray.init()

msm_we will connect to whatever Ray instance is running on the machine where the analysis is performed. On a compute cluster, this can also be used to set up a Ray cluster with workers on a number of nodes, with msm_we running on the same node as the Ray head.
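
For example, on a multi-node allocation where a Ray cluster has already been started with the ray CLI (ray start --head on the head node, and ray start --address=<head-ip>:6379 on the workers), the analysis script can connect to the existing cluster instead of starting a new local instance:

import ray

# Connect to the already-running Ray cluster on this machine/allocation
ray.init(address="auto")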