Installation/Usage
To install msm_we:

    cd </path/to/msm_we>
    conda env create -f environment.yml
    conda activate hamsm_env
    pip install .

To use msm_we in a project:

    from msm_we import msm_we
The basic workflow of this module is as follows.
Model building and preparation
1. Run a weighted ensemble simulation, and produce a west.h5 file.

2. Augment west.h5 with the full coordinate information in each iterXXX/auxdata/coord. msm_we/scripts/collectCoordinates/collectCoordinates.py has an example of this. This functionality will likely eventually be rolled into the main package.

3. Define the processCoordinates(self, coords) function and monkey-patch it in. This function is responsible for featurization: it should take in an array of all coordinates, and produce an array of feature-coordinates.
    # A trivial example processCoordinates
    def processCoordinates(self, coords):
        # Do some stuff in here
        # This is NOT a complete example!
        return coords

    # Monkey-patch, i.e. replace the placeholder processCoordinates
    # in msm_we.modelWE with your function
    msm_we.modelWE.processCoordinates = processCoordinates
It’s important to do this monkey-patching at the module level, i.e. on msm_we.modelWE, rather than on an instance of a msm_we.modelWE object.
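As a slightly more concrete sketch, a hypothetical featurization might flatten each frame’s coordinates into a single feature vector; the input shape assumed here is for illustration only, not a requirement of msm_we:

    import numpy as np

    # Hypothetical featurization: assumes coords arrives as
    # (n_frames, n_atoms, 3) and returns (n_frames, n_atoms * 3)
    def processCoordinates(self, coords):
        coords = np.asarray(coords)
        return coords.reshape(coords.shape[0], -1)

    # Patch at the module level, as above
    msm_we.modelWE.processCoordinates = processCoordinates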
4. Create the model object.

    model = msm_we.modelWE()
5. Initialize the model.

    model.initialize(file_paths, reference_structure_file, model_name,
                     basis_pcoord_bounds, target_pcoord_bounds,
                     dim_reduce_method, tau, pcoord_ndim)
- file_paths is a list of paths to your WESTPA h5 files.
- reference_structure_file is a file containing a topology describing your system.
- model_name is what it sounds like, and is used to label various output files.
- basis_pcoord_bounds is a list of [[pcoord0 lower bound, pcoord0 upper bound], [pcoord1 lower bound, pcoord1 upper bound], …] in pcoord-space for the basis state.
- target_pcoord_bounds is a list of [[pcoord0 lower bound, pcoord0 upper bound], [pcoord1 lower bound, pcoord1 upper bound], …] in pcoord-space for the target state.
- dim_reduce_method is the dimensionality reduction method (“pca”, “vamp”, or “none”).
- tau is the resampling time, or the length of one WE iteration in physical units.
- pcoord_ndim is the dimensionality of the progress coordinate.
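For concreteness, a hypothetical call might look like the following, where every file name and value is a placeholder rather than a default:

    model.initialize(
        file_paths=["west.h5"],
        reference_structure_file="system.pdb",
        model_name="my_model",
        basis_pcoord_bounds=[[0.0, 1.0]],    # one pcoord dimension: [lower, upper]
        target_pcoord_bounds=[[9.0, 10.0]],
        dim_reduce_method="pca",
        tau=100e-12,                         # e.g. 100 ps per WE iteration
        pcoord_ndim=1,
    )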
6. Load all coords and pcoords up to the last iteration you want to use for analysis with

    model.get_iterations()
    model.get_coordSet(last_iter, streaming)
where last_iter is the number of iterations you have (AKA, the last iteration it’ll load data from) and streaming enables streaming data processing, which allows large datasets to fit in memory at the cost of a (nominally) small performance hit.
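For example, to load the first 100 iterations with streaming enabled (the iteration count is a placeholder):

    model.get_iterations()
    model.get_coordSet(last_iter=100, streaming=True)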
7. Prepare the dimensionality reduction transformer by running

    model.dimReduce()

8. Do clustering with

    model.cluster_coordinates(n_clusters, streaming, first_cluster_iter,
                              use_ray, stratified, **_cluster_args)
- n_clusters is the total number of clusters if stratified=False, or the number of clusters per bin if stratified=True.
- streaming is whether or not to stream over the data.
- first_cluster_iter is the first iteration used for building the cluster centers (which can be used to exclude initial burn-in iterations).
- use_ray enables parallelization with the Ray work manager. If enabled, a Ray cluster must be initialized by the user before calling this.
- stratified enables stratified clustering instead of aggregate clustering. In stratified clustering, clustering is done independently within each WE bin. This is strongly recommended, to ensure your clusters provide a good fine-grained description of your system.

Note: At the time of writing, stratified clustering implies streaming=True, use_ray=True and will enable these with a warning if they are not set.

Any additional keyword arguments will be passed directly to the clustering function through **_cluster_args.
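A hypothetical invocation, with placeholder values for the cluster count and burn-in cutoff:

    import ray
    ray.init()  # a Ray cluster must be up before any use_ray=True call

    model.cluster_coordinates(
        n_clusters=25,          # per WE bin, since stratified=True
        streaming=True,
        first_cluster_iter=10,  # exclude early burn-in iterations
        use_ray=True,
        stratified=True,
    )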
9. Create the flux matrix with

    model.get_fluxMatrix(lag, first_iter, last_iter, use_ray)
- lag is the lag time used for model-building. Currently, only 0 is supported, which corresponds to looking at transitions from each parent segment directly to its child.
- first_iter and last_iter are the first and last iterations to use when computing the flux matrix. Note that excluding many iterations may limit the connectivity of the flux matrix, since early iterations may have provided critical transitions between WE bins that are not otherwise sampled.
- use_ray enables parallelization with the Ray work manager. If enabled, a Ray cluster must be initialized by the user before calling this.
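For example, with placeholder iteration bounds:

    model.get_fluxMatrix(lag=0, first_iter=1, last_iter=100, use_ray=True)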
10. Clean disconnected states and sort the flux matrix with

    model.organize_fluxMatrix(use_ray)

- use_ray enables parallelization with the Ray work manager. If enabled, a Ray cluster must be initialized by the user before calling this.
Analysis
1. Normalize the flux matrix to produce a transition matrix with

    model.get_Tmatrix()

2. Obtain steady-state distribution with

    model.get_steady_state()
Note: This may fail or encounter difficulties for datasets where no target flux has been obtained. This can happen with either incomplete sampling of your target state, or with equilibrium data. This is because get_steady_state uses the flux estimate as a convergence criterion: if the flux is 0, then convergence of 0 is not meaningful, and it will just run for the maximum number of iterations. You can specify max_iters=1 to avoid unnecessary iteration, or use model.get_steady_state_algebraic instead.
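For instance, with equilibrium data, either of the following avoids the wasted iterations:

    model.get_steady_state(max_iters=1)  # skip the (meaningless) convergence loop
    model.get_steady_state_algebraic()   # or solve for the distribution algebraically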
3. Update cluster structures with

    model.update_cluster_structures()
4. Obtain steady-state target flux with

    model.get_steady_state_target_flux()
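Putting the analysis together, the full sequence is just:

    model.get_Tmatrix()
    model.get_steady_state()
    model.update_cluster_structures()
    model.get_steady_state_target_flux()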
Streaming
msm_we supports streaming dimensionality reduction and clustering when dimensionality reduction is done through PCA or not done at all. Streaming dimensionality reduction is performed automatically for PCA. To use streaming clustering, pass streaming=True to cluster_coordinates().

Streaming is not supported for VAMP, because I don’t know of a streaming implementation of VAMP dimensionality reduction.
Parallelism
msm_we supports parallelism for many of the “slow” parts of model-building, namely clustering, discretization, and flux matrix calculations. This is done through the Ray work manager.

Before invoking any function with use_ray=True, a Ray work manager must be initialized on the machine running the analysis. In the simplest case, this can just be

    import ray
    ray.init()

msm_we will connect to whatever Ray instance is running on the machine where the analysis is being performed. However, this can also be used on a cluster, to initialize a Ray cluster with workers on a number of nodes, with msm_we running on the same node as the Ray head.
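As a sketch of the multi-node case (this is standard Ray usage, not msm_we-specific; the ray start commands are run in a shell before the analysis):

    import ray

    # On the head node:    ray start --head
    # On each worker node: ray start --address=<head_node_ip>:6379
    # Then, in the analysis process on the head node:
    ray.init(address="auto")  # attach to the already-running Ray cluster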