Such comprehensive encapsulation enables users to develop protein machine learning solutions with one easy-to-use library, avoiding the hassle of gluing multiple libraries into a pipeline.
With TorchProtein, we can rapidly prototype machine learning solutions for various protein applications within 20 lines of code, and conduct ablation studies by substituting different parts of a solution with off-the-shelf modules. Furthermore, we can easily adapt these modules to our own needs, and make systematic analyses by comparing the new results to a benchmark provided in the library.
Additionally, TorchProtein is designed to be accessible to everyone. For inexperienced users, like beginners or biological researchers, TorchProtein provides [user-friendly APIs](https://torchdrug.ai/docs/) to simplify the development of protein machine learning solutions. Meanwhile, for professional users, TorchProtein preserves enough flexibility to satisfy their demands, supported by features like the modular design of the library and on-the-fly graph construction.
## Main Features

### Simplify Data Processing
- It is challenging to transform raw bioinformatic protein datasets into tensor formats for machine learning. To reduce tedious operations, TorchProtein provides the data structure `data.Protein` and its batched extension `data.PackedProtein` to automate the data processing step.
- `data.Protein` and `data.PackedProtein` automatically gather protein data from various biological sources and seamlessly switch between data formats like PDB files, RDKit objects and sequences. Please see the data structures and operations section for transforming from and to sequences and RDKit objects; a short sketch of sequence conversion also follows the example below.
```python
# construct a data.Protein instance from a pdb file
pdb_file = ...
protein = data.Protein.from_pdb(pdb_file, atom_feature="position", bond_feature="length", residue_feature="symbol")
print(protein)

# write a data.Protein instance back to a pdb file
new_pdb_file = ...
protein.to_pdb(new_pdb_file)
```

```bash
Protein(num_atom=445, num_bond=916, num_residue=57)
```
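The same data structure can also be built from or converted to other formats. Below is a minimal sketch of a sequence round trip; it assumes `data.Protein.from_sequence` and `Protein.to_sequence` behave as described in the data structures and operations section, and the sequence string is an arbitrary example.

```python
# a minimal sketch, assuming from_sequence / to_sequence behave as described
# in the data structures and operations section
# construct a data.Protein instance from an arbitrary amino acid sequence
protein = data.Protein.from_sequence("KALTARQQEVFDLIRD")
print(protein)

# recover the sequence string from the data structure
print(protein.to_sequence())
```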
- `data.Protein` and `data.PackedProtein` automatically pre-process all kinds of atom, bond and residue features, controlled by a few constructor arguments; a sketch of skipping unneeded features follows the example below.
```python
pdb_file = ...
protein = data.Protein.from_pdb(pdb_file, atom_feature="position", bond_feature="length", residue_feature="symbol")

# inspect the pre-processed residue, atom and bond features
print(protein.residue_feature.shape)
print(protein.atom_feature.shape)
print(protein.bond_feature.shape)
```

```bash
torch.Size([57, 21])
torch.Size([445, 3])
torch.Size([916, 1])
```
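Which features are actually computed is controlled by these constructor arguments alone. The sketch below assumes that passing `None` for a feature argument simply skips building that feature, as done in the tutorials.

```python
# a minimal sketch, assuming None skips the corresponding feature (as in the
# tutorials): keep only residue-level features
pdb_file = ...
protein = data.Protein.from_pdb(pdb_file, atom_feature=None, bond_feature=None,
                                residue_feature="symbol")
print(protein.residue_feature.shape)
```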
- `data.Protein` and `data.PackedProtein` automatically keep track of numerous attributes associated with atoms, bonds, residues and the whole protein.
- For example, references offer a way to register new attributes as node-, edge- or graph-level properties, so that the new attributes automatically go along with the nodes, edges or the graph themselves. More built-in attributes are listed in the data structures and operations section; a short sketch of how registered attributes travel with batching follows the code below.
```python
protein = ...

# register custom attributes at the node, edge and graph level
with protein.node():
    protein.node_id = torch.tensor([i for i in range(0, protein.num_node)])
with protein.edge():
    protein.edge_cost = torch.rand(protein.num_edge)
with protein.graph():
    protein.graph_feature = torch.randn(128)
```
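Once registered, these attributes follow the protein through downstream operations. The sketch below assumes the usual packing semantics, where node- and edge-level attributes are concatenated across the batch and graph-level attributes gain a batch dimension.

```python
# a minimal sketch of how registered attributes travel with the graph
# (assuming packing concatenates node/edge attributes and stacks graph ones)
proteins = data.Protein.pack([protein, protein])
print(proteins.node_id.shape)        # roughly (2 * protein.num_node,)
print(proteins.edge_cost.shape)      # roughly (2 * protein.num_edge,)
print(proteins.graph_feature.shape)  # roughly (2, 128)
```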
- Moreover, references can be used to maintain the correspondence between two closely related objects. For example, the mapping `atom2residue` maintains the relationship between atoms and residues, and enables indexing on either of them.
```python
protein = ...

# create mask indices for atoms in glutamine (GLN) residues
is_glutamine = protein.residue_type[protein.atom2residue] == protein.residue2id["GLN"]
mask_indices = is_glutamine.nonzero().squeeze(-1)
print(mask_indices)

# map the masked atoms back to their glutamine residues
residue_type = protein.residue_type[protein.atom2residue[mask_indices]]
print([protein.id2residue[r] for r in residue_type.tolist()])
```

```bash
tensor([ 26,  27,  28,  29,  30,  31,  32,  33,  34, 307, 308, 309, 310, 311,
        312, 313, 314, 315, 384, 385, 386, 387, 388, 389, 390, 391, 392])
['GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN']
```
- It is useful to augment protein data by modifying protein graphs or constructing new ones. With the protein operations and the graph construction layers provided in TorchProtein,
  - we can easily modify proteins on the fly by batching, slicing sequences, masking out side chains, etc. Please see the tutorials for more details on masking; a rough sketch of side-chain masking also follows the example below.
```python
pdb_file = ...
protein = data.Protein.from_pdb(pdb_file, atom_feature="position", bond_feature="length", residue_feature="symbol")

# batch
proteins = data.Protein.pack([protein, protein, protein])

# slice sequences
# use indexing to extract specific residues of a particular protein
two_residues = protein[[0, 2]]
two_residues.visualize()
```

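As a rough illustration of the side-chain masking mentioned above, the sketch below keeps only backbone atoms. It assumes the per-atom `atom_name` attribute, the `atom_name2id` vocabulary and the generic `subgraph` operation behave as in the data structure tutorials.

```python
# a rough sketch of masking out side chains, assuming atom_name, atom_name2id
# and subgraph behave as in the data structure tutorials
backbone_ids = [protein.atom_name2id[name] for name in ("N", "CA", "C", "O")]
is_backbone = torch.zeros(protein.num_atom, dtype=torch.bool)
for atom_id in backbone_ids:
    is_backbone |= protein.atom_name == atom_id
backbone_protein = protein.subgraph(is_backbone.nonzero().squeeze(-1))
print(backbone_protein)
```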
  - we can construct protein graphs on the fly with GPU acceleration, which offers flexible choices rather than relying on fixed pre-processed graphs. Below is an example that builds a graph with only alpha-carbon atoms; please check the tutorials for more cases, such as adding spatial / KNN / sequential edges (a rough sketch follows the example below).
```python
protein = ...

# transfer from CPU to GPU
protein = protein.cuda()
print(protein)

# build a graph with only alpha carbon (CA) atoms
node_layers = [geometry.AlphaCarbonNode()]
graph_construction_model = layers.GraphConstruction(node_layers=node_layers)

original_protein = data.Protein.pack([protein])
CA_protein = graph_construction_model(original_protein)
print("Graph before:", original_protein)
print("Graph after:", CA_protein)
```

```bash
Protein(num_atom=445, num_bond=916, num_residue=57, device='cuda:0')
Graph before: PackedProtein(batch_size=1, num_atoms=[2639], num_bonds=[5368], num_residues=[350])
Graph after: PackedProtein(batch_size=1, num_atoms=[350], num_bonds=[0], num_residues=[350])
```
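The same construction model also accepts edge layers. Below is a rough sketch that adds spatial, KNN and sequential edges on top of the alpha-carbon graph; the layer names follow the `geometry` module used in the tutorials, and the hyperparameters are only illustrative.

```python
# a rough sketch: add spatial / KNN / sequential edges on top of the
# alpha-carbon graph (hyperparameters below are only illustrative)
node_layers = [geometry.AlphaCarbonNode()]
edge_layers = [
    geometry.SpatialEdge(radius=10.0, min_distance=5),
    geometry.KNNEdge(k=10, min_distance=5),
    geometry.SequentialEdge(max_distance=2),
]
graph_construction_model = layers.GraphConstruction(node_layers=node_layers,
                                                    edge_layers=edge_layers)
CA_protein = graph_construction_model(original_protein)
print("Graph with extra edges:", CA_protein)
```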
### Easy to Prototype Solutions
With TorchProtein, common protein tasks, such as the **sequence-based protein property prediction task**, can be finished within 20 lines of code. Below is an example; more examples of popular protein tasks and models can be found in Protein Tasks, Models and Tutorials. A sketch of a simple ablation variant follows the example.
```python
import torch
from torchdrug import datasets, transforms, models, tasks, core

# truncate long sequences and use the residue-level view of proteins
truncate_transform = transforms.TruncateProtein(max_length=200, random=False)
protein_view_transform = transforms.ProteinView(view="residue")
transform = transforms.Compose([truncate_transform, protein_view_transform])

# load the Beta-lactamase dataset and split it
dataset = datasets.BetaLactamase("~/protein-datasets/", residue_only=True, transform=transform)
train_set, valid_set, test_set = dataset.split()

# define a CNN encoder and a property prediction task on top of it
model = models.ProteinCNN(input_dim=21,
                          hidden_dims=[1024, 1024],
                          kernel_size=5, padding=2, readout="max")
task = tasks.PropertyPrediction(model, task=dataset.tasks,
                                criterion="mse", metric=("mae", "rmse", "spearmanr"),
                                normalization=False, num_mlp_layer=2)

# train and evaluate
optimizer = torch.optim.Adam(task.parameters(), lr=1e-4)
solver = core.Engine(task, train_set, valid_set, test_set, optimizer,
                     gpus=[0], batch_size=64)
solver.train(num_epoch=10)
solver.evaluate("valid")
```
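Because every component is modular, an ablation study amounts to swapping one module and re-running the same pipeline. The sketch below reuses the dataset and splits from the example above and replaces the CNN encoder with an LSTM encoder; the `models.ProteinLSTM` arguments are assumptions based on the model zoo and may need adjustment.

```python
# a rough sketch of an ablation: swap the CNN encoder for an LSTM encoder
# (the ProteinLSTM arguments are assumptions; check the model zoo docs)
model = models.ProteinLSTM(input_dim=21, hidden_dim=640, num_layers=3)
task = tasks.PropertyPrediction(model, task=dataset.tasks,
                                criterion="mse", metric=("mae", "rmse", "spearmanr"),
                                normalization=False, num_mlp_layer=2)
optimizer = torch.optim.Adam(task.parameters(), lr=1e-4)
solver = core.Engine(task, train_set, valid_set, test_set, optimizer,
                     gpus=[0], batch_size=64)
solver.train(num_epoch=10)
solver.evaluate("valid")
```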