Networks in the SFLD
Contents:
- Network Definitions and Visualization
- Why Use Networks?
- SFLD Network Creation
- Network Caveats
- Node and Edge Attributes
Network Definitions and Visualization
A network is an instantiation of the abstract data type called a graph, which links individuals (nodes) through their connections (edges).
The SFLD provides three types of networks. In a molecule similarity network (MSN), the nodes represent molecules (reactants, products, or both) and the edges represent similarities in chemical structure. In a reaction similarity network (RSN), nodes represent reactions and edges represent the similarity between a given reaction pair. In a sequence similarity network (SSN), nodes represent proteins and edges indicate similarity in amino acid sequence. SFLD SSNs are further subdivided into two types:
- One Sequence per Node (1SPN) – each node in the network represents a unique protein sequence, and each edge represents the similarity between the connected sequences
- Representative – each node in the network represents a set of related protein sequences, and each edge represents the average similarity between the sequences in the connected nodes
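As a rough illustration of how an edge between two representative nodes could summarize the underlying sequences, the sketch below averages the pairwise scores between the members of two nodes. The sequence names, the scores, and the choice to skip pairs with no recorded score are assumptions made for the example, not a description of the SFLD pipeline.

```python
from itertools import product
from statistics import mean

# Hypothetical pairwise similarity scores between individual sequences.
pairwise = {
    frozenset(("s1", "s3")): 30.0,
    frozenset(("s1", "s4")): 40.0,
    frozenset(("s2", "s3")): 50.0,
    # no entry for (s2, s4): no similarity was recorded for this pair
}

def representative_edge(node_a, node_b, scores):
    """Average the available scores over all cross-node sequence pairs.

    Returns None when no pair has a recorded score (no edge is drawn).
    """
    found = [
        scores[frozenset(pair)]
        for pair in product(node_a, node_b)
        if frozenset(pair) in scores
    ]
    return mean(found) if found else None

# Two representative nodes, each holding two sequences.
weight = representative_edge(["s1", "s2"], ["s3", "s4"], pairwise)
```

Whether unscored pairs are ignored, as here, or counted as zero changes the average, so this choice would need to be checked against the actual network files.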
MSNs, RSNs, and SSNs provide a graphical view of the similarity relationships within a set of molecules, reactions, or proteins, and can facilitate large-scale analyses. Networks can also be studied analytically using a rich collection of algorithms.
The networks are provided in XGMML or CYS format for import into the program Cytoscape. This program allows interactive visualization, manipulation, and analysis of networks and their attributes (see the Tutorials for step-by-step examples). To handle larger networks, it may be necessary to increase the memory allocation for Cytoscape.
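XGMML is an XML format, so a network file can also be inspected outside Cytoscape. The sketch below reads nodes, edges, and their attributes using only the Python standard library. The sample document and its attribute names are invented for the example and are not taken from an SFLD download; the layout follows the general XGMML convention of `node` and `edge` elements with nested `att` elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature XGMML document (not an actual SFLD file).
SAMPLE = """<?xml version="1.0"?>
<graph label="toy network" xmlns="http://www.cs.rpi.edu/XGMML">
  <node id="n1" label="protein A">
    <att name="Family" type="string" value="family X"/>
  </node>
  <node id="n2" label="protein B">
    <att name="Family" type="string" value="unclassified"/>
  </node>
  <edge source="n1" target="n2" label="n1-n2">
    <att name="-log10(E)" type="real" value="42.0"/>
  </edge>
</graph>"""

def local(tag):
    """Strip any XML namespace prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]

def read_xgmml(text):
    """Return ({node id: attributes}, [(source, target, attributes), ...])."""
    root = ET.fromstring(text)
    nodes, edges = {}, []
    for elem in root:
        # Collect the name/value pairs of the nested <att> elements.
        atts = {a.get("name"): a.get("value") for a in elem if local(a.tag) == "att"}
        if local(elem.tag) == "node":
            nodes[elem.get("id")] = atts
        elif local(elem.tag) == "edge":
            edges.append((elem.get("source"), elem.get("target"), atts))
    return nodes, edges

nodes, edges = read_xgmml(SAMPLE)
```

Attribute values are returned as strings here; numeric attributes such as edge scores would need to be converted before filtering.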
Why Use Networks?
SSNs can be used to examine the relationships within large, diverse sets of sequences for which the costs of traditional methods of analysis such as phylogenetic trees would be prohibitive, due to the difficulty in generating accurate multiple alignments. In addition, the SSNs provide a graphical overview of interrelationships among and between sets of proteins that are not easily discerned from visual inspection of large trees and multiple alignments.
The two-dimensional distances in these networks correlate well with the distances in phylogenetic trees, indicating that they are mathematically reasonable. SSNs provide the user with a flexible, interactive view, where both nodes and edges can be overlaid with orthogonal information (for example, functional annotation), making networks a powerful tool for hypothesis generation. For more information, see Atkinson et al., PLoS One 4:e4345 (2009) and Copp et al., Biochemistry 57(31):4651-4662 (2018).
While the number of molecules or reactions in SFLD MSNs and RSNs is typically not large, networks still provide a useful graphical overview of molecular or reaction similarity, which is otherwise difficult to visualize.
SFLD Network Creation
Sequence Similarity Networks:
For SSNs, BLAST scores are used as the measure of sequence similarity. BLAST searches are performed by comparing batches of query sequences to a database consisting of all processed sequences in the protein set of interest (for example, a superfamily) using the blastp program. The equivalent blast2seq E-value for each pairwise comparison is then calculated and assigned to edges to create the 1SPN network.
Representative networks may be provided at different percent identity cutoffs. The cutoff determines how diverse a set of proteins can be and still be “collapsed” into a single node. For example, in a representative network at 50% ID, each of the proteins associated with a given node has ≥50% sequence identity with the protein used to seed that node. Thus, for the same set of protein sequences, a lower percent identity cutoff gives a representative network with fewer nodes.
Clustering by sequence identity is done with CD-HIT. An in-house version of the Pythoscape framework, tailored for use with available hardware, is used to generate the representative network and to compute various node and edge statistics, which are then assigned as attributes.
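The effect of the percent identity cutoff can be illustrated with a toy greedy clustering: each sequence joins the first seed it matches at or above the cutoff, and otherwise seeds a new node. This is only a sketch of the general idea. The `identity` function below (fraction of matching positions between equal-length strings) is a stand-in assumption and is not how CD-HIT or the SFLD pipeline computes sequence identity.

```python
def identity(a, b):
    """Toy percent identity: matching positions over length (equal-length strings assumed)."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / max(len(a), len(b))

def greedy_cluster(seqs, cutoff):
    """Assign each sequence to the first seed with identity >= cutoff, else make it a seed."""
    clusters = {}  # seed sequence -> list of member sequences
    for seq in sorted(seqs, key=len, reverse=True):  # longest sequences seed first
        for seed in clusters:
            if identity(seq, seed) >= cutoff:
                clusters[seed].append(seq)
                break
        else:
            # No existing seed was similar enough, so this sequence seeds a new node.
            clusters[seq] = [seq]
    return clusters

seqs = ["AAAAAAAAAA", "AAAAAAAAAC", "AAAAACCCCC", "CCCCCCCCCC"]
nodes_at_90 = len(greedy_cluster(seqs, 90))  # 3 nodes
nodes_at_50 = len(greedy_cluster(seqs, 50))  # 2 nodes
```

With these four toy sequences, lowering the cutoff from 90% to 50% reduces the number of representative nodes from three to two.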
In general, networks of up to 250,000-500,000 (250-500K) edges are manageable on recent laptops, with the memory allocation for Cytoscape increased as needed. As many of our superfamily networks would be too large to be opened or readily manipulated as 1SPN networks using Cytoscape, we provide the following:
- 1SPN networks are created for groups with <2500 sequences (nodes), at a BLAST bit score cutoff of 60. If necessary, edges are pruned to ensure the network contains <= 250K edges.
- 50% ID representative SSNs are provided for all groups, at a -log10(E-Value) cutoff of 20. If necessary, edges are pruned to ensure the network contains <= 250K edges.
- Additional SSNs are available for groups with recently published papers from the Babbitt Lab; these contain the specific sequence set, annotation information, and edge score cutoffs used in the relevant paper.
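The cutoff-and-prune step described in the list above can be sketched as follows. The pruning rule used here (keep the highest-scoring edges) is an assumption made for illustration, since the exact pruning procedure is not specified; the function names and the example edges are hypothetical.

```python
import math

def neg_log10_evalue(evalue, floor=1e-300):
    """Convert an E-value to -log10(E), flooring tiny values to avoid log(0)."""
    return -math.log10(max(evalue, floor))

def filter_and_prune(edges, cutoff=20.0, max_edges=250_000):
    """Keep edges scoring >= cutoff; if too many remain, keep the highest-scoring ones.

    edges: iterable of (source, target, evalue) tuples.
    Returns a list of (source, target, -log10(E)) tuples, best first.
    """
    scored = [(s, t, neg_log10_evalue(e)) for s, t, e in edges]
    kept = [edge for edge in scored if edge[2] >= cutoff]
    kept.sort(key=lambda edge: edge[2], reverse=True)
    return kept[:max_edges]

edges = [("a", "b", 1e-50), ("a", "c", 1e-25), ("b", "c", 1e-5)]
all_kept = filter_and_prune(edges)                         # drops only the 1e-5 edge
pruned = filter_and_prune(edges, cutoff=20.0, max_edges=1)  # keeps only the best edge
```

The same function can be rerun with a stricter cutoff to get a sparser view of a network that is too dense to interpret.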
Molecule Similarity Networks:
For MSNs, Small Molecule Subgraph Detector (SMSD) is used to calculate the similarity (Tanimoto coefficient) between each pair of molecules in the relevant set. For more information, see Rahman et al., Journal of Cheminformatics 1(1):12 (2009).
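For reference, the Tanimoto coefficient of two feature sets A and B is |A ∩ B| / |A ∪ B|, ranging from 0 (nothing shared) to 1 (identical sets). The sketch below computes it for two sets of hypothetical feature labels. How SMSD derives the features it compares is not covered here, so the inputs are purely illustrative.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient of two sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty sets are treated as identical
    return len(a & b) / len(union)

# Hypothetical feature labels for two molecules.
mol1 = {"C-C", "C-O", "C=O", "O-H"}
mol2 = {"C-C", "C-O", "C-N"}
score = tanimoto(mol1, mol2)  # 2 shared features out of 5 distinct features
```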
Reaction Similarity Networks:
For RSNs, Reaction Decoder Tool (RDT) is used to calculate the similarity of each pair of reactions in the relevant set. For more information, see Rahman et al., Bioinformatics 32(13):2065-6 (2016).
Network Caveats
- SSNs from the SFLD are created using full-length sequences. In some cases, nodes within the network may represent sequences with multiple functional domains. Thus, users should be aware that a given edge in a network may represent the similarity between any (or all) regions of the sequences represented by the connected nodes, even parts outside of the functional domain of interest. A network created using only the sequences of the functional domains of interest may have a different topology than a network based on the full-length sequences.
- SSNs are not based on an explicit evolutionary model. Networks are not a substitute for phylogenetic trees, as they cannot be used to infer evolutionary history.
- 1SPN networks may not provide an even coverage of sequence space. Certain well-studied organisms such as E. coli have many highly similar sequenced strains. Because 1SPN networks have a node for each nonidentical sequence in the protein set of interest, they may be dense in areas representing sequences from well-studied organisms, but sparse in other areas.
- Depending on the percent identity cutoff used to generate a representative SSN, a single node may represent a very diverse collection of sequences. Thus a single representative node may contain sequences of different species, and even different functions. While such networks may be ideal for examining the relationships between different subgroups within a diverse superfamily, a 1SPN network may be more suited to examining the relationships between smaller, more closely related groups.
- Network topology, as visualized in Cytoscape, is not exactly reproducible, as most Cytoscape layouts are non-deterministic. Further, addition and/or deletion of sequences in a given group due to routine updates may introduce slight changes in the topology of the associated network.
- The graphical views of networks are two-dimensional representations of an N-dimensional space. Some information is lost in projecting N-dimensional data down to two dimensions.
- The similarity cutoff for networks in the SFLD is chosen in an automated manner designed to retain as much information as possible while still providing networks that can be viewed and manipulated on standard computers. Consequently, the similarity cutoff may not provide a view of the network that is immediately useful. You can learn how to filter edges to obtain a more useful view in Part 2 of the SFLD/Cytoscape tutorial series.
- The Reaction Decoder Tool used in the creation of SFLD RSNs can only compare those reactions for which there is full molecular detail and a balanced reaction. Thus, incomplete / unbalanced reactions will be missing from RSNs.
- As subgroups and families typically do not have many unique reactions defined within them, molecule and reaction similarity networks are only available at the superfamily level.
Node and Edge Attributes
The nodes (proteins) and edges (similarities) in networks from the SFLD are annotated with attributes that can be used for filtering, coloring, etc. in Cytoscape. Individual attributes are described in the tables below for 1SPN and representative SSNs, RSNs, and MSNs. Note that attributes in networks derived from publications may differ somewhat from those described below.
Attributes in One Sequence per Node (1SPN) Networks
Node Attribute Name | Description |
---|---|
id | A unique identifier for the node. |
Name | The name of the protein in the SFLD. |
gi | A list of all valid NCBI GI numbers. |
EFD ID | SFLD unique numerical identifier for the enzyme functional domain (EFD); searching the SFLD by EFD ID is faster than searching by GI number. |
NYSGRC target ID | Identifier of the protein in the New York Structural Genomics Research Consortium (NYSGRC). |
Family | The name of the family of the EFD in the SFLD; “unclassified” if none. |
PDB IDs | A list of the Protein Data Bank identifiers. |
Species | A list of the organisms associated with the protein sequence. There can be multiple species with proteins of the same sequence. |
Subgroup | The name of the subgroup of the EFD in the SFLD; “unclassified” if none. |
full length | The length of the entire protein in which the enzyme functional domain (EFD) is found. |
Domain length | The length of the enzyme functional domain (EFD); may be the same as full length. |
functional residues | A list of residues identified as being important for function; blank if none. |
Family evidence code | Family assignment evidence code describing how the EFD was assigned to a family in the SFLD; blank if no family assignment. |
Microbes Online | Identifier for the protein in the Microbes Online database. |
The Seed | Identifier for the protein in the SEED database. |
Superfamily evidence code | Superfamily assignment evidence code describing how the EFD was assigned to a superfamily in the SFLD. |
Swissprot | Identifier for the protein in the Swiss-Prot database (the manually curated section of the UniProt Knowledgebase). |
Swissprot protein name | The name of the protein sequence in the Swiss-Prot database (the manually curated section of the UniProt Knowledgebase). |
UniProtKB | Identifier for the protein sequence in the UniProt Knowledgebase. |
Taxons | Taxon identifier(s) for the protein sequence in the NCBI Taxonomy database. |
Type of life | Type(s) of life corresponding to the Taxons (division of life, if available, otherwise domain of life); blank if none. |
DNA Source | Sources of DNA for the protein; possible values include MTDF (Macromolecular Therapeutics Development Facility). |
Attributes in Representative Networks
Node Attribute Name | Description |
---|---|
Dominant Family name | The SFLD family assignment that applies to the largest number of enzyme functional domains (EFDs) represented by the node. |
EFD IDs | A list of SFLD enzyme functional domain (EFD) identifiers in the representative node. |
EFD ID (len) | A list of SFLD enzyme functional domain (EFD) identifiers (length of full sequence in parentheses). These identifiers can be used to search the SFLD for specific sequences via the Search by Enzyme page. |
Dominant Kingdom | The Kingdom that applies to the largest number of sequences represented by the node. |
Dominant Species | The species name that applies to the largest number of sequences represented by the node. |
Dominant Subgroup name | The SFLD subgroup assignment that applies to the largest number of EFDs represented by the node. |
Dominant Type of Life | The Type of Life that applies to the largest number of sequences represented by the node. |
DNA Source | Sources of DNA for the sequences represented by the node; possible values include MTDF (Macromolecular Therapeutics Development Facility). |
Family evidence code | A list of family assignment evidence codes (which define how an EFD was associated with a family) corresponding to EFDs represented by the node, along with the percentage of sequences within the node each evidence code corresponds to. |
Family name | A list of the family names corresponding to EFDs represented by the node, along with the percentage of sequences within the node each family name corresponds to. |
Has DNA Source | A value of True indicates that the node represents at least one sequence that has DNA available from the Macromolecular Therapeutics Development Facility (MTDF). |
Has experimental Family evidence code | A value of True indicates that the node represents at least one sequence that has an experimental family assignment evidence code (CFM, IES, IGS). |
Has FSM Superfamily evidence code | A value of True indicates that the node represents at least one sequence that has a superfamily assignment evidence code of FSM (Founding Superfamily Member). |
Has Microbes online | A value of True indicates that the node represents at least one sequence that has information available in the Microbes Online database. |
Has PDB | A value of True indicates that the node represents at least one sequence with ≥ 95% ID to a structure in the Protein Data Bank. |
Has Swissprot | A value of True indicates that the node represents at least one sequence that is in the Swiss-Prot database (the manually curated section of the UniProt Knowledgebase). |
Has The Seed | A value of True indicates that the node represents at least one sequence that has information available in the SEED database. |
Kingdom | A list of the kingdoms corresponding to sequences represented by the node, along with the percentage of sequences within the node each kingdom corresponds to. |
Microbes online | A list of identifiers from the Microbes Online database. |
PDB IDs | A list of the Protein Data Bank identifiers. |
Species | A list of the species names corresponding to sequences represented by the node, along with the percentage of sequences within the node each species name corresponds to. |
Subgroup name | A list of the subgroup names corresponding to EFDs represented by the node, along with the percentage of sequences within the node each subgroup name corresponds to. |
Superfamily evidence code | A list of superfamily assignment evidence codes (which define how a given EFD was associated with a superfamily) corresponding to EFDs represented by the node, along with the percentage of sequences within the node each evidence code corresponds to. |
Swissprot | A list of Swiss-Prot identifiers corresponding to sequences represented by the node. Swiss-Prot is the manually curated section of the UniProt Knowledgebase. |
Swissprot protein name | A list of Swiss-Prot protein names corresponding to the sequences represented by the node. |
The Seed | A list of identifiers from the SEED database. |
Type of Life | For each sequence represented by a given node, this field lists the type of life designation (division, if it exists, otherwise domain) corresponding to the taxon ID in the NCBI Taxonomy database. |
UniProtKB | A list of UniProt Knowledgebase identifiers. |
gi | A list of NCBI GI numbers. |
node size | The number of protein sequences represented by the node. |
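The "Dominant" attributes and the percentage lists in the table above can be thought of as simple tallies over the sequences in a node. A minimal sketch, assuming each member sequence carries exactly one family label and the node is non-empty (the labels and the output structure are hypothetical and do not reproduce the SFLD's exact attribute format):

```python
from collections import Counter

def summarize(labels):
    """Return (dominant label, {label: percent of members}) for one representative node."""
    counts = Counter(labels)
    total = sum(counts.values())
    percents = {label: 100.0 * n / total for label, n in counts.items()}
    # The dominant label is the one applying to the largest number of members.
    dominant = counts.most_common(1)[0][0]
    return dominant, percents

# Hypothetical family labels for the four sequences in one node.
families = ["family X", "family X", "family X", "unclassified"]
dominant, percents = summarize(families)
```

Here the node would be reported with a dominant family of "family X", with the per-family breakdown kept alongside it so that minority annotations remain visible.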
Attributes in Reaction Similarity Networks (RSNs)
Attributes in Molecule Similarity Networks (MSNs)
Edge Attribute Name | Description |
---|---|
tanimoto | The Tanimoto coefficient for a comparison of two molecules. |