SFLD - Sequence Similarity Networks

Networks in the SFLD

Network Definitions and Visualization

A network is an instantiation of the abstract data type called a graph, which links individuals (nodes) through their connections (edges). The SFLD provides three types of networks. In a molecule similarity network (MSN), the nodes represent molecules (reactants, products, or both) and the edges represent similarities in chemical structure. In a reaction similarity network (RSN), nodes represent reactions and edges represent similarity between a given reaction pair. In a sequence similarity network (SSN), nodes represent proteins and edges indicate similarity in amino acid sequence. SFLD SSNs are further subdivided into two types:

One Sequence per Node (1SPN) – each node in the network represents a unique protein sequence, and each edge represents the similarity between the connected sequences
Representative – each node in the network represents a set of related protein sequences, and each edge represents the average similarity between the sequences in the connected nodes

MSNs, RSNs, and SSNs provide a graphical view of the similarity relationships within a set of molecules, reactions, or proteins, and can facilitate large-scale analyses. Networks can also be studied analytically using a rich collection of algorithms.

The networks are provided in XGMML or CYS format for import into the program Cytoscape. This program allows interactive visualization, manipulation, and analysis of networks and their attributes (see the Tutorials for step-by-step examples). To handle larger networks, it may be necessary to increase the memory allocation for Cytoscape.
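Outside Cytoscape, an XGMML file can also be read with a general XML parser. The following is a minimal sketch, assuming the conventional XGMML namespace and a layout in which each node and edge carries its attributes as child att elements; the sample document and the function name read_xgmml are made up for illustration, and real SFLD files may contain additional elements.

```python
import xml.etree.ElementTree as ET

# Namespace conventionally used by XGMML documents (assumed here).
XGMML_NS = "{http://www.cs.rpi.edu/XGMML}"


def read_xgmml(text):
    """Parse an XGMML string into (nodes, edges).

    nodes: {node_id: {attribute_name: value}}
    edges: [(source_id, target_id, {attribute_name: value})]
    """
    root = ET.fromstring(text)
    nodes = {}
    edges = []
    for node in root.iter(XGMML_NS + "node"):
        attrs = {a.get("name"): a.get("value") for a in node.findall(XGMML_NS + "att")}
        nodes[node.get("id")] = attrs
    for edge in root.iter(XGMML_NS + "edge"):
        attrs = {a.get("name"): a.get("value") for a in edge.findall(XGMML_NS + "att")}
        edges.append((edge.get("source"), edge.get("target"), attrs))
    return nodes, edges


# Hypothetical two-node network, not taken from the SFLD.
SAMPLE = """<graph label="toy" xmlns="http://www.cs.rpi.edu/XGMML">
  <node id="1" label="A"><att name="Family" value="fam1"/></node>
  <node id="2" label="B"><att name="Family" value="unclassified"/></node>
  <edge source="1" target="2" label="A-B"><att name="bit score" value="75.5"/></edge>
</graph>"""

nodes, edges = read_xgmml(SAMPLE)
```

All attribute values are returned as strings, so numeric attributes such as scores need to be converted before filtering.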

Why Use Networks?

SSNs can be used to examine the relationships within large, diverse sets of sequences for which the costs of traditional methods of analysis such as phylogenetic trees would be prohibitive, due to the difficulty in generating accurate multiple alignments. In addition, the SSNs provide a graphical overview of interrelationships among and between sets of proteins that are not easily discerned from visual inspection of large trees and multiple alignments.

The two-dimensional distances in these networks correlate well with the distances in phylogenetic trees, indicating that they are mathematically reasonable. SSNs provide the user with a flexible, interactive view, where both nodes and edges can be overlaid with orthogonal information (for example, functional annotation), making networks a powerful tool for hypothesis generation. For more information, see Atkinson et al., PLoS One 4:e4345 (2009) and Copp et al., Biochemistry 57(31):4651-4662 (2018).

While the number of molecules or reactions in SFLD MSNs and RSNs is typically not large, networks still provide a useful graphical overview of molecule and reaction similarity, which is otherwise difficult to visualize.

SFLD Network Creation

Sequence Similarity Networks:

For SSNs, BLAST scores are used as the measure of sequence similarity. BLAST searches are performed by comparing batches of query sequences to a database consisting of all processed sequences in the protein set of interest (for example, a superfamily) using the blastp program. The equivalent blast2seq E-value for each pairwise comparison is then calculated and assigned to edges to create the 1SPN network.
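As a rough illustration of how pairwise BLAST results become network edges, the sketch below collapses hits into one undirected edge per sequence pair and applies a score cutoff. The tuple layout, the function name build_1spn_edges, and the rule of keeping the highest-scoring hit per pair are assumptions made for this example; it does not reproduce the blast2seq-equivalent E-value calculation described above.

```python
def build_1spn_edges(hits, min_bit_score=60.0):
    """Collapse BLAST hits into one undirected edge per sequence pair.

    hits: iterable of (query_id, subject_id, evalue, bit_score) tuples
          (a hypothetical layout chosen for this sketch).
    Returns {(id_a, id_b): {"blast e-value": ..., "bit score": ...}}.
    """
    best = {}
    for query, subject, evalue, bit_score in hits:
        if query == subject:
            continue  # self-hits do not form edges
        pair = tuple(sorted((query, subject)))  # A-B and B-A are the same edge
        if pair not in best or bit_score > best[pair]["bit score"]:
            best[pair] = {"blast e-value": evalue, "bit score": bit_score}
    # Keep only edges at or above the cutoff.
    return {pair: attrs for pair, attrs in best.items()
            if attrs["bit score"] >= min_bit_score}


# Made-up hits: a reciprocal pair, a self-hit, and a weak hit.
hits = [
    ("A", "B", 1e-30, 120.0),
    ("B", "A", 1e-28, 110.0),
    ("A", "A", 0.0, 500.0),
    ("A", "C", 1e-3, 35.0),
]
edges_1spn = build_1spn_edges(hits)
```

Here only the A-B edge survives: the self-hit is dropped, the reciprocal hits are merged, and the A-C hit falls below the cutoff.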

Representative networks may be provided at different percent identity cutoffs. The cutoff indicates how diverse a set of proteins can be and still be “collapsed” into a single node. For example, in a representative network at 50% ID, each of the proteins associated with a given node has ≥50% sequence identity with the protein used to seed that node. Thus, for the same set of protein sequences, a lower percent identity cutoff gives a representative network with fewer nodes. Clustering by sequence identity is done with CD-HIT. An in-house version of the Pythoscape framework, tailored for use with available hardware, is used to generate the representative network and to compute various node and edge statistics, which are then assigned as attributes.
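The effect of the percent identity cutoff can be seen in a toy version of greedy seed-based clustering. This is only a sketch of the general idea (longest sequence first, each sequence joins the first seed it matches at or above the cutoff); it is not CD-HIT's implementation, and the sequence identifiers, lengths, and identity values below are invented.

```python
def greedy_cluster(lengths, identity, cutoff):
    """Group sequences around seed sequences.

    lengths:  {seq_id: sequence length}
    identity: {(id_a, id_b): fractional identity}; missing pairs count as 0.0
    cutoff:   minimum fractional identity to a seed (e.g. 0.5 for 50% ID)
    Returns {seed_id: [member ids, seed first]}.
    """
    def ident(a, b):
        return identity.get((a, b), identity.get((b, a), 0.0))

    clusters = {}
    # Process longest sequences first so they become the seeds.
    for seq_id in sorted(lengths, key=lambda s: (-lengths[s], s)):
        for seed in clusters:
            if ident(seq_id, seed) >= cutoff:
                clusters[seed].append(seq_id)
                break
        else:
            clusters[seq_id] = [seq_id]  # no seed matched: start a new node
    return clusters


# Invented example data.
lengths = {"p1": 300, "p2": 290, "p3": 280, "p4": 150}
identity = {("p1", "p2"): 0.92, ("p1", "p3"): 0.55,
            ("p2", "p3"): 0.60, ("p1", "p4"): 0.20}

strict = greedy_cluster(lengths, identity, 0.9)  # three nodes
loose = greedy_cluster(lengths, identity, 0.5)   # two nodes
```

With these values, the 90% cutoff yields three representative nodes and the 50% cutoff yields two, which matches the statement that a lower cutoff gives fewer nodes.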

In general, networks of up to 250,000-500,000 (250-500K) edges are manageable on recent laptops, with the memory allocation for Cytoscape increased as needed. As many of our superfamily networks would be too large to be opened or readily manipulated as 1SPN networks using Cytoscape, we provide the following:

1SPN networks are created for groups with <2500 sequences (nodes), at a BLAST bit score cutoff of 60. If necessary, edges are pruned to ensure the network contains ≤250K edges.
50% ID representative SSNs are provided for all groups, at a -log10(E-value) cutoff of 20. If necessary, edges are pruned to ensure the network contains ≤250K edges.
Additional SSNs are available for those groups with recently published papers from the Babbitt Lab, and contain the specific sequence set, annotation information, and edge score cutoffs used in the relevant paper.
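The text above does not say how edges are chosen for pruning. One plausible approach, shown here purely as an assumption, is to keep the highest-scoring edges until the limit is reached; the function name prune_edges and the edge tuples are hypothetical.

```python
import heapq


def prune_edges(edges, max_edges):
    """Keep at most max_edges edges, preferring the highest scores.

    edges: list of (source_id, target_id, score) tuples, where a larger
           score means greater similarity (e.g. bit score or -log10 E-value).
    """
    if len(edges) <= max_edges:
        return list(edges)
    # nlargest returns the retained edges sorted from strongest to weakest.
    return heapq.nlargest(max_edges, edges, key=lambda e: e[2])


# Invented example: three edges pruned down to two.
toy_edges = [("a", "b", 10.0), ("a", "c", 50.0), ("b", "c", 30.0)]
kept = prune_edges(toy_edges, 2)
```

Dropping the weakest edges first preserves the strongest similarity relationships, at the cost of possibly disconnecting distantly related nodes.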

Molecule Similarity Networks:

For MSNs, Small Molecule Subgraph Detector (SMSD) is used to calculate the similarity (Tanimoto coefficient) between each pair of molecules in the relevant set. For more information, see Rahman et al., Journal of Cheminformatics 1(1):12 (2009).
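For orientation, the Tanimoto coefficient of two feature sets is the size of their intersection divided by the size of their union. The sketch below uses plain Python sets of made-up bond labels; SMSD derives its counts from a common-subgraph comparison of the two molecules, so its values will not generally match this simplified set-based version.

```python
def tanimoto(features_a, features_b):
    """Tanimoto coefficient of two feature collections (0.0 to 1.0)."""
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen for this sketch when both are empty
    return len(a & b) / len(union)


# Invented feature sets: two shared features out of four distinct ones.
mol_1 = {"C-C", "C-O", "C=O"}
mol_2 = {"C-C", "C-O", "C-N"}
score = tanimoto(mol_1, mol_2)  # 2 / 4 = 0.5
```

A value of 1.0 means the two feature sets are identical, and 0.0 means they share nothing.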

Reaction Similarity Networks:

For RSNs, Reaction Decoder Tool (RDT) is used to calculate the similarity of each pair of reactions in the relevant set. For more information, see Rahman et al., Bioinformatics 32(13):2065-6 (2016).

Network Caveats

SSNs from the SFLD are created using full-length sequences. In some cases, nodes within the network may represent sequences with multiple functional domains. Thus, users should be aware that a given edge in a network may represent the similarity between any (or all) regions of the sequences represented by the connected nodes, even parts outside of the functional domain of interest. A network created using only the sequences of the functional domains of interest may have a different topology than a network based on the full-length sequences.
SSNs are not based on an explicit evolutionary model. Networks are not a substitute for phylogenetic trees, as they cannot be used to infer evolutionary history.
1SPN networks may not provide an even coverage of sequence space. Certain well-studied organisms such as E. coli have many highly similar sequenced strains. Because 1SPN networks have a node for each nonidentical sequence in the protein set of interest, they may be dense in areas representing sequences from well-studied organisms, but sparse in other areas.
Depending on the percent identity cutoff used to generate a representative SSN, a single node may represent a very diverse collection of sequences. Thus a single representative node may contain sequences of different species, and even different functions. While such networks may be ideal for examining the relationships between different subgroups within a diverse superfamily, a 1SPN network may be more suited to examining the relationships between smaller, more closely related groups.
Network topology, as visualized in Cytoscape, is not exactly reproducible, as most Cytoscape layouts are non-deterministic. Further, addition and/or deletion of sequences in a given group due to routine updates may introduce slight changes in the topology of the associated network.
The graphical views of networks are two-dimensional representations of an N-dimensional space. Some information is lost in projecting N-dimensional data down to two dimensions.
The similarity cutoff for networks in the SFLD is chosen in an automated manner designed to retain as much information as possible while still providing networks that can be viewed and manipulated on standard computers. Consequently, the similarity cutoff may not provide a view of the network that is immediately useful. You can learn how to filter edges to obtain a more useful view in Part 2 of the SFLD/Cytoscape tutorial series.
The Reaction Decoder Tool used in the creation of SFLD RSNs can only compare those reactions for which there is full molecular detail and a balanced reaction. Thus, incomplete / unbalanced reactions will be missing from RSNs.
As subgroups and families typically do not have many unique reactions defined within them, molecule and reaction similarity networks are only available at the superfamily level.
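The caveat about automatically chosen similarity cutoffs can be made concrete with a small example: raising the edge score threshold removes weak edges and can split one large cluster into several. The sketch below counts connected components with a simple union-find; the node names, scores, and the function name components are invented for illustration.

```python
def components(nodes, edges, cutoff):
    """Return connected components using only edges with score >= cutoff.

    nodes: list of node ids
    edges: list of (source_id, target_id, score) tuples
    """
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the root, shortening the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b, score in edges:
        if score >= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(groups.values(), key=lambda g: sorted(g))


# Invented network: a chain A-B-C-D with one weak link in the middle.
toy_nodes = ["A", "B", "C", "D"]
toy_scores = [("A", "B", 80.0), ("B", "C", 25.0), ("C", "D", 60.0)]

permissive = components(toy_nodes, toy_scores, 20.0)  # one cluster
stringent = components(toy_nodes, toy_scores, 40.0)   # two clusters
```

At the permissive cutoff all four nodes form a single cluster, while the more stringent cutoff removes the weak B-C edge and separates the network into two groups.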

Node and Edge Attributes

The nodes (proteins) and edges (similarities) in networks from the SFLD are annotated with attributes that can be used for filtering, coloring, etc. in Cytoscape. Individual attributes are described in the tables below for 1SPN and representative SSNs, RSNs, and MSNs. Note that attributes in networks derived from publications may differ somewhat from those described below.

Attributes in One Sequence per Node (1SPN) Networks

Node Attribute Name	Description
id	A unique identifier for the node.
Name	The name of the protein in the SFLD.
gi	A list of all valid NCBI GI numbers for the protein; blank if none.
EFD ID	SFLD unique numerical identifier for the enzyme functional domain (EFD); searching the SFLD by EFD ID is faster than searching by GI number.
NYSGRC target ID	Identifier of the protein in the New York Structural Genomics Research Consortium pipeline; blank if none.
Family	The name of the family of the EFD in the SFLD; “unclassified” if none.
PDB IDs	A list of the Protein Data Bank identifiers of structures with at least 95% sequence identity with the protein represented by the node.
Species	A list of the organisms associated with the protein sequence. There can be multiple species with proteins of the same sequence.
Subgroup	The name of the subgroup of the EFD in the SFLD; “unclassified” if none.
full length	The length of the entire protein in which the enzyme functional domain (EFD) is found.
Domain length	The length of the enzyme functional domain (EFD); may be the same as full length.
functional residues	A list of residues identified as being important for function; blank if none.
Family evidence code	Family assignment evidence code describing how the EFD was assigned to a family in the SFLD; blank if no family assignment.
Microbes Online	Identifier for the protein in the Microbes Online database; blank if none.
The Seed	Identifier for the protein in the SEED database; blank if none.
Superfamily evidence code	Superfamily assignment evidence code describing how the EFD was assigned to a superfamily in the SFLD.
Swissprot	Identifier for the protein in the Swiss-Prot database (the manually curated section of the UniProt Knowledgebase); blank if none.
Swissprot protein name	The name of the protein sequence in the Swiss-Prot database (the manually curated section of the UniProt Knowledgebase); blank if none.
UniProtKB	Identifier for the protein sequence in the UniProt Knowledgebase; blank if none. There can be multiple identifiers for the same sequence.
Taxons	Taxon identifier(s) for the protein sequence in the NCBI Taxonomy database. There can be multiple taxa with proteins of the same sequence.
Type of life	Type(s) of life corresponding to the Taxons (division of life, if available, otherwise domain of life); blank if none.
DNA Source	Sources of DNA for the protein; possible values are MTDF (Macromolecular Therapeutics Development Facility, “in house” NYSGRC) and/or ATCC (American Type Culture Collection) or blank (none).

Edge Attribute Name	Description
alignment_len	The length of the sequence alignment found by BLAST.
% id	The fraction (range 0–1) of the aligned positions that are identical between the two sequences.
bit score	The BLAST bit score.
blast e-value	The BLAST E-value.

Attributes in Representative Networks

Node Attribute Name	Description
Dominant Family name	The SFLD family assignment that applies to the largest number of enzyme functional domains (EFDs) represented by the node.
EFD IDs	A list of SFLD enzyme functional domain (EFD) identifiers in the representative node.
EFD ID (len)	A list of SFLD enzyme functional domain (EFD) identifiers (length of full sequence in parentheses). These identifiers can be used to search the SFLD for specific sequences via the Search by Enzyme page.
Dominant Kingdom	The Kingdom that applies to the largest number of sequences represented by the node.
Dominant Species	The species name that applies to the largest number of sequences represented by the node.
Dominant Subgroup name	The SFLD subgroup assignment that applies to the largest number of EFDs represented by the node.
Dominant Type of Life	The Type of Life that applies to the largest number of sequences represented by the node.
DNA Source	Sources of DNA for the sequences represented by the node; possible values are MTDF (Macromolecular Therapeutics Development Facility, “in house” NYSGRC) and/or ATCC (American Type Culture Collection) or blank (none).
Family evidence code	A list of family assignment evidence codes (which define how an EFD was associated with a family) corresponding to EFDs represented by the node, along with the percentage of sequences within the node each evidence code corresponds to.
Family name	A list of the family names corresponding to EFDs represented by the node, along with the percentage of sequences within the node each family name corresponds to.
Has DNA Source	A value of True indicates that the node represents at least one sequence that has DNA available from the Macromolecular Therapeutics Development Facility (“in house” NYSGRC) or the American Type Culture Collection.
Has experimental Family evidence code	A value of True indicates that the node represents at least one sequence that has an experimental family assignment evidence code (CFM, IES, IGS).
Has FSM Superfamily evidence code	A value of True indicates that the node represents at least one sequence that has a superfamily assignment evidence code of FSM (Founding Superfamily Member).
Has Microbes online	A value of True indicates that the node represents at least one sequence that has information available in the Microbes Online database.
Has PDB	A value of True indicates that the node represents at least one sequence with ≥95% ID to a structure in the Protein Data Bank.
Has Swissprot	A value of True indicates that the node represents at least one sequence that is in the Swiss-Prot database (the manually curated section of the UniProt Knowledgebase).
Has The Seed	A value of True indicates that the node represents at least one sequence that has information available in the SEED database.
Kingdom	A list of the kingdoms corresponding to sequences represented by the node, along with the percentage of sequences within the node each kingdom corresponds to.
Microbes online	A list of identifiers from the Microbes Online database corresponding to the sequences represented by the node.
PDB IDs	A list of the Protein Data Bank identifiers corresponding to sequences represented by the node (where the match between PDB and sequence has at least 95% ID over at least 95% of the length of the shorter sequence).
Species	A list of the species names corresponding to sequences represented by the node, along with the percentage of sequences within the node each species name corresponds to.
Subgroup name	A list of the subgroup names corresponding to EFDs represented by the node, along with the percentage of sequences within the node each subgroup name corresponds to.
Superfamily evidence code	A list of superfamily assignment evidence codes (which define how a given EFD was associated with a superfamily) corresponding to EFDs represented by the node, along with the percentage of sequences within the node each evidence code corresponds to.
Swissprot	A list of Swiss-Prot identifiers corresponding to sequences represented by the node. Swiss-Prot is the manually curated section of the UniProt Knowledgebase.
Swissprot protein name	A list of Swiss-Prot protein names corresponding to the sequences represented by the node.
The Seed	A list of identifiers from the SEED database corresponding to the sequences represented by the node.
Type of Life	For each sequence represented by a given node, this field lists the type of life designation (division, if it exists, otherwise domain) corresponding to the taxon ID in the NCBI Taxonomy database, along with the percentage of sequences within the node each type of life designation corresponds to.
UniProtKB	A list of UniProt Knowledgebase identifiers corresponding to sequences represented by the node.
gi	A list of NCBI gi numbers corresponding to the sequences represented by the node.
node size	The number of protein sequences represented by the node.

Edge Attribute Name	Description
rep-net-count	The number of BLAST connections summarized by the representative edge.
rep-net-max	The –log₁₀ of the most significant BLAST E-value connection between any two sequences in the connected nodes.
rep-net-mean	The average –log₁₀ of all of the BLAST E-value connections between all the sequences in the connected nodes.
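The three representative edge attributes can be illustrated with a short calculation over the E-values of the BLAST connections between two nodes. This is a sketch under stated assumptions: the function name rep_edge_stats is hypothetical, and the floor parameter, which stands in for E-values reported as 0, is a choice made here rather than documented SFLD behavior.

```python
import math


def rep_edge_stats(evalues, floor=1e-300):
    """Summarize the BLAST connections between two representative nodes.

    evalues: non-empty list of E-values, one per connection between a
             sequence in one node and a sequence in the other.
    floor:   stand-in for E-values of 0, which have no finite logarithm
             (an assumption of this sketch).
    """
    scores = [-math.log10(max(e, floor)) for e in evalues]
    return {
        "rep-net-count": len(scores),
        "rep-net-max": max(scores),                 # most significant connection
        "rep-net-mean": sum(scores) / len(scores),  # average over all connections
    }


# Invented E-values for three connections between two nodes.
stats = rep_edge_stats([1e-30, 1e-20, 1e-40])
```

For these values the edge summarizes three connections, with a maximum of about 40 and a mean of about 30 on the -log10 scale.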

Attributes in Reaction Similarity Networks (RSNs)

Node Attribute Name	Description
reaction_name	The name of the reaction.
reaction_type	The reaction type. Allowed values are: cognate, other, unknown, generic.
EFD count	An integer specifying the number of EFDs associated with the reaction.
Families	A list of SFLD families with EFDs that catalyze the reaction.
Description	A text description of the reaction.
kegg_id	The reaction identifier from the KEGG database.
EFDs	A list of SFLD identifiers for EFDs that catalyze the reaction.
Family count	An integer specifying the number of SFLD families with EFDs that catalyze the reaction.
Subgroup count	An integer specifying the number of SFLD subgroups with EFDs that catalyze the reaction.
ec	The Enzyme Commission number for the reaction.
Superfamilies	A list of the SFLD superfamilies with EFDs that catalyze the reaction.
directionality	The directionality of the reaction. Allowed values are: reversible, forward, backward, unknown.
evidence code (ref ids)	A list of evidence codes and the associated SFLD reference IDs for EFDs that catalyze the reaction.
products	A list of the reaction products and their stoichiometry.
rhea_id	The reaction identifier from the RHEA database.
Superfamily_count	An integer specifying the number of superfamilies with EFDs that catalyze the reaction.
substrates	A list of the reaction substrates and their stoichiometry.
Subgroups	A list of SFLD subgroups with EFDs that catalyze the reaction.

Edge Attribute Name	Description
bond	The bond change similarity between the two reactions, based on comparison of the bond changes (bonds formed/cleaved, order changes, and stereo changes) calculated by RDT.
reaction_2	The SFLD reaction identifier for the second reaction being compared.
center	The reaction center change similarity between the two reactions, calculated by RDT.
structure	The small molecule change similarity between the two reactions (based on comparison of the chemical structure of the small molecule moieties in the reactions) calculated by RDT.
reaction_1	The SFLD reaction identifier for the first reaction being compared.

Attributes in Molecule Similarity Networks (MSNs)

Node Attribute Name	Description
smiles	The SMILES string for the molecule.
Reaction count	An integer specifying the number of SFLD reactions that include the molecule.
Subgroup count	An integer specifying the number of SFLD subgroups with EFDs that catalyze a reaction containing the molecule.
Superfamilies	A list of the SFLD superfamilies with EFDs that catalyze a reaction containing the molecule.
Superfamily_count	An integer specifying the number of superfamilies with EFDs that catalyze a reaction containing the molecule.
role	The role of the molecule within its reaction (e.g., substrate, product).
compound_name	The name of the molecule.
Reaction IDs	A list of SFLD reaction identifiers for all reactions that contain the molecule.
Subgroups	A list of the SFLD subgroups with EFDs that catalyze a reaction containing the molecule.
Families	A list of the SFLD families with EFDs that catalyze a reaction containing the molecule.
Family count	An integer specifying the number of SFLD families with EFDs that catalyze a reaction containing the molecule.

Edge Attribute Name	Description
tanimoto	The Tanimoto coefficient for a comparison of two molecules.