Sequence Similarity Networks in the SFLD

Network Definitions and Visualization

A network is an instantiation of the abstract data type called a graph, which links individuals (nodes) through their connections (edges). In a protein similarity network, the nodes represent proteins and the edges represent protein-protein similarity. Similarity can be based on a variety of protein properties, ranging from complex multi-domain architectures to amino acid sequences. A sequence similarity network (SSN), the type of network available from the SFLD, is a protein similarity network in which the edges indicate similarity in amino acid sequence. Such networks provide a graphical view of the similarity relationships within a set of proteins and can facilitate large-scale analyses. Networks can also be studied analytically using a rich collection of algorithms.
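As one illustration of the analytical side, the sketch below stores a small similarity network as an adjacency map and finds its connected components once weak edges are dropped. The protein names, scores, and threshold are invented for illustration, and the function names are not part of any SFLD tool.

```python
from collections import defaultdict, deque

# Hypothetical edges: (protein_a, protein_b, similarity score such as -log10(E-value)).
TOY_EDGES = [
    ("P1", "P2", 80.0),
    ("P2", "P3", 65.0),
    ("P3", "P4", 12.0),
    ("P4", "P5", 70.0),
]

def build_adjacency(edges, min_score):
    """Return {node: set(neighbors)}, keeping only edges scoring at least min_score."""
    adjacency = defaultdict(set)
    for a, b, score in edges:
        # Touch both keys so that proteins left without edges still appear as nodes.
        adjacency[a]
        adjacency[b]
        if score >= min_score:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency

def connected_components(adjacency):
    """Group nodes into clusters using breadth-first search."""
    seen = set()
    components = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        queue = deque([start])
        seen.add(start)
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in sorted(adjacency[node]):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(component))
    return components

clusters = connected_components(build_adjacency(TOY_EDGES, min_score=50.0))
print(clusters)
```

Raising or lowering `min_score` splits or merges the clusters, which is the same effect a user sees when changing the edge threshold in a network viewer.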

Two types of SSNs are available from the SFLD:

  • One Sequence per Node (1SPN) – each node in the network represents a unique protein sequence, and each edge represents the similarity between the connected sequences
  • Representative – each node in the network represents a set of related protein sequences, and each edge represents the average similarity between the sequences in the connected nodes

The networks are provided in XGMML format for import into the program Cytoscape. This program allows interactive visualization, manipulation, and analysis of networks and their attributes (see the Tutorials for step-by-step examples). To handle larger networks, it may be necessary to increase the memory allocation for Cytoscape.
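For users who want to inspect a download outside Cytoscape, the following is a minimal sketch of reading XGMML with Python's standard library. XGMML is XML with `node` and `edge` elements that carry `att` children; the sample document, identifiers, and attribute values below are invented, and a real SFLD file will contain many more attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written XGMML fragment used only for illustration.
SAMPLE_XGMML = """<graph label="toy network" xmlns="http://www.cs.rpi.edu/XGMML">
  <node id="n1" label="protein A">
    <att name="Family" type="string" value="enolase"/>
  </node>
  <node id="n2" label="protein B">
    <att name="Family" type="string" value="unclassified"/>
  </node>
  <edge source="n1" target="n2" label="n1-n2">
    <att name="-log10(E)" type="real" value="42.0"/>
  </edge>
</graph>
"""

def local_name(tag):
    """Strip an XML namespace such as '{http://...}node' down to 'node'."""
    return tag.rsplit("}", 1)[-1]

def read_attributes(element):
    """Collect the <att name=... value=...> children of a node or edge."""
    return {
        child.get("name"): child.get("value")
        for child in element
        if local_name(child.tag) == "att"
    }

def parse_xgmml(text):
    """Return (nodes, edges) parsed from an XGMML string."""
    root = ET.fromstring(text)
    nodes = {}
    edges = []
    for element in root.iter():
        kind = local_name(element.tag)
        if kind == "node":
            nodes[element.get("id")] = {
                "label": element.get("label"),
                "attributes": read_attributes(element),
            }
        elif kind == "edge":
            edges.append({
                "source": element.get("source"),
                "target": element.get("target"),
                "attributes": read_attributes(element),
            })
    return nodes, edges

nodes, edges = parse_xgmml(SAMPLE_XGMML)
print(len(nodes), "nodes,", len(edges), "edges")
```

Attribute values are returned as strings; converting them (for example with `float`) is left to the caller because the `type` declared on each `att` element varies.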

Why Use Networks?

SSNs can be used to examine the relationships within large, diverse sets of sequences for which the costs of traditional methods of analysis such as phylogenetic trees would be prohibitive, due to the difficulty in generating accurate multiple alignments. In addition, the networks provide a graphical overview of interrelationships among and between sets of proteins that are not easily discerned from visual inspection of large trees and multiple alignments.

The two-dimensional distances in these networks correlate well with the distances in phylogenetic trees, indicating that they are mathematically reasonable. SSNs provide the user with a flexible, interactive view, where both nodes and edges can be overlaid with orthogonal information (for example, functional annotation), making networks a powerful tool for hypothesis generation. For more information, see Atkinson et al., PLoS One 4:e4345 (2009).

SFLD Network Creation

BLAST scores are used as the measure of sequence similarity. BLAST searches are performed by comparing batches of query sequences to a database consisting of all processed sequences in the protein set of interest (for example, a superfamily) using the blastp program. The equivalent blast2seq E-value for each pairwise comparison is then calculated and assigned to edges to create the 1SPN network.
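The text does not spell out how the blast2seq-equivalent E-value is derived. As a rough sketch only, the code below assumes the standard relation between a bit score S' and the expected number of chance hits in a search space of two sequences of lengths m and n, E = m · n · 2^(−S'), and ignores the effective-length corrections that BLAST applies. The function names and the cap used for E-values of zero are illustrative choices, not SFLD settings.

```python
import math

def pairwise_e_value(bit_score, query_len, subject_len):
    """Approximate E-value for a two-sequence search space: E = m * n * 2**(-S')."""
    return query_len * subject_len * 2.0 ** (-bit_score)

def neg_log10(e_value, cap=300.0):
    """Return -log10(E); E-values reported as 0 are mapped to an arbitrary cap."""
    if e_value <= 0.0:
        return cap
    return -math.log10(e_value)

# Hypothetical hit: bit score 100 between two 300-residue sequences.
e_value = pairwise_e_value(100.0, 300, 300)
print(e_value, neg_log10(e_value))
```

Because this E-value depends only on the two sequence lengths and the score, it does not change when sequences are added to or removed from the database, which makes edge weights comparable across network builds.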

Representative networks may be provided at different percent identity cutoffs. The cutoffs indicate how diverse a set of proteins can be “collapsed” into a single node. For example, in a representative network at 50% ID, each of the proteins associated with a given node has ≥50% sequence identity with the protein used to seed that node. Thus, for the same set of protein sequences, a lower percent identity cutoff gives a representative network with fewer nodes. Clustering by sequence identity is done with CD-HIT, and the Pythoscape framework is used to generate the representative network and to compute various node and edge statistics, which are then assigned as attributes.
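The effect of the cutoff can be seen in a toy version of seed-based clustering. This is not CD-HIT itself: it only mimics the greedy idea of taking sequences in order of decreasing length and attaching each one to the first seed it matches, and it uses a deliberately naive, alignment-free identity measure. All sequences and names are invented.

```python
def toy_identity(a, b):
    """Fraction of identical positions over the shorter sequence (no alignment)."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / shorter

def greedy_cluster(sequences, cutoff):
    """Return {seed_name: [member_names]} for a dict of name -> sequence."""
    clusters = {}
    # Longest sequences first; ties are broken by name so the result is repeatable.
    for name in sorted(sequences, key=lambda n: (-len(sequences[n]), n)):
        for seed in clusters:
            if toy_identity(sequences[name], sequences[seed]) >= cutoff:
                clusters[seed].append(name)
                break
        else:
            # No existing seed is similar enough, so this sequence seeds a new node.
            clusters[name] = [name]
    return clusters

# Hypothetical sequences of equal length.
TOY_SEQUENCES = {
    "p1": "MKTAYIAKQR",
    "p2": "MKTAYIAKQL",
    "p3": "MKTAYLLLLL",
    "p4": "GGGGGGGGGG",
}

for cutoff in (0.9, 0.5):
    print(cutoff, greedy_cluster(TOY_SEQUENCES, cutoff))
```

At the 0.9 cutoff the four sequences form three nodes; at 0.5 the third sequence joins the first seed and only two nodes remain, matching the statement that a lower cutoff yields fewer nodes.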

In general, networks of up to 250,000-500,000 (250-500K) edges are manageable on recent laptops, with the memory allocation for Cytoscape increased as needed. Thus, as many of our superfamily networks are too large to be opened or readily manipulated using Cytoscape, we only provide networks of manageable size:

  • 1SPN networks are not created at the superfamily level for superfamilies with >2500 sequences (nodes), as these will contain too many edges at E-values representing biologically relevant connections.
  • To further decrease the size of networks, users may specify an E-value cutoff (edges representing BLAST connections with an E-value less significant than the cutoff will not be included in the XGMML file download). Alternatively, users may specify the maximum number of network edges directly (N = 250K, 500K, or 750K). In this case, only the top N edges after ranking by BLAST bit-score will be included in the XGMML file download. Note that it is possible for an edge to be omitted despite having the same bit-score as an edge that is retained.
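The two size-reduction options can be sketched as simple filters over an edge list. The edge records, field names, and values below are hypothetical; the sketch only illustrates why a fixed edge budget can split a group of edges with equal bit-scores.

```python
# Hypothetical edge records; field names are illustrative, not SFLD attribute names.
TOY_EDGES = [
    {"source": "A", "target": "B", "evalue": 1e-50, "bit_score": 200.0},
    {"source": "A", "target": "C", "evalue": 1e-20, "bit_score": 90.0},
    {"source": "B", "target": "C", "evalue": 1e-20, "bit_score": 90.0},
    {"source": "C", "target": "D", "evalue": 1e-3, "bit_score": 40.0},
]

def filter_by_evalue(edges, cutoff):
    """Keep edges whose E-value is at least as significant as the cutoff."""
    return [edge for edge in edges if edge["evalue"] <= cutoff]

def top_n_by_bit_score(edges, n):
    """Keep the n highest-scoring edges; ties at the boundary are cut arbitrarily."""
    ranked = sorted(edges, key=lambda edge: edge["bit_score"], reverse=True)
    return ranked[:n]

print(len(filter_by_evalue(TOY_EDGES, 1e-10)))
print([(e["source"], e["target"]) for e in top_n_by_bit_score(TOY_EDGES, 2)])
```

With a budget of two edges, A-C is kept and B-C is dropped even though both have a bit-score of 90, which is the situation the note above warns about.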

Network Caveats

  • SSNs from the SFLD are created using full-length sequences. In some cases, nodes within the network may represent sequences with multiple functional domains. Thus, users should be aware that a given edge in a network may represent the similarity between any (or all) regions of the sequences represented by the connected nodes, even parts outside of the functional domain of interest. A network created using only the sequences of the functional domains of interest may have a different topology than a network based on the full-length sequences.
  • SSNs are not based on an explicit evolutionary model. Networks are not a substitute for phylogenetic trees, as they cannot be used to infer evolutionary history.
  • The graphical views of SSNs are two-dimensional representations of an N-dimensional space. Some information is lost in projecting N-dimensional data down to two dimensions.
  • 1SPN networks may not provide an even coverage of sequence space. Certain well-studied organisms such as E. coli have many highly similar sequenced strains. Because 1SPN networks have a node for each nonidentical sequence in the protein set of interest, they may be dense in areas representing sequences from well-studied organisms, but sparse in other areas.
  • Depending on the sequence identity cutoff used to generate a representative network, a single node may represent a very diverse collection of sequences. Thus a single representative node may contain sequences of different species, and even different functions. While such networks may be ideal for examining the relationships between different subgroups within a diverse superfamily, a 1SPN network may be more suited to examining the relationships between smaller, more closely related groups.
  • Network topology, as visualized in Cytoscape, is not exactly reproducible, as most Cytoscape layouts are non-deterministic. Further, addition and/or deletion of sequences in a given group due to routine updates may introduce slight changes in the topology of the associated network.

Files in Network Downloads

Network downloads include multiple files compressed into a .zip file. Unzipping the file generates a folder containing the data. Besides the networks, plots of percent identity and alignment length vs. E-value are provided to help users decide which thresholded networks to examine in more detail. For example, perhaps only networks based on alignments covering at least the length of the domain in common and with >30% sequence identity are of interest.

One Sequence per Node (1SPN) network download:

  • network*.xgmml: A sequence similarity network that can be opened with Cytoscape, an open source platform for network visualization. Each node represents a unique sequence, and edges represent BLAST connections between sequences.
  • fullnetDownloads.README.txt: A description of the files included in the download.
  • node_length_histogram.pdf: A histogram showing the distribution of sequence lengths of all sequences in the network. Note the square root scale.
  • quartile_plots.pdf: Page 1: A plot of alignment length (y-axis) versus –log10(E-value) (x-axis) for all BLAST connections represented by the network. Page 2: A plot of percent identity (y-axis) versus –log10(E-value) (x-axis) for all BLAST connections represented by the network. Page 3: A histogram showing the distribution of –log10(E-value)s in the network. This data is shown in tabular form in quartiles_table.txt.
  • quartiles_table.txt: A table listing the median percent identity, mean alignment length, median alignment length, mean percent identity, and total number of edges in each –log10(E-value) bin. This data is shown in graphical form in quartile_plots.pdf.
  • fullnetAttributes.txt: A table describing the attributes available for nodes and edges in 1SPN networks (also shown below).

Representative network download:

  • repnetwork*.xgmml: A representative node similarity network that can be opened with Cytoscape, an open source platform for network visualization. In representative node networks, nodes represent one or more sequences (where sequences within a node have a specified percentage of sequence similarity, as calculated by CD-HIT), and edges represent BLAST connections between sequences within the connected nodes that have an E-value more significant than 0.01.
  • repnetDownloads.README.txt: A description of the files included in the download.
  • node_length_histogram.pdf: A histogram showing the distribution of sequence lengths of all sequences in the network. Note the square root scale.
  • quartile_plots.pdf: Page 1: A plot of alignment length (y-axis) versus –log10(E-value) (x-axis) for all BLAST connections represented by the network. Page 2: A plot of percent identity (y-axis) versus –log10(E-value) (x-axis) for all BLAST connections represented by the network. Page 3: A histogram showing the distribution of –log10(E-value)s in the network. This data is shown in tabular form in quartiles_table.txt.
  • quartiles_table.txt: A table listing the median percent identity, mean alignment length, median alignment length, mean percent identity, and total number of edges in each –log10(E-value) bin. This data is shown in graphical form in quartile_plots.pdf.
  • repnetAttributes.txt: A table describing the attributes available for nodes and edges in representative node networks (also shown below).
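The kind of summary stored in quartiles_table.txt can be approximated as follows. This is a sketch under assumptions: the bin width of 10 and the field names are invented, since the download description does not state how the –log10(E-value) bins are defined.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical edges; pct_id is a fraction between 0 and 1.
TOY_EDGES = [
    {"neg_log10_e": 12.0, "pct_id": 0.25, "alignment_len": 100},
    {"neg_log10_e": 15.0, "pct_id": 0.75, "alignment_len": 200},
    {"neg_log10_e": 31.0, "pct_id": 0.5, "alignment_len": 300},
]

def summarize_bins(edges, bin_width=10):
    """Group edges into -log10(E) bins and compute per-bin summary statistics."""
    bins = defaultdict(list)
    for edge in edges:
        bin_start = int(edge["neg_log10_e"] // bin_width) * bin_width
        bins[bin_start].append(edge)
    table = []
    for bin_start in sorted(bins):
        members = bins[bin_start]
        identities = [edge["pct_id"] for edge in members]
        lengths = [edge["alignment_len"] for edge in members]
        table.append({
            "bin_start": bin_start,
            "median_pct_id": median(identities),
            "mean_pct_id": mean(identities),
            "median_alignment_len": median(lengths),
            "mean_alignment_len": mean(lengths),
            "edge_count": len(members),
        })
    return table

for row in summarize_bins(TOY_EDGES):
    print(row)
```

Reading such a table from the most significant bin downward shows where percent identity or alignment length falls below a level of interest, which suggests an E-value threshold to apply.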

Node and Edge Attributes

The nodes (proteins) and edges (similarities) in networks from the SFLD are annotated with attributes that can be used for filtering, coloring, etc. in Cytoscape. Individual attributes are described in the tables below for 1SPN and representative networks:

Attributes in One Sequence per Node (1SPN) Networks

Node Attribute Name | Description
id | A unique identifier for the node.
Name | The name of the protein in the SFLD.
gi | A list of all valid NCBI GI numbers for the protein; blank if none.
EFD ID | SFLD unique numerical identifier for the enzyme functional domain (EFD); searching the SFLD by EFD ID is faster than searching by GI number.
NYSGRC target ID | Identifier of the protein in the New York Structural Genomics Research Consortium pipeline; blank if none.
Family | The name of the family of the EFD in the SFLD; “unclassified” if none.
PDB IDs | A list of the Protein Data Bank identifiers of structures with at least 95% sequence identity with the protein represented by the node.
Species | A list of the organisms associated with the protein sequence. There can be multiple species with proteins of the same sequence.
Subgroup | The name of the subgroup of the EFD in the SFLD; “unclassified” if none.
full length | The length of the entire protein in which the enzyme functional domain (EFD) is found.
Domain length | The length of the enzyme functional domain (EFD); may be the same as full length.
functional residues | A list of residues identified as being important for function; blank if none.
Family evidence code | Family assignment evidence code describing how the EFD was assigned to a family in the SFLD; blank if no family assignment.
Microbes Online | Identifier for the protein in the Microbes Online database; blank if none.
The Seed | Identifier for the protein in the SEED database; blank if none.
Superfamily evidence code | Superfamily assignment evidence code describing how the EFD was assigned to a superfamily in the SFLD.
Swissprot | Identifier for the protein in the Swiss-Prot database (the manually curated section of the UniProt Knowledgebase); blank if none.
Swissprot protein name | The name of the protein sequence in the Swiss-Prot database (the manually curated section of the UniProt Knowledgebase); blank if none.
UniProtKB | Identifier for the protein sequence in the UniProt Knowledgebase; blank if none. There can be multiple identifiers for the same sequence.
Taxons | Taxon identifier(s) for the protein sequence in the NCBI Taxonomy database. There can be multiple taxa with proteins of the same sequence.
Type of life | Type(s) of life corresponding to the Taxons (division of life, if available, otherwise domain of life); blank if none.
DNA Source | Sources of DNA for the protein; possible values are MTDF (Macromolecular Therapeutics Development Facility, “in house” NYSGRC) and/or ATCC (American Type Culture Collection) or blank (none).

Edge Attribute Name | Description
alignment_len | The length of the sequence alignment found by BLAST.
% id | The fraction (range 0–1) of the aligned positions that are identical between the two sequences.
-log10(E) | The –log10(E-value), where E-value is the blast2seq equivalent.

Attributes in Representative Networks

Node Attribute Name | Description
Dominant Family name | The SFLD family assignment that applies to the largest number of enzyme functional domains (EFDs) represented by the node.
EFD IDs | A list of SFLD enzyme functional domain (EFD) identifiers in the representative node.
EFD ID (len) | A list of SFLD enzyme functional domain (EFD) identifiers (length of full sequence in parentheses). These identifiers can be used to search the SFLD for specific sequences via the Search by Enzyme page.
Dominant Kingdom | The kingdom that applies to the largest number of sequences represented by the node.
Dominant Species | The species name that applies to the largest number of sequences represented by the node.
Dominant Subgroup name | The SFLD subgroup assignment that applies to the largest number of EFDs represented by the node.
Dominant Type of Life | The type of life that applies to the largest number of sequences represented by the node.
DNA Source | Sources of DNA for the sequences represented by the node; possible values are MTDF (Macromolecular Therapeutics Development Facility, “in house” NYSGRC) and/or ATCC (American Type Culture Collection) or blank (none).
Family evidence code | A list of family assignment evidence codes (which define how an EFD was associated with a family) corresponding to EFDs represented by the node, along with the percentage of sequences within the node each evidence code corresponds to.
Family name | A list of the family names corresponding to EFDs represented by the node, along with the percentage of sequences within the node each family name corresponds to.
Has DNA Source | A value of True indicates that the node represents at least one sequence that has DNA available from the Macromolecular Therapeutics Development Facility (“in house” NYSGRC) or the American Type Culture Collection.
Has experimental Family evidence code | A value of True indicates that the node represents at least one sequence that has an experimental family assignment evidence code (CFM, IES, IGS).
Has FSM Superfamily evidence code | A value of True indicates that the node represents at least one sequence that has a superfamily assignment evidence code of FSM (Founding Superfamily Member).
Has Microbes online | A value of True indicates that the node represents at least one sequence that has information available in the Microbes Online database.
Has PDB | A value of True indicates that the node represents at least one sequence with ≥ 95% ID to a structure in the Protein Data Bank.
Has Swissprot | A value of True indicates that the node represents at least one sequence that is in the Swiss-Prot database (the manually curated section of the UniProt Knowledgebase).
Has The Seed | A value of True indicates that the node represents at least one sequence that has information available in the SEED database.
Kingdom | A list of the kingdoms corresponding to sequences represented by the node, along with the percentage of sequences within the node each kingdom corresponds to.
Microbes online | A list of identifiers from the Microbes Online database corresponding to the sequences represented by the node.
PDB IDs | A list of the Protein Data Bank identifiers corresponding to sequences represented by the node (where the match between PDB and sequence has at least 95% ID over at least 95% of the length of the shorter sequence).
Species | A list of the species names corresponding to sequences represented by the node, along with the percentage of sequences within the node each species name corresponds to.
Subgroup name | A list of the subgroup names corresponding to EFDs represented by the node, along with the percentage of sequences within the node each subgroup name corresponds to.
Superfamily evidence code | A list of superfamily assignment evidence codes (which define how a given EFD was associated with a superfamily) corresponding to EFDs represented by the node, along with the percentage of sequences within the node each evidence code corresponds to.
Swissprot | A list of Swiss-Prot identifiers corresponding to sequences represented by the node. Swiss-Prot is the manually curated section of the UniProt Knowledgebase.
Swissprot protein name | A list of Swiss-Prot protein names corresponding to the sequences represented by the node.
The Seed | A list of identifiers from the SEED database corresponding to the sequences represented by the node.
Type of Life | For each sequence represented by a given node, this field lists the type of life designation (division, if it exists, otherwise domain) corresponding to the taxon ID in the NCBI Taxonomy database, along with the percentage of sequences within the node each type of life designation corresponds to.
UniProtKB | A list of UniProt Knowledgebase identifiers corresponding to sequences represented by the node.
gi | A list of NCBI gi numbers corresponding to the sequences represented by the node.
node size | The number of protein sequences represented by the node.
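The "Dominant" attributes and the percentage lists can be thought of as two views of the same tally. The sketch below shows one way to compute them for a single annotation; the family names are invented, and how the SFLD breaks ties between equally common values is not stated, so the tie behavior here (first value encountered wins) is an assumption.

```python
from collections import Counter

def summarize_annotation(values):
    """Return (dominant_value, {value: percentage}) for one node's annotation list."""
    if not values:
        raise ValueError("a representative node must contain at least one sequence")
    counts = Counter(values)
    total = len(values)
    # most_common keeps first-seen order among equal counts (assumed tie rule).
    dominant = counts.most_common(1)[0][0]
    percentages = {value: 100.0 * count / total for value, count in counts.items()}
    return dominant, percentages

# Hypothetical family assignments for the four sequences in one representative node.
node_families = ["enolase", "enolase", "unclassified", "mandelate racemase"]
dominant, percentages = summarize_annotation(node_families)
print(dominant, percentages)
```

The length of the input list corresponds to the node size attribute, so a node with a low dominant percentage and a large size is a sign that the identity cutoff has merged functionally diverse sequences.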

Edge Attribute Name | Description
rep-net-count | The number of BLAST connections summarized by the representative edge.
rep-net-max | The –log10 of the most significant BLAST E-value connection between any two sequences in the connected nodes.
rep-net-mean | The average –log10 of all of the BLAST E-value connections between all the sequences in the connected nodes.
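To make the three edge attributes concrete, here is a minimal sketch of collapsing sequence-level BLAST connections into representative edges. It is not the Pythoscape implementation; the sequence names, node assignments, and scores are invented, and each score is assumed to already be a –log10(E-value).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical mapping from sequence to the representative node that contains it.
NODE_OF = {"s1": "repA", "s2": "repA", "s3": "repB", "s4": "repB"}

# Hypothetical sequence-level connections: (sequence_a, sequence_b, -log10(E-value)).
SEQUENCE_EDGES = [
    ("s1", "s3", 40.0),
    ("s2", "s4", 20.0),
    ("s1", "s2", 90.0),  # both ends fall in repA, so no representative edge results
]

def collapse_edges(sequence_edges, node_of):
    """Summarize sequence-level scores for every pair of distinct representative nodes."""
    grouped = defaultdict(list)
    for seq_a, seq_b, score in sequence_edges:
        node_a, node_b = node_of[seq_a], node_of[seq_b]
        if node_a == node_b:
            continue
        # Sort the pair so that A-B and B-A are counted as the same edge.
        grouped[tuple(sorted((node_a, node_b)))].append(score)
    return {
        pair: {
            "rep-net-count": len(scores),
            "rep-net-max": max(scores),
            "rep-net-mean": mean(scores),
        }
        for pair, scores in grouped.items()
    }

print(collapse_edges(SEQUENCE_EDGES, NODE_OF))
```

A large gap between rep-net-max and rep-net-mean indicates that only a few sequence pairs are strongly similar, so thresholding on one attribute rather than the other can give noticeably different networks.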