An Introduction to Ezprot:
Object-Oriented Software Tools For
Protein Sequence and Structure Analysis

Frank K. Pettit
James U. Bowie

 

Department of Chemistry and Biochemistry
and
UCLA-DOE Laboratory of Structural Biology and Molecular Medicine
University of California, Los Angeles
Los Angeles, CA 90095

 

 

In computational molecular biology, there is much duplication of effort, with nearly identical functions being written over and over by different researchers. At present there is no set of standard software tools for protein analysis, particularly where structural analysis is concerned. The lack of software standardization reduces productivity and deters the exchange of software between researchers. Moreover, the gap between computational analysis of structure and analysis of sequence appears unbridgeable. Here we describe Ezprot, an object-oriented library of functions and data types for integrated analysis of protein sequence and structure. Ezprot functions are simple and intuitive, and yet they are flexible and powerful, and thus an apt platform from which to construct and exchange creative algorithms. We describe the practical obstacles that have so far prevented researchers from adopting software standardization, and Ezprot's approach to eliminating those obstacles.

 

Introduction

In the field of computational molecular biology, a large number of programs have been written to analyze protein sequence and structure. Examples include programs for surface area calculation [1], sequence alignment [2,3], protein fold prediction [4], and molecular docking and rational drug design [5], along with many others. Each of these programs is intended to perform creative research, and thus each program is necessarily different from the others. However, there is a great deal of duplication of labor in computational molecular biology. Each researcher seems to write his or her own routines for common tasks, such as reading PDB files or finding the atoms that neighbor each atom. As one well-known example, writing a routine to read PDB files requires researchers to become intimately familiar with the PDB format, time they could instead spend on new algorithms.

Duplication of effort reduces productivity, and makes it excessively time-consuming to initiate computational analysis. Moreover, it is difficult for one researcher to read another person's source code, as each program has its own conventions and vocabulary. At present there is no set of standardized software tools to aid researchers, particularly where protein structural analysis is concerned. Function libraries do exist in molecular biology, examples being GCG [6,7] and PROTEUS [8]. However, these routines exist primarily to aid specific tasks with pre-existing algorithms. Creative new algorithms are at present still implemented as independent, self-contained programs.

A recent editorial [9] emphasized the imposing nature of initiating computational research into molecular structure, and described the "cultural gap" between scientists who analyze sequence and those who analyze structure--a gap which only becomes larger as the size of genomic databases becomes overwhelming. The editorial suggested object-oriented programming could be one possible solution, although to date such techniques have not been generally implemented in the field.

Here we describe the Ezprot library, which is intended to fill these gaps. Ezprot is an object-oriented library of data types and functions for integrated sequence and structure analysis. It is intended as a platform for the development of new, untested algorithms.

Before describing the library, we must first ask, why has a standard library not been created previously? Let us take a look at the code in a typical protein analysis program.

       Logical Function ReadPDBFile(FilNum,PDBFile,
                               AtNam,ResNam,Chain,
                               ResNum,X,Y,Z,Rad,
                               TotalAtoms)

(1)

A call to this function is always preceded by complicated array declarations. It is immediately clear why researchers often prefer to write their own code rather than invoke a library function: standardized functions can be clumsy and complex, with a very steep learning curve.

In order to be useful for creative research, a library function must be very flexible, which means it must have many arguments. Thus, the researcher, even if he only needs the data in one or two fields, must still declare all variables and arrays (with the appropriate dimensions) demanded for compatibility with the library function call. Worst of all, the more arguments there are to each function, the more possibilities there are for bugs to creep into the program. On the other hand, if the function call were simpler, with fewer arguments, it would be less flexible and of little use to researchers trying to do creative research. Therefore, functions similar to the one above have been written, debugged, and re-written many times by many people.

The Ezprot library is intended as a solution to all of the above problems. Ezprot functions are simple, elegant and intuitive, and are easy to learn. However, these functions are at the same time flexible and powerful, as they must be for researchers to use them in their own creative research.

Ezprot resolves the dilemma of "flexibility vs. simplicity" through the technique of object-oriented programming [10]. An "object" is a data type defined by the library, analogous to a record in FORTRAN or a struct in C. Like those data structures, an object has a set of fields which hold its data. For example, in Ezprot, the object class called protein has data fields for the protein's PDB code, the letter names of all the amino acid chains that make up the protein, etc. The initialization and dimensioning of all these fields are handled automatically by the class library and the compiler. The researcher who uses Ezprot can access only the data he needs, ignoring the rest.

Simplicity

Suppose a researcher is using Ezprot. This is how he could declare two proteins, and instruct one to read the PDB file for superoxide dismutase:

#include "protein.h" /* Header file declares protein class */
 
protein prot1, prot2;  /* Declare two protein objects */
/* Read PDB file for superoxide dismutase: */
prot1.read_pdb_file( "PDB2SOD.ENT");

(2)

The header file protein.h is part of the Ezprot library, and it declares the protein data type, or class. The programmer using Ezprot would include this header file in his program. Thereafter he may declare specific protein objects, which in the example are called prot1 and prot2. Notice that the declaration of a protein object has exactly the same syntax as the declaration of any variable, integer, floating point, etc. Initially, proteins are empty of data. The function read_pdb_file() is then called on prot1, causing the named PDB file to be read, and virtually all the data in the file is stored in prot1. Dynamic memory allocation is used automatically to allocate exactly enough memory for the atoms and residues in the named PDB file (no need to redimension arrays and recompile). At this point, the programmer is free to access some of the data stored in prot1 (by calling other functions on it), and to ignore the rest of its data. This is the heart of Ezprot's simplicity. Comparing the two function calls from code fragments (1) and (2), we see that the Ezprot version is far simpler and more elegant. Nevertheless, the Ezprot version is in fact even more flexible and powerful than the non-object-oriented version.
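As a rough illustration of the kind of work such a reader hides from the caller, the sketch below parses ATOM records into a container that grows as needed. It is not Ezprot source code: the names AtomRecord, field() and parse_atoms() are hypothetical, and only the fixed column positions of the ATOM record are taken from the PDB format itself.

```cpp
#include <cassert>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type; Ezprot's real atom class is not shown in the text.
struct AtomRecord {
    std::string atom_name;  // columns 13-16
    std::string res_name;   // columns 18-20
    char        chain;      // column 22
    int         res_seq;    // columns 23-26
    double      x, y, z;    // columns 31-38, 39-46, 47-54
};

// Return 1-based columns [first, last] of a fixed-width line, blanks removed.
static std::string field(const std::string &line,
                         std::size_t first, std::size_t last) {
    assert(first >= 1 && first <= last && last <= line.size());
    std::string out;
    for (std::size_t i = first - 1; i < last; ++i)
        if (line[i] != ' ') out += line[i];
    return out;
}

// Parse every ATOM line in `text`. The vector grows on demand, so the
// caller never dimensions an array in advance.
std::vector<AtomRecord> parse_atoms(const std::string &text) {
    std::vector<AtomRecord> atoms;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        // Skip anything that is not a full-width ATOM record.
        if (line.size() < 54 || line.compare(0, 6, "ATOM  ") != 0) continue;
        AtomRecord a;
        a.atom_name = field(line, 13, 16);
        a.res_name  = field(line, 18, 20);
        a.chain     = line[21];
        a.res_seq   = std::atoi(field(line, 23, 26).c_str());
        a.x = std::atof(field(line, 31, 38).c_str());
        a.y = std::atof(field(line, 39, 46).c_str());
        a.z = std::atof(field(line, 47, 54).c_str());
        atoms.push_back(a);
    }
    return atoms;
}
```

A caller passes the text of a PDB file to parse_atoms() and receives exactly as many records as the file contains. A library class can wrap this kind of bookkeeping behind a single member function call.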

Upgradeability

Below we provide several code fragments which represent Ezprot's simple and intuitive functions. However, it must be emphasized that this elegance has a crucial practical benefit: a simple function call with fewer arguments is more resistant to bugs than a clumsy function with many arguments. This resistance to bugs becomes progressively more important as an application becomes larger and more complex.

Most importantly, Ezprot programs are more compatible with future improvements to the library. Consider the non-object-oriented function call from code fragment (1). If we wanted to add new functionality to that routine--reading secondary structure, for example--we can do so only by adding more arguments to the function call. This is a disaster, because all programs dependent on that library function would need to be rewritten with new function calls, new array declarations, etc. In contrast, with the Ezprot version, we can add new fields to the protein class or change the representation of data within the protein fields. That means that the library code must be rewritten by the library designers; but in the program that uses Ezprot, the call to function read_pdb_file() would not change form. Programs that researchers have built from Ezprot will not need rewriting just because we create improved versions of the library in the future.

Readability

At present, in computational molecular biology, it is often difficult for one researcher to read another person's code. The classic example is when one researcher cannot run another person's program without first changing the array dimensions, a common task which might require searching through thousands of lines of code.

Because code built from Ezprot is intuitive and highly readable, it may help to facilitate the exchange of software between molecular biologists, and reduce duplication of effort. We estimate that an experienced C programmer will, with help from the Ezprot documentation, be able to write real programs with the library after about a week of preparation.

Lastly, Ezprot may also be of pedagogical use for instructors who wish to introduce graduate students to computational problems. Students introduced to computational molecular biology normally spend large amounts of time worrying about the details of formatted input/output. With Ezprot, students can begin writing their own programs relatively quickly, focusing on the ideas behind algorithms rather than the details.

The Ezprot library contains functions and classes to perform standard tasks such as reading PDB files, rotating and displacing molecules in space, performing oligomer transformations to generate crystal mates, finding the neighboring atoms adjacent to each atom, calculating surface area, surface curvature, volume, etc. Future updates will add more functionality while maintaining compatibility with previous versions. Below we provide a few illustrative examples to give the reader a feel for how programs based on Ezprot are written.

Typical Program Structure

A programmer using Ezprot can declare objects with types defined in the library, then pass those objects into and out of his own routines. Here is an example of a very simple program that might be written by a programmer using Ezprot. In the example, the programmer uses the syntax protein *pro; (just like C syntax) to declare that the variable pro is a pointer to a protein.

/* SIMPLE PROGRAM WRITTEN BY USER OF EZPROT. */
#include <stdio.h>   /* Declares printf() */
#include "protein.h" /* Library declares protein class */
 
void myfunction( protein *pro);
 
int main()
{
  protein prot;
  prot.read_pdb_file( "PDB6LYZ.ENT");
 
  /* Pass POINTER to protein into function. */
  myfunction( &prot);
  return 0;
}
 
void myfunction( protein *pro)
{
  printf("Name of protein is %s.\n", pro->name());
  printf("Number of chains = %d.\n", pro->num_chains());
}

(3)

If the protein itself were passed as an argument, the whole object would be copied to the function, which would be time-consuming. Instead, a pointer to the protein is passed, for increased performance. This is just like the rule in C for passing arguments. The user's function can then access any data stored in the protein by calling functions from the library such as name() and num_chains().

Thus, in addition to Ezprot library functions being quite simple, the routines written by users of Ezprot are likewise simplified.

 

Background: Classes and Objects

In commercial programming, the dilemma described above--flexibility vs. simplicity--is resolved through the technique of object-oriented programming [10], which is the basis for operating systems with graphical user interfaces (Windows and Macintosh OS). Such operating systems employ standard class libraries. A class is a data type defined by the library (for example, a "window" class, a "scrollbar" class).

The Ezprot library is written in C++ [11], an object-oriented language based on C (in the code fragments displayed in this work, we simplified the C++ code to make it understandable to those who know C). To create a class library requires a detailed knowledge of C++. However, once a class library has been created by someone else, using the classes does not require much knowledge beyond experience with C programming. (The Ezprot documentation is aimed at programmers familiar with C.)

An object--that is, a particular example of a class--is a lot like a record in FORTRAN. The most important difference is that the data in a record or struct can be accessed directly through the fields, but the data in an object is accessed indirectly by calling functions on the object. For example, if a protein were a struct in C, you would access its name through the field prot.name, but if it were an object you would use the function prot.name(). (In the Appendix we provide more detail about how object-oriented libraries are constructed.)

In OOP, classes can be related to each other, and therefore they may share common sets of functions. For the programmer who uses Ezprot, the important point is that there is a high degree of function standardization. The same functions are re-used by many different classes (sometimes with slightly modified functionality, as appropriate). Ezprot functions are consistently named and intuitive. Classes protein, aa_chain, amino_acid, etc., all share functions like num_atoms() and name() which do what you would expect them to do.

Function standardization makes it easy for programmers experienced with C to quickly learn enough about Ezprot to put the classes into use in real programs.

Logical Structure of Protein Objects

Programs that analyze protein sequence usually represent a protein as an array of amino acids. Programs that analyze protein structure usually represent a protein as an array of atoms. In Ezprot, protein objects have a logical structure that can handle sequence or atomic-level detail equally well.

As illustrated in Fig. 1, a protein object can be treated like an array of chains. An aa_chain object can be treated like an array of amino_acid objects. An amino_acid object can be treated like an array of atom objects. Normally, memory for such collections is allocated dynamically and automatically by Ezprot functions. Thus, if the user of Ezprot wants to analyze sequence but not structure, no memory is allocated for atom objects.

Here we print the three-letter code for the type of a single amino acid from superoxide dismutase.

#include "protein.h" /* Header file declares protein class */
protein prot;
int     n_chain, n_aa;
 
/* Read PDB file for superoxide dismutase: */
prot.read_pdb_file( "PDB2SOD.ENT");
 
n_chain = 1;  n_aa = 20;
printf("Residue:%s\n", prot[ n_chain][ n_aa].name());

(4)

A protein is not really an array. When the bracket operator [] is applied to a protein, it invokes an Ezprot library function that returns the chain with the given index. Operator [n_chain] applied to a protein returns the n_chain-th chain, and then operator [n_aa] applied to that chain returns the n_aa-th amino acid. The function name() is applied to the n_aa-th amino acid of the n_chain-th chain, and returns its three-letter code.

Here we access a particular atom:

at_z = prot[ n_chain][ n_aa][ n_at].z();

(5)

The function z() is applied to the n_at-th atom of the n_aa-th amino acid of the n_chain-th chain, and it returns that atom's z coordinate.

Furthermore, if you applied another function, num_elems(), to a protein, it would return a count of how many chains the protein contains; and num_elems() applied to an aa_chain will return a count of how many amino_acids the chain contains, etc. Using functions like num_elems() and operator [] it is possible to loop through all atoms in a protein, which would require three nested loops. The essential point is that proteins, chains, and amino acids all respond to common functions.
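The three nested loops can be sketched as follows. This is a minimal stand-alone illustration, not Ezprot code: the template collection and the toy_ types are hypothetical stand-ins, and only the names operator [] and num_elems() come from the description above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical container that gives every level the same two functions.
template <class Elem>
class collection {
public:
    void add(const Elem &e) { elems_.push_back(e); }
    int  num_elems() const { return static_cast<int>(elems_.size()); }
    // The bracket operator is an ordinary member function that returns
    // the element with the requested index.
    const Elem &operator[](int i) const {
        assert(i >= 0 && i < num_elems());
        return elems_[i];
    }
private:
    std::vector<Elem> elems_;
};

struct toy_atom { double x, y, z; };
typedef collection<toy_atom>       toy_amino_acid;  // array-like set of atoms
typedef collection<toy_amino_acid> toy_chain;       // array-like set of residues
typedef collection<toy_chain>      toy_protein;     // array-like set of chains

// Visit every atom with three nested loops and return the largest z.
double max_z(const toy_protein &prot) {
    double best = -1.0e30;
    for (int c = 0; c < prot.num_elems(); ++c)
        for (int r = 0; r < prot[c].num_elems(); ++r)
            for (int a = 0; a < prot[c][r].num_elems(); ++a)
                if (prot[c][r][a].z > best) best = prot[c][r][a].z;
    return best;
}
```

Because each level answers to the same two functions, the loop body reads the same way whether it is counting chains, residues or atoms.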

In practice, however, the bracket operator and integer counters are almost never used to iterate. Code fragments (4) and (5) are presented only for illustrative purposes, to get the reader used to the idea of a protein as a collection of chains, etc. The preferred method in C to iterate through a collection is with pointers to elements, and this is the convention in C++ as well.

The classes also share standardized, simple functions that initialize and increment pointers to their elements. Below we use the syntax amino_acid *aa; to declare that the variable aa is a pointer to an amino_acid. We can then set it to point to the first amino acid in the first chain:

protein    prot;  /* protein object, not pointer */
aa_chain   *aac;  /* pointer to aa_chain */
amino_acid *aa;   /* pointer to amino_acid */
 
prot.read_pdb_file( "PDB2SOD.ENT");
 
aac = prot.first();  /* aac points to first chain */
aa  = aac->first();  /* aa points to first amino acid */

(6)

When the function first() is applied to a protein, it returns a pointer to the first chain in the protein; when first() is applied to an aa_chain, it returns a pointer to the first amino acid in the chain, etc. The functions to initialize and increment pointers are standardized across classes, thus making it easy to loop over the components.
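A minimal sketch of how such standardized pointer functions could be provided is given below. The template ptr_collection and the toy_ types are hypothetical; first() is described above, while next() and not_at_end() follow the usage in code fragment (7). How Ezprot actually implements them is not stated in the text.

```cpp
#include <cassert>
#include <vector>

// Hypothetical container offering the same three functions at every level.
template <class Elem>
class ptr_collection {
public:
    void add(const Elem &e) { elems_.push_back(e); }
    // Pointer to the first element (valid until another element is added).
    Elem *first() { return elems_.data(); }
    // True while p still points at a stored element.
    bool not_at_end(const Elem *p) const {
        return p != nullptr && p >= elems_.data()
                            && p < elems_.data() + elems_.size();
    }
    // Advance to the following element.
    Elem *next(Elem *p) { assert(not_at_end(p)); return p + 1; }
private:
    std::vector<Elem> elems_;
};

struct toy_residue { int seq_num; };
typedef ptr_collection<toy_residue> toy_chain;
typedef ptr_collection<toy_chain>   toy_protein;

// Count residues in all chains with the same loop idiom at both levels.
int count_residues(toy_protein &prot) {
    int n = 0;
    for (toy_chain *c = prot.first(); prot.not_at_end(c); c = prot.next(c))
        for (toy_residue *r = c->first(); c->not_at_end(r); r = c->next(r))
            ++n;
    return n;
}
```

The outer and inner loops differ only in the object they are applied to, which is the practical benefit of standardizing the function names across classes.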

For readers less familiar with C syntax: in C and C++, the operator "." is applied to an object to access its fields or functions, while operator "->" is applied to a pointer to an object to access the object's fields or functions. Since prot is a protein object, we write prot.first(), but since aac is a pointer to an aa_chain, we write aac->first().

The logical structure of protein objects permits one to access a protein on a residue-by-residue or on an atom-by-atom basis, and to switch back and forth. Ultimately, Ezprot is intended to integrate the analysis of protein structure and sequence.

PDB File Input/Output

We have already described the function read_pdb_file(), which takes a filename as its argument, reads the named PDB file and stores its data in a protein object. This function is quite robust; it reads almost everything in a PDB file--residue types, atomic coordinates, heteroatoms, metal ions, and even multiple sidechain conformations. There is also a function fprint_pdb() that writes a protein to standard output in PDB format. In combination, these functions make it easy to read a protein structure, process it, and output an altered structure.

read_pdb_file() can read different parts of PDB files. By default, it will read the PDB file header, the sequence and the atomic data. But you can control what parts of the PDB file are read by passing in an extra argument, a bitmask. The bitmasks that can be passed are defined as macros in the file protein.h. For example, protein.h defines a bitmask, ProtIO_secstruc_Mask, which will make the function read the secondary structure records in PDB files (and read the sequence, but ignore the atomic data). Below we read the file for lysozyme and print the names and sequence numbers of all residues in alpha helices as identified by the file.

#include "protein.h"  /* Header file declares bitmasks */
protein  prot;
/* POINTERS to amino acids and a.a. chains */
aa_chain   *aac;
amino_acid *aa;
 
/* Read lysozyme's sequence and secondary structure. */
prot.read_pdb_file("PDB6LYZ.ENT", ProtIO_secstruc_Mask);
 
for (aac = prot.first(); prot.not_at_end(aac); 
     aac = prot.next(aac))
  {
    printf("Helix residues in chain %c:\n", aac->name());
 
    for (aa = aac->first(); aac->not_at_end(aa); 
         aa = aac->next(aa))
 
      if (aa->in_helix())  /* Is this residue in helix? */
        printf("%s %d\n", aa->name(), aa->seq_num());
  }

(7)

Function in_helix() tells whether a residue is in a helix. As another example, protein.h defines a bitmask, ProtIO_header_Mask, which will make the function read just the header in PDB files. The bitmasks can also be combined with bitwise operators to make compound commands. For instance, to make read_pdb_file() read both the header information and the secondary structure records, we could combine two bitmasks with the bitwise OR operator "|":

prot.read_pdb_file("PDB6LYZ.ENT", 
                   ProtIO_header_Mask | ProtIO_secstruc_Mask);

 

The output from function fprint_pdb() can be controlled in the same way.
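For readers unfamiliar with bitmasks, the sketch below shows the general mechanism. The Toy_ constants and their values are hypothetical; the real definitions live in protein.h and are not given in the text. Each option occupies one bit, "|" sets several bits at once, and "&" tests whether a given bit is set.

```cpp
#include <cassert>
#include <string>

// Hypothetical flag values, one bit per option.
const unsigned Toy_header_Mask   = 1u << 0;
const unsigned Toy_sequence_Mask = 1u << 1;
const unsigned Toy_atoms_Mask    = 1u << 2;
const unsigned Toy_secstruc_Mask = 1u << 3;
const unsigned Toy_all_Masks     = Toy_header_Mask | Toy_sequence_Mask
                                 | Toy_atoms_Mask  | Toy_secstruc_Mask;

// Report which sections a reader would process for a given mask.
std::string sections_for(unsigned mask) {
    assert((mask & ~Toy_all_Masks) == 0);  // reject unknown bits
    std::string out;
    if (mask & Toy_header_Mask)   out += "header ";
    if (mask & Toy_sequence_Mask) out += "sequence ";
    if (mask & Toy_atoms_Mask)    out += "atoms ";
    if (mask & Toy_secstruc_Mask) out += "secstruc ";
    return out;
}
```

One integer argument can therefore carry any combination of options, so new options can be added later without changing the form of the function call.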

Automated Oligomer Transformations

As an example of sophisticated functions that can be performed elegantly with Ezprot, we discuss how to perform oligomer transformations on a protein object. Nearly half of all structures in Protein Data Bank files require some transformation to give them the same form as the protein in vivo. Often, for a protein that is a dimer in vivo, the file will give the coordinates of just one chain, so a duplicate chain should be created, rotated and displaced. Even more often, multiple copies of a monomer will appear in the crystal asymmetric unit, and one or more chains should be deleted to make the monomer. Missing or incorrect oligomer interfaces can introduce errors into some kinds of analyses.

The quaternary structure of a protein in vivo is sometimes mentioned in the comment section of PDB file headers, but this is in a non-standardized format, and anyway many files lack any mention of the true quaternary structure. We have therefore introduced the following convention: for each PDB file, the transformation necessary to make the in vivo structure is stored in an auxiliary file with a standard format, called an Oligomer-Generating Matrix (OGM) file. For example, the PDB file for enolase is called PDB7ENL.ENT and contains one chain, but enolase is a dimer. The OGM file for enolase is called 7ENL.OGM and contains one rotation matrix and a displacement vector, and looks like this:

# This is the OGM file 7enl.ogm (ENOLASE).
    0.0  -1.0   0.0      124.1
   -1.0   0.0   0.0      124.1
    0.0   0.0  -1.0       66.9
 

Next, superoxide dismutase is a dimer in vivo, but the file PDB2SOD.ENT contains four chains. The file 2SOD.OGM will delete two chains:


# This is the OGM file 2sod.ogm (SUPEROXIDE DISMUTASE)
# This means "Delete chains B and G"
~B
~G

 

Such files must unfortunately be created by hand, but the Ezprot library includes more than 100 OGM files for popular proteins. We find this configuration to be especially convenient when analyzing a large database of many proteins.
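To make the rotation-plus-displacement concrete, the sketch below applies the enolase operation to a single point as p' = R p + t. It assumes that each row of the OGM file holds three matrix entries followed by one displacement component; that reading, and the names point3, rigid_op, apply() and enolase_op(), are assumptions made for this illustration and not part of Ezprot.

```cpp
#include <cassert>
#include <cmath>

struct point3 { double x, y, z; };

// One rigid-body operation: rotation matrix r and displacement t.
struct rigid_op {
    double r[3][3];
    double t[3];
};

// Determinant of the 3x3 matrix; a proper rotation has determinant +1.
double determinant(const rigid_op &op) {
    const double (*m)[3] = op.r;
    return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
         - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
         + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
}

// Return R p + t.
point3 apply(const rigid_op &op, const point3 &p) {
    assert(std::fabs(determinant(op) - 1.0) < 1e-6);
    const double in[3] = { p.x, p.y, p.z };
    double out[3];
    for (int i = 0; i < 3; ++i) {
        out[i] = op.t[i];
        for (int j = 0; j < 3; ++j) out[i] += op.r[i][j] * in[j];
    }
    point3 q = { out[0], out[1], out[2] };
    return q;
}

// The enolase values, under the row layout assumed above.
rigid_op enolase_op() {
    rigid_op op = {
        { {  0.0, -1.0,  0.0 },
          { -1.0,  0.0,  0.0 },
          {  0.0,  0.0, -1.0 } },
        { 124.1, 124.1, 66.9 }
    };
    return op;
}
```

Under this reading, applying the operation twice returns every point to its starting position, which is what one expects of the twofold symmetry relating the two chains of a dimer.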

An oligomer transformation is clearly a complicated thing, involving one or more matrices, displacement vectors, or instructions on chain deletion. There are many good reasons why, inside the program, it is better to represent an oligomer transformation as a class, rather than as a set of complicated arrays passed as arguments to functions. The Ezprot library defines a class called olig_transfrm whose purpose is to read OGM files, store the data and transform protein objects. By defining a transformation as an object, we can also keep a transformation around and use it later on other objects besides proteins. The code fragment below will read a PDB file, read the corresponding OGM file, transform the protein, and output the transformed protein in PDB format.

/* GENERATE OLIGOMER FOR SUPEROXIDE DISMUTASE, 
   OUTPUT OLIGOMER IN PDB FORMAT.  */
 
#include "protein.h"         /* Declares protein */
#include "olig_transfrm.h"   /* Declares olig_transfrm */
 
protein       prot;    /* protein object */
olig_transfrm ogm;     /* oligomer transform object */
 
/* Read PDB file for superoxide dismutase: */
prot.read_pdb_file( "PDB2SOD.ENT");
 
/* Read OGM file for superoxide dismutase: */
ogm.read_ogm_file( "2sod.ogm");
 
/* TRANSFORM TO FULL OLIGOMER */
prot.transform( ogm);
 
/* Output full oligomer in PDB format to standard out */
prot.fprint_pdb();

(8)

The function fprint_pdb() writes the protein object to standard output in normal PDB format. Notice that the function prot.transform(ogm) may involve the creation or deletion of chains. Memory for chains is allocated and deallocated automatically by Ezprot. The equivalent transformation functions in a non-OOP application would require more than a dozen arguments with appropriate array dimensioning, and would still not be as flexible as Ezprot functions.

Analyzing Subparts of Proteins

Most Ezprot functions can be customized to analyze parts of a protein, rather than the whole molecule. As an example, we first calculate the total accessible surface area of an object prot, assuming it has already been initialized:

float prot_area;
prot_area = total_surface_area( prot, "accessible");

(9)

Other Ezprot functions would return the individual areas of all atoms stored in an array, but function total_surface_area() by default returns the whole surface area of the protein. However, total_surface_area() can selectively exclude atoms from analysis if the atoms are "hidden". Classes like protein, aa_chain, amino_acid, etc., have a function hide() that marks their constituent atoms as hidden. The effect is reversed by the function show().

In the code fragment below, we calculate the area of the tripeptide from residue number 37 to 39 of the first chain of lysozyme--just the tripeptide in isolation, as if the rest of the protein structure did not exist:

 
#include "protein.h"    /* Declares protein class */
#include "ezp_area.h"   /* Declares area functions */
 
protein prot("PDB6LYZ.ENT"); /* Initialize & read file*/
float   tripep_area;
 
prot.hide();              /* Hide all atoms */
prot[0]( 37, 3).show();  /* Show atoms in tripeptide */
 
tripep_area = total_surface_area( prot, "accessible");

(10)

We used a different syntax for declaring the protein, in which the object is initialized and the PDB file is read in a single step. All atoms in prot are then hidden. Operator [0] applied to the protein returns the first (zeroth) chain, then the parentheses operator (37,3) extracts from that chain a substring consisting of the three residues from 37 to 39. Their atoms are shown while all other atoms remain hidden. Thus, function total_surface_area() returns the area of the tripeptide structure in isolation.

Ezprot functions can be customized to analyze any subpart of a protein, but this additional flexibility is achieved without adding extra arguments to function calls.
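The idea of selecting a subpart through hidden flags, rather than through extra arguments, can be sketched with a toy class. Everything here is hypothetical: toy_chain stores one fixed number per residue and simply sums the visible ones, whereas a real surface area calculation depends on which neighboring atoms are present. In this sketch show() takes the starting index and count directly, unlike the Ezprot syntax in fragment (10).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy chain: one stored value per residue plus a hidden flag.
class toy_chain {
public:
    void add_residue(double area) {
        area_.push_back(area);
        hidden_.push_back(false);
    }
    // Mark every residue as hidden.
    void hide() { hidden_.assign(hidden_.size(), true); }
    // Un-hide `count` residues starting at index `first`.
    void show(std::size_t first, std::size_t count) {
        assert(first + count <= hidden_.size());
        for (std::size_t i = first; i < first + count; ++i)
            hidden_[i] = false;
    }
    // Sum the stored values of the residues that are not hidden.
    double visible_total() const {
        double sum = 0.0;
        for (std::size_t i = 0; i < area_.size(); ++i)
            if (!hidden_[i]) sum += area_[i];
        return sum;
    }
private:
    std::vector<double> area_;
    std::vector<bool>   hidden_;
};
```

The analysis function visible_total() takes no selection arguments at all; the selection travels with the object, so the same call serves the whole chain or any subset of it.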

Ezprot Utility Programs

The Ezprot library includes a number of small programs that perform common tasks. Examples are generating full oligomers from PDB file structures, deleting hydrogen atoms or heteroatoms, identifying oligomeric interfaces or the structural epitope of a substrate, calculating interface areas, etc.

By default, these programs read from standard input and write to standard output, often in PDB format. In an operating system like UNIX, the output of one utility can be piped into another utility with "|". Here we generate the in vivo oligomeric structure for superoxide dismutase, strip off heteroatoms and hydrogen atoms, and calculate the area buried at the oligomeric interface:

gen_olig -pdb pdb2sod.ent -ogm 2sod.ogm | striphets  \
   | stripatoms -elem "H" | oligburial

 

The utilities are bundled with Ezprot. The source code is provided to serve as simple examples for those learning to write programs with the library.

Web Site Available

The library includes many functions not mentioned in this paper. But the current version (v.1.0) also lacks many functions that might be needed, such as routines for identifying hydrogen bonds. We have established a site on the World Wide Web for distributing the library. Moreover, through this Web site other researchers can make suggestions as to which routines are most needed by Ezprot, and can contribute source code. We can translate submitted code to C++, or write original routines, and incorporate the new code into future versions of the library. This can be done while maintaining compatibility with previous versions. However, we emphasize that the essential infrastructure is already in place for creative research that integrates the analysis of sequence and structure.

Conclusion

Ezprot classes and functions are very elegant and intuitive, but the benefits of elegance are more than aesthetic. Elegant functions are more immune to bugs. Non-OOP function calls often consist of long strings of arguments. As non-OOP programs become larger and more complicated, argument lists get longer; and in FORTRAN, common blocks likewise tend to grow. This makes it easy to switch two variables or to dimension an array incorrectly. Beyond a certain point, large non-OOP applications become unmanageable, which is why commercial operating systems use OOP. In our experience with Ezprot, large and complex applications are much more resistant to bugs.

Programs using Ezprot are highly readable. With clutter out of the way, one can read another person's code and immediately recognize the thinking behind his or her algorithm. Thus, Ezprot can facilitate the exchange of software between molecular biologists.

The object-oriented basis of Ezprot also has disadvantages that may make the library inappropriate for some purposes. Ezprot represents atoms by the coordinates of their centers and their van der Waals radii. The library has no representation for electron density, and thus it would be of limited value to crystallographers trying to solve structures (although in principle it could be used for model building). We have carefully designed the library for optimum performance, with in-line functions and other tricks; however, some older C++ compilers have a reputation for producing slow executables. Ezprot would therefore be inappropriate for very computationally intensive tasks, like molecular dynamics. (For our research we have found Ezprot's performance to be reasonable. It is easy to write small test programs in Ezprot to compare the performance on your system.)

We expect Ezprot to be most useful for protein analysis objectives such as protein fold prediction, sequence alignment, surface area calculation, molecular docking and rational drug design, and similar tasks. Ezprot is a flexible platform from which a variety of algorithms can be built.

To sum up, the virtues of Ezprot are that it is intuitive and easy to learn, it is flexible and powerful, it enhances immunity to bugs, it guarantees compatibility with future updates of the library, and it facilitates the exchange of source code between researchers. The library can serve as a standard platform for the rapid development of new algorithms for analysis of protein sequence and structure. Ezprot is built to grow.

Appendix: Object-Oriented Programming

As explained above, a class is a data type defined by the library, and an object is a specific example (instance) of a class. The difference between an object and an ordinary FORTRAN record is this: a programmer can directly access the data fields in a record; but the programmer who uses a class cannot access its fields directly. Instead, for each class there is a set of functions called "member functions," which come with the library, and which are the only functions allowed to directly access that class's internal fields. Only the protein class's member functions can access the protein class's fields; only the amino_acid class's member functions can access the amino_acid class's fields, etc. Non-member routines have to use member functions to get at the class's data.

The library includes all member functions that the classes need. Most member functions are very simple, but for more advanced tasks they can be complicated. Programmers using the library write normal (non-member) routines that can manipulate objects, call member functions, and build more complicated algorithms from these elements.

In C++, regular function calls look like function calls in C. Member functions, however, are applied to an object with the operators "." or "->", which are the same as the operators used in C to access the fields of a struct. In code fragment (1), the function read_pdb_file() is a member function of class protein. The member function is applied to object prot1, and it can directly store or retrieve data in prot1's fields. The program that uses the class, however, cannot store data in the fields directly.
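The calling conventions described above can be sketched with a toy class. The class toy_protein, its name field, and the functions set_name(), name(), describe(), and demo_member_calls() are hypothetical stand-ins invented for this illustration; they are not part of Ezprot.

```cpp
#include <string>

// Hypothetical stand-in for a library class; NOT the real Ezprot protein class.
class toy_protein {
public:
    // Member functions may read and write the private field directly.
    void set_name(const std::string& name) { name_ = name; }
    const std::string& name() const { return name_; }
private:
    std::string name_;   // hidden from all code outside the class
};

// Ordinary (non-member) function: called exactly as in C.
std::string describe(const toy_protein& p) {
    return "protein " + p.name();    // must go through a member function
}

// Applies member functions with "." on an object and "->" on a pointer.
std::string demo_member_calls() {
    toy_protein prot1;
    prot1.set_name("first");         // "." applies a member function to an object
    toy_protein* ptr = &prot1;
    ptr->set_name("second");         // "->" does the same through a pointer
    // prot1.name_ = "third";        // would not compile: name_ is private
    return describe(prot1);
}
```

The commented-out assignment is the kind of direct field access that the compiler rejects for code outside the class.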

This rule may appear excessively strict. The motivation for the rule is that, when a class or library is improved, the way the data is represented usually changes, with new data types or different fields. If every function in every program that uses a class were permitted to access its fields, then, when the fields are improved, every program dependent on the fields would need to be rewritten. On the other hand, consider programs that only access fields indirectly by calling member functions. When the class is improved and the internal representation of the data changes, only the member functions of that class must be rewritten by the library designers. Old programs that used the library do not need to be rewritten, because they call member functions, which add an extra layer of protection.

As an example, consider the Year 2000 problem caused by the huge number of programs that represented dates as a set of data fields. When the data needs to be represented in a new way--when a year must be represented as four digits instead of two--all programs dependent on a two-digit year field must be rewritten. But if instead dates had been coded as objects, only the date class' member functions would need rewriting, while the programs using the date class would still be compatible with a new library. Such a problem could be reliably fixed in half an hour instead of half a decade.
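The date example can be sketched as follows. Both date classes and the function years_between() are hypothetical and exist only for this illustration. The two versions are placed in namespaces v1 and v2, and the client is written as a template, purely so that the same client source can be compiled against both versions in one file; in practice the client would simply be rebuilt against the updated library.

```cpp
namespace v1 {
// Original design: stores only the last two digits of the year.
class date {
public:
    date(int year, int month, int day)
        : yy_(year % 100), month_(month), day_(day) {}
    int year() const { return 1900 + yy_; }   // silently assumes the 1900s
    int month() const { return month_; }
    int day() const { return day_; }
private:
    int yy_, month_, day_;
};
}  // namespace v1

namespace v2 {
// Revised design: stores the full year. Only the class internals changed;
// the member function names and signatures are the same as in v1.
class date {
public:
    date(int year, int month, int day)
        : year_(year), month_(month), day_(day) {}
    int year() const { return year_; }
    int month() const { return month_; }
    int day() const { return day_; }
private:
    int year_, month_, day_;
};
}  // namespace v2

// Client code: touches dates only through member functions,
// so its source is identical for both versions of the class.
template <class Date>
int years_between(const Date& earlier, const Date& later) {
    return later.year() - earlier.year();
}
```

With v1, the interval from 31 December 1999 to 1 January 2000 comes out as -99 years; with v2 it comes out as 1 year, and years_between() itself is untouched.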

Some data structures are best left as ordinary structs or records, while many others are better designed as classes. Even if a programmer does not have a religious devotion to the orthodoxy of OOP, a pragmatic attitude toward object-oriented programming may still lead to large benefits. In practice, programs using OOP are subjected to many small, gradual improvements which have a large cumulative effect. Non-OOP programs are subject to inertia: there is a general resistance to making improvements to routines even when they are needed, because improvements necessitate rewriting all dependent programs.

One might wonder if so many function calls in OOP imply a loss of efficiency. If a member function is very simple and short, the designers of a library can improve performance by declaring the function to be "inline". Put very simply, such functions are like macros in C, except that their arguments are type-checked like those of any other function. The compiler can expand calls to the functions in place, thus avoiding the overhead of regular function calls. Generally, the designers of a class have the responsibility of optimizing performance by declaring simple functions to be inline, while the users of the class write and invoke functions in the normal way.
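A minimal sketch of inline member functions follows. The class point3 and the function distance_squared() are hypothetical names chosen for this illustration and are not taken from Ezprot. In C++, a member function defined inside the class body is implicitly inline; one defined outside the body can be marked with the inline keyword. In either case the compiler treats this as a request and decides for itself whether to expand a call in place.

```cpp
// Hypothetical 3-D point class, used only to illustrate inline accessors.
class point3 {
public:
    point3(double x, double y, double z) : x_(x), y_(y), z_(z) {}
    double x() const { return x_; }   // defined in the class body: implicitly inline
    double y() const;                 // declared here, defined inline below
    double z() const;
private:
    double x_, y_, z_;
};

// Defined outside the class body, so the inline keyword is written explicitly.
inline double point3::y() const { return y_; }
inline double point3::z() const { return z_; }

// User code calls the accessors in the normal way; whether the calls are
// expanded in place is left to the compiler.
double distance_squared(const point3& a, const point3& b) {
    const double dx = a.x() - b.x();
    const double dy = a.y() - b.y();
    const double dz = a.z() - b.z();
    return dx * dx + dy * dy + dz * dz;
}
```

The user of the class writes a.x() whether or not the accessor is inline, so the library designers can change that decision later without affecting user code.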

In OOP, member functions can be re-used in multiple classes. This makes it easier for the library designers, who can re-use the same code; and this also makes it easier for people using the library, who have fewer functions to learn about. The most common relationship between classes is called "class derivation" [10]. Member functions and data of one class, called the "base class", are implicitly also member functions and data in any classes derived from it, the "derived classes". An example is illustrated in Fig. 2. The amino_acid class and the hetgroup class (which represents compounds of heteroatoms) are siblings. Both have a function seq_num() that returns their sequence number within a chain. They share some functions because both classes are derived from the same base class, called compound. amino_acid and hetgroup picked up function seq_num() from the class compound. In addition to the inherited functions, each derived class can also add new functions of its own. So the amino_acid class adds functions, like in_helix(), which tells whether the residue is in a helix. Likewise, class hetgroup adds a function, is_water(), which indicates whether a hetgroup is a water molecule. Large OOP applications are built by adding new functions layer by layer onto derived classes.
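The derivation described above can be sketched in a few lines. The class and member function names follow the text, but the constructors, the private fields, and the helper total_atoms() are assumptions made for this illustration; the sketch is wrapped in a namespace called sketch to make clear that it is not the Ezprot implementation.

```cpp
namespace sketch {

// Base class: functions shared by every kind of compound.
class compound {
public:
    compound(int seq, int atoms) : seq_num_(seq), num_atoms_(atoms) {}
    int seq_num() const { return seq_num_; }     // sequence number within a chain
    int num_atoms() const { return num_atoms_; }
private:
    int seq_num_;
    int num_atoms_;
};

// Derived class: inherits seq_num() and num_atoms(), adds in_helix().
class amino_acid : public compound {
public:
    amino_acid(int seq, int atoms, bool helix)
        : compound(seq, atoms), in_helix_(helix) {}
    bool in_helix() const { return in_helix_; }
private:
    bool in_helix_;
};

// Sibling derived class: inherits the same functions, adds is_water().
class hetgroup : public compound {
public:
    hetgroup(int seq, int atoms, bool water)
        : compound(seq, atoms), is_water_(water) {}
    bool is_water() const { return is_water_; }
private:
    bool is_water_;
};

// A routine written against the base class accepts any derived object.
int total_atoms(const compound& a, const compound& b) {
    return a.num_atoms() + b.num_atoms();
}

}  // namespace sketch
```

Because total_atoms() is written against compound, it accepts an amino_acid and a hetgroup alike, which is the sense in which inherited functions are shared between sibling classes.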

These details about class derivation are important for the programmer who writes the class library, but they are not very important for the programmer who uses the library. For the user of Ezprot, the main point is that there is a high degree of function standardization. Often, the same functions are used by different classes, thus making them easier to remember without consulting the documentation.

References:

1. Connolly, M. Analytical molecular surface calculation. J. Appl. Crystallogr. 16, 548 (1983).

2. Smith, T.F. & Waterman, M.S. Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197 (1981).

3. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. Basic Local Alignment Search Tool. J. Mol. Biol. 215, 403-410 (1990).

4. Bowie, J.U., Luthy, R. & Eisenberg, D.S. A method to identify protein sequences that fold into a known three-dimensional structure. Science 253 (5016), 164-170 (1991).

5. Kuntz, I.D., Blaney, J.M., Oatley, S.J., Langridge, R. & Ferrin, T.E. A geometric approach to macromolecule-ligand interactions. J. Mol. Biol. 161, 269-288 (1982).

6. Devereux, J., Haeberli, P. & Smithies, O. A comprehensive set of sequence analysis programs for the VAX. Nucl. Acids Res. 12 (1), 387-395 (1984).

7. Devereux, J. The GCG procedure library, version 6.0. Genetics Computer Group, Inc., University Research Park, 575 Science Drive, Suite B, Madison, Wisconsin, 53711 (1989).

8. Pabo, C.O. & Suchanek, E.G. Computer-aided model-building strategies for protein design. Biochem. 25 (20), 5987-5991 (1986).

9. Anonymous. Structure and the genome. Nature Struc. Biol. 4 (5), 329-330 (1997).

10. Holub, A.I. C + C++: Programming with objects in C and C++ (McGraw-Hill, New York; 1992).

11. Ellis, M.A. & Stroustrup, B. The Annotated C++ Reference Manual (Addison-Wesley, New York; 1994).

Features and Capabilities

Input/Output PDB
    Read PDB file header, atomic data, secondary structure, and SITE records
    Output in PDB format

Protein Transformations
    Rotations and displacements in space
    Oligomer transformation with new chain generation

Area and Volume
    Calculate surface area (accessible, Van der Waals, contact)
    Total volume
    Read output from Connolly's MSP program

Oligomer Interfaces
    Identify oligomeric interface; calculate interface area

Active Sites
    Identify structural epitope near substrate
    Identify active site from PDB file SITE records

Characterize Surface
    Calculate surface curvature
    Identify concavities
    Calculate surface roughness (fractal dimension)

Analyze Locally-Defined Quantities
    Find neighbor atoms near each atom
    Identify local clusters where some quantity is maximized
    Smooth a local quantity over atomic neighborhoods

Figure Captions:

Figure 1. The logical structure of protein objects. A protein object can be treated like an array of chains. An aa_chain object can be treated like an array of amino_acid objects. An amino_acid can be treated like an array of atoms.

Figure 2. An example of class derivation. The amino_acid class and the hetgroup class (which represents compounds of heteroatoms) are both derived from the class called compound. A compound object has a member function seq_num(), which returns the compound's sequence number within a chain. That function and all member functions of class compound are implicitly also member functions of classes amino_acid and hetgroup. In addition, each derived class can also add new functions of its own. The amino_acid class adds functions, like in_helix(), which tells whether the residue is in a helix. Likewise, class hetgroup adds a function, is_water(), which indicates whether a hetgroup is a water molecule.

 

 

 

 

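[Figure 2 diagram]

compound (base class)
    Functions: seq_num(), num_atoms(),...

amino_acid (derived from compound)
    Inherited: seq_num(), num_atoms(),...
    Added: in_helix(), code(),...

hetgroup (derived from compound)
    Inherited: seq_num(), num_atoms(),...
    Added: is_water(),...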