I AM AJ: 2011

Tuesday, January 25, 2011

EXCEL

Introduction

Linear Regression

In statistics, linear regression is an approach to modeling the relationship between a scalar variable y and one or more variables denoted X. In linear regression, data are modeled using linear functions, and the unknown model parameters are estimated from the data. Such models are called linear models. Most commonly, linear regression refers to a model in which the conditional mean of y given the value of X is an affine function of X. Less commonly, linear regression could refer to a model in which the median, or some other quantile of the conditional distribution of y given X, is expressed as a linear function of X. Like all forms of regression analysis, linear regression focuses on the conditional probability distribution of y given X, rather than on the joint probability distribution of y and X, which is the domain of multivariate analysis.

Linear regression was the first type of regression analysis to be studied rigorously, and to be used extensively in practical applications. This is because models which depend linearly on their unknown parameters are easier to fit than models which are non-linearly related to their parameters and because the statistical properties of the resulting estimators are easier to determine.

Linear regression has many practical uses. Most applications of linear regression fall into one of the following two broad categories:

If the goal is prediction, or forecasting, linear regression can be used to fit a predictive model to an observed data set of y and X values. After developing such a model, if an additional value of X is then given without its accompanying value of y, the fitted model can be used to make a prediction of the value of y.
Given a variable y and a number of variables X₁, ..., X_p that may be related to y, linear regression analysis can be applied to quantify the strength of the relationship between y and the X_j, to assess which X_j may have no relationship with y at all, and to identify which subsets of the X_j contain redundant information about y, so that once one of them is known, the others are no longer informative.

Linear regression models are often fitted using the least squares approach, but they may also be fitted in other ways, such as by minimizing the “lack of fit” in some other norm, or by minimizing a penalized version of the least squares loss function as in ridge regression. Conversely, the least squares approach can be used to fit models that are not linear models. Thus, while the terms “least squares” and linear model are closely linked, they are not synonymous.
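As a rough illustration of the least squares approach for a single predictor, the sketch below fits y = intercept + slope * x using the standard closed-form formulas and then predicts y for a new x. The function names and the sample data are made up for this example and do not come from any particular package.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, x):
    """Predict y for a new x from a fitted (intercept, slope) pair."""
    intercept, slope = model
    return intercept + slope * x


# Illustrative, invented observations.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
model = fit_line(xs, ys)
print("intercept, slope:", model)
print("prediction at x = 6:", predict(model, 6.0))
```

This covers only the prediction use case with one X variable; several predictors would need the matrix form of the same idea.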

Quadratic or Polynomial Regression

In statistics, polynomial regression is a form of linear regression in which the relationship between the independent variable x and the dependent variable y is modeled as an nth order polynomial. Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y|x), and has been used to describe nonlinear phenomena such as the growth rate of tissues^[1], the distribution of carbon isotopes in lake sediments^[2], and the progression of disease epidemics^[3]. Although polynomial regression fits a nonlinear model to the data, as a statistical estimation problem it is linear, in the sense that the regression function E(y|x) is linear in the unknown parameters that are estimated from the data. For this reason, polynomial regression is considered to be a special case of multiple linear regression.

Polynomial regression models are usually fit using the method of least squares. The least-squares method minimizes the variance of the unbiased estimators of the coefficients, under the conditions of the Gauss–Markov theorem. The least-squares method was published in 1805 by Legendre and in 1809 by Gauss. The first design of an experiment for polynomial regression appeared in an 1815 paper of Gergonne^[4]^[5]. In the twentieth century, polynomial regression played an important role in the development of regression analysis, with a greater emphasis on issues of design and inference^[6]. More recently, the use of polynomial models has been complemented by other methods, with non-polynomial models having advantages for some classes of problems.
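To make the point that polynomial regression is linear in its parameters more concrete, here is a minimal sketch that fits a polynomial by solving the normal equations (XᵀX)b = Xᵀy, where the columns of X are 1, x, x², and so on. The helper names are hypothetical, and solving the normal equations directly is chosen only for brevity; it can be numerically poor for high degrees.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; need more distinct x values")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients [b0, b1, ..., b_degree] of sum(b_k * x**k)."""
    size = degree + 1
    # Entry (i, j) of X^T X is the sum of x**(i + j) over the data.
    xtx = [[sum(x ** (i + j) for x in xs) for j in range(size)]
           for i in range(size)]
    # Entry i of X^T y is the sum of x**i * y over the data.
    xty = [sum((x ** i) * y for x, y in zip(xs, ys)) for i in range(size)]
    return solve_linear_system(xtx, xty)


# Invented data lying near a quadratic curve.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [9.2, 1.9, 1.1, 5.8, 17.1, 33.9]
print("quadratic coefficients:", fit_polynomial(xs, ys, 2))
```

The unknowns b0, b1, b2 enter only linearly, which is why the same machinery as multiple linear regression applies even though the fitted curve is not a straight line.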

For more information, visit: http://en.wikipedia.org/wiki/Main_Page

Tuesday, January 11, 2011

SMILES

Who Made SMILES?

Daylight provides enterprise-level cheminformatics software technologies to life science companies. Our superior chemistry, high performance, and open architecture have earned Daylight a reputation for delivering the state-of-the-art in chemical information processing since 1987.

Daylight Chemical Information Systems, Inc. is a privately held company with corporate offices in Aliso Viejo, CA and research offices in Santa Fe, NM and Cambridge, England.

What is SMILES?
SMILES
Simplified Molecular Input Line Entry System

SMILES^TM is a simple yet comprehensive chemical language in which molecules and reactions can be specified using ASCII characters representing atom and bond symbols. SMILES^TM contains the same information as is found in an extended connection table but with several advantages. A SMILES^TM string is human understandable, very compact, and if canonicalized represents a unique string that can be used as a universal identifier for a specific chemical structure. In addition, a chemically correct and comprehensible depiction can be made from any SMILES^TM string symbolizing either a molecule or reaction.

SMILES^TM development was initiated by David Weininger in the late 1980s using the concept of a graph with nodes as atoms and edges as bonds to represent a molecule. Parentheses are used to indicate branching points and numeric labels designate ring connection points. The basic SMILES^TM grammar also includes isotopic information, configuration about double bonds, and chirality, leading to what is known as isomeric SMILES^TM.

Acknowledgments

Development of SMILES was initiated by the author, David Weininger, at the Environmental Research Laboratory, U.S.E.P.A., Duluth, MN; the design was completed at Pomona College in Claremont, CA. It was embodied in the Daylight Toolkit with the assistance of Cedar River Software.

Introduction

SMILES (Simplified Molecular Input Line Entry System) is a line notation (a typographical method using printable characters) for entering and representing molecules and reactions. Some examples are:

SMILES contains the same information as might be found in an extended connection table. The primary reason SMILES is more useful than a connection table is that it is a linguistic construct, rather than a computer data structure. SMILES is a true language, albeit with a simple vocabulary (atom and bond symbols) and only a few grammar rules. SMILES representations of structure can in turn be used as "words" in the vocabulary of other languages designed for storage of chemical information (information about chemicals) and chemical intelligence (information about chemistry).
Part of the power of SMILES is that unique SMILES exist. With standard SMILES, the name of a molecule is synonymous with its structure; with unique SMILES, the name is universal. Anyone in the world who uses unique SMILES to name a molecule will choose the exact same name.
One other important property of SMILES is that it is quite compact compared to most other methods of representing structure. A typical SMILES will take 50% to 70% less space than an equivalent connection table, even binary connection tables. For example, a database of 23,137 structures, with an average of 20 atoms per structure, uses only 1.6 bytes per atom when represented with SMILES. In addition, ordinary compression of SMILES is extremely effective. The same database cited above was reduced to 27% of its original size by Ziv-Lempel compression (i.e. 0.42 bytes per atom).
These properties open many doors to the chemical information programmer. Examples of uses for SMILES are:

Keys for database access
Mechanism for researchers to exchange chemical information
Entry system for chemical data
Part of languages for artificial intelligence or expert systems in chemistry

The rest of this chapter is a concise exposition of the SMILES encoding rules. For further information, the reader is referred to "SMILES 1. Introduction and Encoding Rules", Weininger, D., J. Chem. Inf. Comput. Sci. 1988, 28, 31.

Branches

Branches are specified by enclosing them in parentheses, and can be nested or stacked. In all cases, the implicit connection to a parenthesized expression (a "branch") is to the left. Examples are:
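The branching rule can be sketched with a small stack-based reader. This is only an illustration for a tiny subset of SMILES (unbracketed organic atoms, the bond symbols -, = and #, and parentheses); it is not the Daylight parser, and the function name is made up. The two sample strings are the usual SMILES for triethylamine and isobutyric acid.

```python
def parse_branched_smiles(smiles):
    """Return (atoms, bonds) for a SMILES subset with branches but no rings."""
    atoms = []        # element symbol of each atom, in order of appearance
    bonds = []        # (index_a, index_b, bond_symbol)
    stack = []        # atoms to return to when a branch closes
    previous = None   # atom that the next atom will be bonded to
    pending = "-"     # bond symbol for the next bond (single unless stated)
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "(":
            if previous is None:
                raise ValueError("branch opened before any atom")
            stack.append(previous)       # remember the atom left of "("
            i += 1
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')'")
            previous = stack.pop()       # reconnect to the atom left of "("
            i += 1
        elif ch in "-=#":
            pending = ch
            i += 1
        else:
            if smiles[i:i + 2] in ("Cl", "Br"):
                symbol = smiles[i:i + 2]
            elif ch in "BCNOPSFI":
                symbol = ch
            else:
                raise ValueError("unsupported character: " + ch)
            atoms.append(symbol)
            index = len(atoms) - 1
            if previous is not None:
                bonds.append((previous, index, pending))
            previous = index
            pending = "-"
            i += len(symbol)
    if stack:
        raise ValueError("unbalanced '('")
    return atoms, bonds


for text in ("CCN(CC)CC", "CC(C)C(=O)O"):
    print(text, parse_branched_smiles(text))
```

When a parenthesis closes, the reader goes back to the atom immediately to the left of the matching opening parenthesis, which is the "implicit connection to the left" described above.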

Cyclic Structures

Cyclic structures are represented by breaking one bond in each ring. The bonds are numbered in any order, designating ring opening (or ring closure) bonds by a digit immediately following the atomic symbol at each ring closure. This leaves a connected non-cyclic graph which is written as a non-cyclic structure using the three rules described above. Cyclohexane is a typical example:
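A minimal sketch of the ring-closure rule, assuming a SMILES subset made only of unbracketed single-letter atoms and single-digit ring labels (no branches, no bond symbols). The function name is hypothetical. Cyclohexane is commonly written C1CCCCC1: the first and last carbons carry the same digit, which restores the bond that was broken to open the ring.

```python
def parse_ring_smiles(smiles):
    """Return (atoms, bonds) for a chain of atoms with ring-closure digits."""
    atoms = []
    bonds = []
    open_rings = {}   # digit -> index of the atom that opened that ring bond
    previous = None
    for ch in smiles:
        if ch in "BCNOPSFI":
            atoms.append(ch)
            index = len(atoms) - 1
            if previous is not None:
                bonds.append((previous, index))   # ordinary chain bond
            previous = index
        elif ch in "0123456789":
            if previous is None:
                raise ValueError("ring digit before any atom")
            if ch in open_rings:
                # Second occurrence of the digit: close the ring.
                bonds.append((open_rings.pop(ch), previous))
            else:
                # First occurrence: remember where the ring was opened.
                open_rings[ch] = previous
        else:
            raise ValueError("unsupported character: " + ch)
    if open_rings:
        raise ValueError("unclosed ring bond")
    return atoms, bonds


atoms, bonds = parse_ring_smiles("C1CCCCC1")
print(len(atoms), "atoms,", len(bonds), "bonds:", bonds)
```

The six chain atoms give five chain bonds, and the matching digit adds the sixth bond that closes the ring.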

Isomeric SMILES

This section describes the SMILES rules used to specify isotopism, configuration about double bonds, and chirality. The term isomeric SMILES collectively refers to SMILES written using these rules. The SMILES isomer specification rules allow chirality to be completely specified for any structure, if it is known. Unlike most existing chemical nomenclatures such as CIP and IUPAC, these rules are also designed to allow rigorous partial specification of chirality. Aside from use in macros, substructure searching, and other pattern matching operations, this is important because much of the world's available chemical information is known for structures with incompletely resolved chiralities (not all possible chiral centers are separated, known, or reported).
All isomer specification rules in SMILES are therefore optional. The absence of a specification for any attribute implies that the value of that attribute is unspecified.

Aromaticity

Aromaticity must be deduced in a system such as SMILES which generates an unambiguous chemical nomenclature because of the fundamental requirement to characterize the symmetry of a molecule. Given effective aromaticity-detection algorithms, it is not necessary to enter any structure as aromatic if the user prefers to enter an aliphatic (Kekulé-like) structure. Entering structures as aromatic directly (i.e., by using lower case atomic symbols) provides a shortcut to accurate chemical specification and is closer to the mental molecular model used by most chemists. The SMILES algorithm uses an extended version of Hueckel's rule to identify aromatic molecules and ions. To qualify as aromatic, all atoms in the ring must be sp² hybridized and the number of available "excess" p-electrons must satisfy Hueckel's 4N+2 criterion. As an example, benzene is written c1ccccc1, but an entry of C1=CC=CC=C1 - cyclohexatriene, the Kekulé form - leads to detection of aromaticity and results in an internal structural conversion to aromatic representation. Conversely, entries of c1ccc1 and c1ccccccc1 will produce the correct anti-aromatic structures for cyclobutadiene and cyclooctatetraene, C1=CC=C1 and C1=CC=CC=CC=C1. In such cases the SMILES system looks for a structure that preserves the implied sp² hybridization, the implied hydrogen count, and the specified formal charge, if any. Some inputs, however, may not only be incorrect but also impossible, such as c1cccc1. Here c1cccc1 cannot be converted to C1=CCC=C1 since one of the carbon atoms would be sp³ with two attached hydrogens. In such a structure alternating single and double bond assignments cannot be made. The SMILES system will flag this as an "impossible" input. Please note that only atoms on the following list can be considered aromatic: C, N, O, P, S, As, Se, and * (wildcard). In addition, exocyclic double bonds do not break aromaticity.
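The 4N+2 electron count on its own is easy to check. The sketch below tests only that criterion and assumes, for neutral all-carbon rings, that each sp² carbon contributes one p-electron; it does not reproduce the extended rule or the hybridization checks that the SMILES system performs, and the function name is made up.

```python
def satisfies_huckel(pi_electrons):
    """True when the count equals 4N + 2 for some integer N >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0


# Assumed electron counts: one p-electron per ring carbon.
examples = [
    ("benzene, c1ccccc1", 6),
    ("cyclobutadiene, c1ccc1", 4),
    ("cyclooctatetraene, c1ccccccc1", 8),
]
for name, electrons in examples:
    verdict = "satisfies" if satisfies_huckel(electrons) else "fails"
    print(name, "with", electrons, "p-electrons", verdict, "4N+2")
```

Benzene passes with N = 1, while the four- and eight-electron rings fail, which is consistent with the text treating them as anti-aromatic Kekulé structures.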

Hydrogens

Hydrogens in reactions are handled as with molecules; they are suppressed unless "special". Recall that for molecules, hydrogens are special if they are: charged, isotopic, bonded to another hydrogen, or multiply bonded. With reactions, there is an additional case which will make a hydrogen special. It is often desirable (e.g. for a 1,5-hydride shift) to store information about the location of hydrogens as part of the atom map of a reaction. Hydrogens with a supplied atom map are considered "special" and these hydrogens are not suppressed. These mapped hydrogens appear explicitly in Absolute SMILES for reactions. They do not, however, appear in Unique SMILES.
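The list of "special" cases can be written as a single predicate. The sketch below uses a made-up record type, not a Daylight Toolkit structure, and it assumes that "multiply bonded" means a hydrogen attached to more than one atom, such as a bridging hydrogen.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Hydrogen:
    """Hypothetical description of one hydrogen atom."""
    charge: int = 0
    isotope: Optional[int] = None          # mass number if given, e.g. 2
    neighbors: Tuple[str, ...] = ("C",)    # element symbols of bonded atoms
    atom_map: Optional[int] = None         # reaction atom map, if supplied


def is_special(h):
    """True if the hydrogen should be written explicitly, not suppressed."""
    return (
        h.charge != 0                  # charged
        or h.isotope is not None       # isotopic
        or "H" in h.neighbors          # bonded to another hydrogen
        or len(h.neighbors) > 1        # assumed meaning of "multiply bonded"
        or h.atom_map is not None      # mapped hydrogen in a reaction
    )


print(is_special(Hydrogen()))               # ordinary hydrogen on carbon
print(is_special(Hydrogen(atom_map=1)))     # atom-mapped hydrogen
```

Only the last condition is specific to reactions; the first four are the molecule rules recalled in the paragraph above.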

For More Information on SMILES, visit
http://www.daylight.com/

Tuesday, January 4, 2011

Protein Data Bank

What is Protein Data Bank?

A repository for 3-D biological macromolecular structure
All data are available to the public
It includes proteins, nucleic acids and viruses
Obtained by X-Ray crystallography (80%) or NMR spectroscopy (16%)
Submitted by biologists and biochemists from around the world

History of Protein Data Bank

Founded in 1971 by Brookhaven National Laboratory, New York
The first sets of data were entered on punched cards, and later on magnetic tape
Transferred to the Research Collaboratory for Structural Bioinformatics (RCSB) in 1998
Currently it holds 29,000 released structures

FtsH peptidase

The signal recognition particle (SRP) is a multimeric protein, which along with its conjugate receptor (SR), is involved in targeting secretory proteins to the rough endoplasmic reticulum (RER) membrane in eukaryotes, or to the plasma membrane in prokaryotes PUBMED:17622352, PUBMED:16469117. SRP recognises the signal sequence of the nascent polypeptide on the ribosome, retards its elongation, and docks the SRP-ribosome-polypeptide complex to the RER membrane via the SR receptor. SRP consists of six polypeptides (SRP9, SRP14, SRP19, SRP54, SRP68 and SRP72) and a single 300 nucleotide 7S RNA molecule. The RNA component catalyses the interaction of SRP with its SR receptor PUBMED:17507650. In higher eukaryotes, the SRP complex consists of the Alu domain and the S domain linked by the SRP RNA. The Alu domain consists of a heterodimer of SRP9 and SRP14 bound to the 5' and 3' terminal sequences of SRP RNA. This domain is necessary for retarding the elongation of the nascent polypeptide chain, which gives SRP time to dock the ribosome-polypeptide complex to the RER membrane.
This entry represents the N-terminal helical bundle domain of the 54 kDa SRP54 component, a GTP-binding protein that interacts with the signal sequence when it emerges from the ribosome. SRP54 of the signal recognition particle has a three-domain structure: an N-terminal helical bundle domain, a GTPase domain, and the M-domain that binds the 7S RNA and also binds the signal sequence. The extreme C-terminal region is glycine-rich, lower in complexity, and poorly conserved between species.
These proteins include Escherichia coli and Bacillus subtilis ffh protein (P48), which seems to be the prokaryotic counterpart of SRP54; signal recognition particle receptor alpha subunit (docking protein), an integral membrane GTP-binding protein which ensures, in conjunction with SRP, the correct targeting of nascent secretory proteins to the endoplasmic reticulum membrane; bacterial FtsY protein, which is believed to play a similar role to that of the docking protein in eukaryotes; the pilA protein from Neisseria gonorrhoeae, the homologue of ftsY; and bacterial flagellar biosynthesis protein flhF.

Primary Citation

Cryo-EM Structure of the E. coli Translating Ribosome in Complex with SRP and its Receptor.
Authors: Estrozi, L.F., Boehringer, D., Shan, S.-O., Ban, N., Schaffitzel, C.
Journal: (2010) Nat. Struct. Mol. Biol.
Not in PubMed

Molecular Description

Classification:	Protein Transport
Structure Weight:	110291.10

Molecule: SIGNAL RECOGNITION PARTICLE PROTEIN
Type: polypeptide(L)
Length: 294
EC#: 3.6.5.4
Fragment: NG DOMAIN, RESIDUES 1-294

Molecule: 4.5S RNA
Type: polyribonucleotide
Length: 114
Other Details: ONLY THE PART OF THE 4.5S RNA THAT IS VISIBLE IN THE EM RECONSTRUCTION IS INCLUDED

Molecule: SIGNAL RECOGNITION PARTICLE PROTEIN
Type: polypeptide(L)
Fragment: M DOMAIN, RESIDUES 329-430
Other Details: ONLY THE PART OF THE M DOMAIN THAT IS VISIBLE IN THE EM RECONSTRUCTION IS INCLUDED

Molecule: CELL DIVISION PROTEIN FTSY
Type: polypeptide(L)
Length: 303

Source

Polymer: 1
Scientific Name: Escherichia coli
Expression System: Escherichia coli

Polymer: 2
Scientific Name: Escherichia coli
Expression System: Escherichia coli

Polymer: 3
Scientific Name: Escherichia coli

Polymer: 4
Scientific Name: Escherichia coli
Expression System: Escherichia coli

Experiment Details

Method: ELECTRON MICROSCOPY

Resolution [Å]: 13.5
Aggregation State: PARTICLE
Reconstruction Method: SINGLE PARTICLE
Specimen Type: VITREOUS ICE

Gene Ontology

Type	Synonym
narrow:	protein folding chaperone
related:	protein tagging activity
related:	protein degradation tagging activity
exact:	protein amino acid binding
related:	alpha-2 macroglobulin receptor-associated protein activity

Thermolysin

Metalloproteases are the most diverse of the four main types of protease, with more than 50 families identified to date. In these enzymes, a divalent cation, usually zinc, activates the water molecule. The metal ion is held in place by amino acid ligands, usually three in number. The known metal ligands are His, Glu, Asp or Lys, and at least one other residue is required for catalysis, which may play an electrophilic role. Of the known metalloproteases, around half contain an HEXXH motif, which has been shown in crystallographic studies to form part of the metal-binding site PUBMED:7674922. The HEXXH motif is relatively common, but can be more stringently defined for metalloproteases as 'abXHEbbHbc', where 'a' is most often valine or threonine and forms part of the S1' subsite in thermolysin and neprilysin, 'b' is an uncharged residue, and 'c' a hydrophobic residue. Proline is never found in this site, possibly because it would break the helical structure adopted by this motif in metalloproteases PUBMED:7674922.
In the MEROPS database peptidases and peptidase homologues are grouped into clans and families. Clans are groups of families for which there is evidence of common ancestry based on a common structural fold:

Each clan is identified with two letters, the first representing the catalytic type of the families included in the clan (with the letter 'P' being used for a clan containing families of more than one of the catalytic types serine, threonine and cysteine). Some families cannot yet be assigned to clans, and when a formal assignment is required, such a family is described as belonging to clan A-, C-, M-, S-, T- or U-, according to the catalytic type. Some clans are divided into subclans because there is evidence of a very ancient divergence within the clan, for example MA(E), the gluzincins, and MA(M), the metzincins.
Peptidase families are grouped by their catalytic type, the first character representing the catalytic type: A, aspartic; C, cysteine; G, glutamic acid; M, metallo; S, serine; T, threonine; and U, unknown. The serine, threonine and cysteine peptidases utilise the amino acid as a nucleophile and form an acyl intermediate - these peptidases can also readily act as transferases. In the case of aspartic, glutamic and metallopeptidases, the nucleophile is an activated water molecule.

In many instances the structural protein fold that characterises the clan or family may have lost its catalytic activity, yet retain its function in protein recognition and binding.
This group of metallopeptidases constitutes the MEROPS peptidase family M4 (thermolysin family, clan MA(E)). The protein fold of the peptidase domain of thermolysin is the type example for members of the clan MA. The thermolysin family is composed only of secreted eubacterial endopeptidases. The zinc-binding residues are H-142, H-146 and E-166, with E-143 acting as the catalytic residue. Thermolysin also contains 4 calcium-binding sites, which contribute to its unusual thermostability. The family also includes enzymes from a number of pathogens, including Legionella and Listeria, and the protein pseudolysin, all with a substrate specificity for an aromatic residue in the P1' position. Three-dimensional structure analysis has shown that the enzymes undergo a hinge-bend motion during catalysis. Pseudolysin has a broader specificity, acting on large molecules such as elastin and collagen, possibly due to its wider active site cleft PUBMED:7674922.

Authors: Juers, D.H., Weik, M.

Experiment Details
Method:   X-RAY DIFFRACTION

Unit Cell:
	Length [Å]	Angles [°]
	a = 93.26	α = 90.00
	b = 93.26	β = 90.00
	c = 128.69	γ = 120.00
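As a worked example of what the unit cell parameters imply, the sketch below computes the cell volume from the three edge lengths and three angles with the general triclinic formula V = abc·sqrt(1 − cos²α − cos²β − cos²γ + 2·cosα·cosβ·cosγ). The function name is made up for this illustration; the numbers are the cell values listed above.

```python
import math


def unit_cell_volume(a, b, c, alpha, beta, gamma):
    """Volume in cubic angstroms from edges (angstroms) and angles (degrees)."""
    ca, cb, cg = (math.cos(math.radians(angle)) for angle in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg)


# Cell listed above: a = b = 93.26, c = 128.69, angles 90, 90, 120 degrees.
volume = unit_cell_volume(93.26, 93.26, 128.69, 90.0, 90.0, 120.0)
print("unit cell volume in cubic angstroms:", round(volume))
```

With α = β = 90° and γ = 120° the formula reduces to a·b·c·sin(120°), giving roughly 9.7 × 10^5 Å³ for this cell.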

Gene Ontology

Type	Synonym
exact :	metalloendoproteinase activity
exact :	metalloendoprotease activity

Primary Citation
Radiation damage study of thermolysin - 100K structure B (2.5 MGy)
Juers, D.H., Weik, M.
Journal: To be Published
Not in PubMed

Molecular Description

Classification:	Hydrolase
Structure Weight:	34588.30

Molecule: Thermolysin
Type: polypeptide(L)
Length: 316
EC#: 3.4.24.27
Fragment: UNP residues 233-548

Source
Polymer: 1
Scientific Name: Bacillus thermoproteolyticus

Leucyl Aminopeptidase

Aminopeptidases are exopeptidases involved in the processing and regular turnover of intracellular proteins, although their precise role in cellular metabolism is unclear PUBMED:1555602, PUBMED:2395881. Leucine aminopeptidases cleave leucine residues from the N-terminus of polypeptide chains, but substantial rates are evident for all amino acids PUBMED:2395881.
The enzymes exist as homo-hexamers, comprising 2 trimers stacked on top of one another PUBMED:2395881. Each monomer binds 2 zinc ions and folds into 2 alpha/beta-type quasi-spherical globular domains, producing a comma-like shape PUBMED:2395881. The N-terminal 150 residues form a 5-stranded beta-sheet with 4 parallel and 1 anti-parallel strand sandwiched between 4 alpha-helices PUBMED:2395881. An alpha-helix extends into the C-terminal domain, which comprises a central 8-stranded saddle-shaped beta-sheet sandwiched between groups of helices, forming the monomer hydrophobic core PUBMED:2395881. A 3-stranded beta-sheet resides on the surface of the monomer, where it interacts with other members of the hexamer PUBMED:2395881. The two zinc ions and the active site are entirely located in the C-terminal catalytic domain PUBMED:2395881.

Authors: Natarajan, S., Huynh, K.-H., Kang, L.W.

Experimental Details
Method:   X-RAY DIFFRACTION

Unit Cell:
	Length [Å]	Angles [°]
	a = 152.13	α = 90.00
	b = 152.13	β = 90.00
	c = 152.13	γ = 90.00

Gene Ontology
Type	Synonym
related:	nucleocytoplasm
exact:	internal to cell
related:	protoplast
exact:	protoplasm

Primary Citation
Crystal structure of Leucyl Aminopeptidase (pepA) from Xoo0834, Xanthomonas oryzae pv. oryzae KACC10331
Natarajan, S., Huynh, K.-H., Kang, L.W.
Journal: To be Published
Not in PubMed

Molecular Description

Classification:	Hydrolase
Structure Weight:	102628.49

Molecule: Probable cytosol aminopeptidase
Polymer: 1
Type: polypeptide(L)
Length: 490
Chains: A, B
EC#: 3.4.11.1

Source
Polymer: 1
Scientific Name: Xanthomonas oryzae pv. oryzae
Expression System: Escherichia coli

For More Information, visit:
http://www.rcsb.org/pdb/home/home.do