R (programming language)

The R Project for Statistical Computing



R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.

R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.

R Screenshots

MacOS X RAqua desktop
Unix desktop

Graphics Examples

box and whisker plots
piechart
pairs plot
coplot
another coplot that shows nice interactions
3d plot of a surface
image and 3d plot of a volcano
mathematical annotation in plots
forest plot (plot of confidence intervals in a meta-analysis)

All images on this site are Copyright (C) the R Foundation and may be reproduced for any purpose provided they are credited to the R statistical software using an attribution like "(C) R Foundation, from http://www.r-project.org".

One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.

The R environment

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes

  • an effective data handling and storage facility,
  • a suite of operators for calculations on arrays, in particular matrices,
  • a large, coherent, integrated collection of intermediate tools for data analysis,
  • graphical facilities for data analysis and display either on-screen or on hardcopy, and
  • a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.

The term "environment" is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.

R, like S, is designed around a true computer language, and it allows users to add additional functionality by defining new functions. Much of the system is itself written in the R dialect of S, which makes it easy for users to follow the algorithmic choices made. For computationally-intensive tasks, C, C++ and Fortran code can be linked and called at run time. Advanced users can write C code to manipulate R objects directly.

Many users think of R as a statistics system. We prefer to think of it as an environment within which statistical techniques are implemented. R can be extended (easily) via packages. There are about eight packages supplied with the R distribution and many more are available through the CRAN family of Internet sites covering a very wide range of modern statistics.

R has its own LaTeX-like documentation format, which is used to supply comprehensive documentation, both on-line in a number of formats and in hardcopy.



Moving Towards Open Cloud?

The Cloud Computing Interoperability Forum (CCIF) is drafting an Open Cloud Manifesto. The idea is to embrace the fundamental principles of an open cloud with the help of the worldwide cloud community. The Open Cloud Manifesto will describe principles and guidelines for interoperability in cloud computing. Unfortunately, two major cloud computing players, Amazon and Microsoft, have not given any positive commitment to the Open Cloud Manifesto. On his blog, Steve Martin (Microsoft Azure product manager) writes,
We were admittedly disappointed by the lack of openness in the development of the Cloud Manifesto. What we heard was that there was no desire to discuss, much less implement, enhancements to the document despite the fact that we have learned through direct experience. Very recently we were privately shown a copy of the document, warned that it was a secret, and told that it must be signed "as is," without modifications or additional input. It appears to us that one company, or just a few companies, would prefer to control the evolution of cloud computing, as opposed to reaching a consensus across key stakeholders (including cloud users) through an “open” process. An open Manifesto emerging from a closed process is at least mildly ironic.
Well, all I can say is that the idea is worth trying, but the intentions matter more. Nowadays a lot of self-promotional "open" movements are floating around, and you cannot stake your hopes on each and every effort. Suddenly there is a flood of "Open" movements, and people are really confused about what's going on. To me it looks like a rat race to win a Nobel Prize for the "Open" movement.


Lowering Pharma Firewalls: Just for Bioinformatics, or for Chemoinformatics Too?

The notion of pre-competitive collaboration has been under steady experimentation for quite some time now. Notable examples are the Airbus consortium of European aircraft manufacturers, the Sematech consortium of US semiconductor manufacturers, banks working together to launch Visa and Mastercard, our recent moon lust and many more. But this was never the case for the pharmaceutical industry until now, which is lowering industry firewalls to shift funding and focus from early- to late-stage projects by developing cooperation in areas with little potential for differentiation, most notably a shared informatics infrastructure built through public-private partnerships. Pre-competitive collaboration in this process means that everyone will have the same common pool of data and resources. Competition will still be there, but for better ideas, for better models and to discover first.

Pre-competitive informatics initiatives
A very interesting opinion piece appeared in the September issue of Nature Reviews Drug Discovery discussing the importance of pre-competitive informatics initiatives in drug discovery. The article suggests that many companies are already beginning to embrace this idea.
This was a very timely review in the wake of several initiatives such as the Innovative Medicines Initiative (IMI), the EBI industry programme, the Pistoia Alliance and many others. The idea of lowering industry firewalls caught more attention after the announcement of Sage Bionetworks, a non-profit medical research organization established this year on the initiative of the Merck duo Eric Schadt and Stephen Friend. A similar effort, Open Source Drug Discovery (OSDD), was launched by CSIR, India earlier this year with an initial investment of US $38 million. The OSDD consortium is trying to implement an open source model for drug discovery, and public-private partnership is one of the major focuses of this initiative.

Exciting, isn't it? But wait, there is a twist in the story: there are no definitive answers for what type of data is pre-competitive and what is not. The definition of pre-competitive is fluid, and it depends on several factors, one of them being whether the data belongs to biology or chemistry. The article suggests that any data and tools used by biologists should be under consideration for pre-competitive sharing, but those used by chemists should remain competitive or proprietary (which is very much in line with current trends). I could not find any rational reason behind this argument except the fact that there is an overwhelming amount of public data in the biology domain, and day by day companies and institutions are finding it hard to manage, integrate and use it for drug discovery. I will go further and suggest that these initiatives serve little benefit unless the data and tools belonging to the chemistry domain are also considered pre-competitive.

Ironically, much of the data and tools released by pharmaceutical companies under these initiatives are yet to prove their importance. For instance, the much-hyped Life Science Grid released by Eli Lilly (which went open source in 2008) failed to attract even an average user base. Lilly released only the biology side of the grid, which includes a selected group of non-proprietary plug-ins, including those for Gene Browser, NCBI Entrez, and Gene Ontology. Forgive me, but there are already better tools for biology in the public domain. In my opinion, unhindered access to data and tools is a prerequisite for the success of the pre-competitive landscape, which requires more active contributions from the industry participants. Currently the system is evolving, and for now something is better than nothing.
Apart from the issues related to the definition of pre-competitive boundaries, several other bottlenecks remain unresolved, for instance who will fund the long-term maintenance of such an infrastructure.



Circos - Visualizing the Genome, Among Other Things

Circos is designed for visualizing genomic data such as alignments, conservation, and generalized 2D data, such as line, scatter, heatmap and histogram plots. Circos is very flexible — you can use it to visualize any kind of data, not just genomics. Circos has been used to visualize customer flow in the auto industry, volume of courier shipments, database schemas, and presidential debates.

The creation of Circos was motivated by a need to visualize intra- and inter-chromosomal relationships within one or more genomes, or between any two or more sets of objects with a corresponding distance scale. Circos is similar to chromowheel and, to a lesser extent, genopix.

Circos uses a circular composition of ideograms to mitigate the fact that some data, like combinations of intra- and inter-chromosomal relationships (alignments, duplications, assembly paired-ends, etc) are very difficult to organize when the underlying ideograms (or contigs) are arranged as lines. In many cases, it is impossible to keep the relationship lines from crossing other structures and this deteriorates the effectiveness of the graphic.

Specific features are included to help viewing data on the genome. The genome is a large structure with localized regions of interest, frequently separated by large oceans of uninteresting sequence. To help visualize data in this context, Circos can create images with variable axis scaling, permitting local magnification of genomic regions to be controlled without cropping. Scale smoothing ensures that the magnification level changes smoothly. In combination with axis breaks and custom ideogram order, the final image can be easily tuned to offer the clearest illustration of your data.

All aspects of the output image are tunable, making Circos a flexible and extensible tool for the generation of publication-quality, circularly composited renditions of genomic data and related annotations.

Circos is written in Perl and produces bitmap (PNG) and vector (SVG) images using plain text configuration and input files.
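Because Circos is driven entirely by plain-text configuration, it slots easily into scripted pipelines. As a minimal sketch (assuming Circos is installed on the PATH and a configuration file named circos.conf already exists; both are assumptions, not part of the original post), a Python wrapper can simply invoke it:

```python
import shutil
import subprocess

# Circos itself is a Perl program; from a Python pipeline it can simply be
# invoked on a plain-text configuration file ("circos.conf" is assumed here).
if shutil.which("circos") is None:
    raise SystemExit("circos executable not found on PATH")

# Circos reads the configuration and writes its PNG/SVG output as configured.
subprocess.run(["circos", "-conf", "circos.conf"], check=True)
```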


Evolving computing definitions and realms


Melanie Swan has written an interesting piece about the expanding notion of computing on her blog, Broader Perspective. In response to the question of how computing definitions and realms are evolving, she writes

the traditional linear Von Neumann model is being extended with new materials, 3D architectures, molecular electronics and solar transistors. Novel computing models are being investigated such as quantum computing, parallel architectures, cloud computing, liquid computing and the cell broadband architecture like that used in the IBM Roadrunner supercomputer. Biological computing models and biology as a substrate are also under exploration with 3D DNA nanotechnology, DNA computing, biosensors, synthetic biology, cellular colonies and bacterial intelligence, and the discovery of novel computing paradigms existing in biology such as the topological equations by which ciliate DNA is encrypted.
[Embedded presentation: Evolving Computational Models]
A related, but broader, presentation about the future of the life sciences is also available.


Backup and fault tolerance in systems biology: Striking similarity with Cloud computing

The striking similarity between biological systems and computing paradigms is not new, and in the past there have been several attempts to draw an analogy between systems biology and computing systems. Interested readers may see my last post, which examines how the systems biology of a human can be described as a grid of supercomputers. Over time, researchers have developed several bio-inspired fault-tolerance methods to support fault detection and removal in both hardware and software systems, such as fault-tolerant hardware inspired by ideas from embryology and immune systems. Fault tolerance is the ability of a system to retain its intended functionality even in the presence of faults; in the case of living cells, fault tolerance comes from the intrinsic robustness of their gene regulatory networks, which can easily be observed in the mutation-insensitive expression of genes with phenotypic features. In a recent issue of the journal Molecular Systems Biology, Anthony Gitter and co-authors suggest that gene regulatory networks also have backup plans, very much like cloud computing networks or the MapReduce framework, where the failure of a computing node is managed by re-executing its task on another node. Fault tolerance is seen as a mechanism to retain the functionality of a master gene in extreme circumstances through a controller mechanism, while the backup plan employs another gene with reasonable sequence similarity to the master gene to perform the tasks that are key for the survival of the cell itself. Their findings suggest that
[T]he overwhelming majority of genes bound by a particular transcription factor (TF) are not affected when that factor is knocked out. Here, we show that this surprising result can be partially explained by considering the broader cellular context in which TFs operate. Factors whose functions are not backed up by redundant paralogs show a fourfold increase in the agreement between their bound targets and the expression levels of those targets.
Figure: TF backup in gene regulatory networks. The yellow TF, which has sequence similarity as well as shared interactions with the green TF, can replace the green TF when it is knocked out and recruit the transcription machinery, leading to only a small overlap between binding and knockout results.

To understand the systems biology of the robustness provided by redundant TFs and their role in the broader cellular context, the authors explored how their findings depend on the TFs' homology relationships and the shared protein interaction network. They observed that TFs with the most similar paralogs had no overlap between their binding and knockout data, while protein interaction networks provide physical support for knockout effects. Gitter further describes the importance of his research:
It's extremely rare in nature that a cell would lose both a master gene and its backup, so for the most part cells are very robust machines. We now have reason to think of cells as robust computational devices, employing redundancy in the same way that enables large computing systems, such as Amazon, to keep operating despite the fact that servers routinely fail
Figure: A simple master/backup mechanism in the MapReduce framework.
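To make the computing side of the analogy concrete, here is a minimal illustrative Python sketch, not taken from the paper and with all names hypothetical, of the MapReduce-style backup idea: when a worker node fails, its task is simply re-executed on another node, so the overall job still completes.

```python
import random

def run_on_worker(worker, task):
    """Pretend to run a task on a worker node; a node may fail at random."""
    if random.random() < 0.3:  # simulated node failure
        raise RuntimeError(f"worker {worker} failed while running {task}")
    return f"result of {task} (computed on {worker})"

def run_with_backup(task, workers):
    """Re-execute the task on the next available worker until one succeeds,
    mirroring how a redundant paralog can stand in for a knocked-out TF."""
    for worker in workers:
        try:
            return run_on_worker(worker, task)
        except RuntimeError as err:
            print(err, "-> rescheduling on a backup worker")
    raise RuntimeError(f"all workers failed for {task}")

if __name__ == "__main__":
    workers = ["node-1", "node-2", "node-3"]
    for task in ["map-shard-0", "map-shard-1", "reduce-0"]:
        print(run_with_backup(task, workers))
```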


BioPAX or SBML?

Katherine Mejia asked this question on Twitter. For now I want to keep it short:

  • Both BioPAX (Level 3) and SBML (Level 2) can encode signaling pathways, metabolic pathways and regulatory pathways, although SBML can represent finer details.
  • If your aim is simulation and prediction, then SBML seems to be the reasonable choice.
  • SBML has better software and API support than BioPAX; this is no doubt a major advantage of SBML (see the sketch below).
  • On the other hand, BioPAX is the richer standard when it comes to encoding interactions, whether genetic, molecular or non-molecular.
Figure: BioPAX or SBML or CellML? A visual comparison of SBML, BioPAX, CellML and PSI.
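To illustrate the API-support point above, here is a minimal sketch using libSBML's Python bindings (assuming libSBML is installed; the model content is invented purely for illustration) that builds and serializes a tiny SBML Level 2 model:

```python
import libsbml

# Build a minimal SBML Level 2 Version 4 model with one compartment,
# one species and one (empty) reaction.
document = libsbml.SBMLDocument(2, 4)
model = document.createModel()
model.setId("toy_pathway")

compartment = model.createCompartment()
compartment.setId("cell")
compartment.setSize(1.0)

species = model.createSpecies()
species.setId("glucose")
species.setCompartment("cell")
species.setInitialConcentration(1.0)

reaction = model.createReaction()
reaction.setId("glucose_uptake")
reactant = reaction.createReactant()
reactant.setSpecies("glucose")

# Serialize to an SBML string that any SBML-aware tool should accept.
print(libsbml.writeSBMLToString(document))
```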




Bioinformatics and Systems Biology: Multidisciplinary scientists versus interdisciplinary scientists

Same old wine in a new bottle: a debate which the bioinformatics community has followed for years has now become a major agenda item for the systems biology community as well. During the recent Systems Biology Inter-DTC Conference at the University of Manchester Systems Biology Centre, there was a special session on the question "can scientists be multidisciplinary?". Steve Checkley has written a detailed opinion about this session and the debate that followed. The current status of systems biology is quite similar to what we saw five years back in the bioinformatics community; both have very similar fundamental issues. Part of the problem is the interchangeable use of the definitions of interdisciplinarity and multidisciplinarity. According to Wikipedia, multidisciplinarity is defined as
Multidisciplinarity is the act of joining together two or more disciplines without integration. Each discipline yields discipline specific results while any integration would be left to a third party observer. An example of multidisciplinarity would be a panel presentation on the many facets of the AIDS pandemic (medicine, politics, epidemiology) in which each section is given as a stand-alone presentation.
A multidisciplinary community or project is made up of people from different disciplines and professions who are engaged in working together as equal stakeholders in addressing a common challenge. The key question is how well can the challenge be decomposed into nearly separable subparts, and then addressed via the distributed knowledge in the community or project team. The lack of shared vocabulary between people and communication overhead is an additional challenge in these communities and projects. However, if similar challenges of a particular type need to be repeatedly addressed, and each challenge can be properly decomposed, a multidisciplinary community can be exceptionally efficient and effective. A multidisciplinary person is a person with degrees from two or more academic disciplines, so one person can take the place of two or more people in a multidisciplinary community or project team. Over time, multidisciplinary work does not typically lead to an increase nor a decrease in the number of academic disciplines.
while interdisciplinarity can be described as

"Interdisciplinarity" in referring to an approach to organizing intellectual inquiry is an evolving field, and stable, consensus definitions are not yet established for some subordinate or closely related fields.
An interdisciplinary community or project is made up of people from multiple disciplines and professions who are engaged in creating and applying new knowledge as they work together as equal stakeholders in addressing a common challenge. The key question is what new knowledge (of an academic discipline nature), which is outside the existing disciplines, is required to address the challenge. Aspects of the challenge cannot be addressed easily with existing distributed knowledge, and new knowledge becomes a primary subgoal of addressing the common challenge. The nature of the challenge, either its scale or complexity, requires that many people have interactional expertise to improve their efficiency working across multiple disciplines as well as within the new interdisciplinary area. An interdisciplinary person is a person with degrees from one or more academic disciplines with additional interactional expertise in one or more additional academic disciplines, and new knowledge that is claimed by more than one discipline. Over time, interdisciplinary work can lead to an increase or a decrease in the number of academic disciplines.

Unlike multidisciplinarity, which brings two or more disciplines together without any integration, with each discipline approaching the problem from its own perspective, interdisciplinarity uses an integrated approach to solve those problems. Both multidisciplinary co-operation and interdisciplinary integration have the same intent: providing practical solutions to practical problems that are either confined in the separateness of unidisciplinarity or inherent in present conditions of specialization. In practice, multidisciplinary co-operation has always been favored over interdisciplinary integration, without very strong reasons. For example, pharmaceutical companies have always preferred a multidisciplinary team of computer scientists and biologists rather than recruiting interdisciplinary bioinformaticians. Although interdisciplinary scientists are highly trained in several disciplines, in general they are not considered specialists in any of them. In the case of systems biology, as Steve reports, pharmaceutical companies want to recruit hard-core mathematicians rather than trained systems biologists, on the very same argument that a multidisciplinary team of mathematicians and biologists can have a better impact on the research.

No doubt interdisciplinary-trained scientists are good at solving problems, but they cannot be hired to implement solutions that require more than an interdisciplinary approach, or let's say a specialist. Conversely, nothing stops a trained specialist from becoming a generalist or interdisciplinary. The existing R&D environment in both academia and industry is preoccupied with and optimized for the specialist in the leading role, while interdisciplinary-trained scientists play secondary roles. Ironically, the road to interdisciplinarity seems to be one-way, though it may be too early to predict that. In the end it does not matter whether you are a specialist or interdisciplinary; success depends on the passion to do the right things at the right time, and whatever you do, if you do it at your best you will be successful. It would be premature to say that interdisciplinary scientists will not find their place in a specialist world; some interdisciplinary scientists will even turn into specialists some day. For now, pharmaceutical companies are playing safe, as systems biology is an emerging discipline that is yet to prove itself.


SBML Level 3 is arriving

Last Sunday the Systems Biology Markup Language (SBML) community released a draft specification of SBML Level 3 Version 1 Core. The community has been working steadily for quite some time to release a modular version of this widely accepted XML format. In summary:
The next Level of SBML will be modular, in the sense of having a defined core set of features and optional packages adding features on top of the core. This modular approach means that models can declare which feature-sets they use, and likewise, software tools can declare which packages they support.
Apart from the SBML Level 3 Core, the community is also reviewing several proposals for optional extension packages such as Layout, Rendering, Multistate Multicomponent Species, Hierarchical Model Composition, Geometry, Qualitative Models and more. Nearly all Level 3 activities are currently in the proposal stage.
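As a rough sketch of what the modular approach could mean for tools, the snippet below uses libSBML's Python bindings (assuming a libSBML build with Level 3 package support; "model.xml" is a hypothetical input file) to report which optional packages a model declares before deciding whether to process it:

```python
import libsbml

# Read a (hypothetical) Level 3 model and report which optional packages
# it declares, and whether each one is flagged as required.
document = libsbml.readSBMLFromFile("model.xml")
print("SBML Level", document.getLevel(), "Version", document.getVersion())

for i in range(document.getNumPlugins()):
    package = document.getPlugin(i).getPackageName()
    print("uses package:", package,
          "| required:", document.getPackageRequired(package))
```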
The SBML project started in the year 2000 with a small team of researchers dedicated to developing better software infrastructure and standards for computational modeling in systems biology, and nine years later SBML is supported by more than 170 software tools, making it one of the most successful open, community-based standardization projects in the life science domain. SBML project chair Michael Hucka attributes the success of the project, especially the 170 supporting tools, to the excellent libSBML library. SBML is in fact a very interesting case study in how open community projects should be organized and executed. The SBML community has not restricted itself to the XML standard alone; close collaboration between community members has created the foundation for several sister projects such as the Systems Biology Ontology (SBO), BioModels and, most notably, the recently released Systems Biology Graphical Notation (SBGN). The success of SBML also makes me wonder whether community-based projects have geographical and cultural advantages or disadvantages.


The Systems Architecture of a Bacterial Cell Cycle



A Meditation on Biological Modeling

No doubt biological modeling is meant for mainstream biologists as well, not just for hard-core computer scientists or mathematicians. Not everyone who uses a computer knows how transistors work, and to drive a car you don't need to be a mechanic. Presented at SciFoo 2009, this short video takes a lighthearted look at the future of biological modeling:
Why are modeling approaches yet to be embraced in the mainstream of biology, in the way that they have been in other fields such as physics, chemistry and engineering? What would the ideal biological modeling platform look like? How could the connectivity of the internet be leveraged to play a central role in tackling the enormous challenge of biological complexity?


What synthetic biology can learn from programming languages

What is synthetic biology? In simple words, synthetic biology is nothing but putting engineering into biology. The engineered genetic toggle switch developed by Tim Gardner and Jim Collins is a good example of how engineering principles are driving the boat of synthetic biology. Researchers are now trying to adapt concepts developed in the areas of programming language design and software engineering for synthetic biology applications. A recent paper in PLoS Computational Biology shows how methods used by computer scientists to develop programming languages can be applied to DNA sequences. The authors report an attribute-grammar-based formalism to model the structure-function relationships in synthetic DNA sequences. An attribute grammar is constructed as an extension of a context-free grammar, and in computer science it is commonly used to translate the text of a program's source code, or its syntax tree, directly into computational operations or machine-level instructions. Further:
The translation of a gene network model from a genetic sequence is very similar to the compilation of the source code of a computer program into an object code that can be executed by a microprocessor (Figure 1). The first step consists in breaking down the DNA sequence into a series of genetic parts by a program called the lexer or scanner. Since the sequence of a part may be contained in the sequence of another part, the lexer is capable of backtracking to generate all the possible interpretations of the input DNA sequences as a series of parts. All possible combinations of parts generated by the lexer are sent to a second program called the parser to analyze if they are structurally consistent with the language syntax. The structure of a valid series of parts is represented by a parse tree (Figure 2). The semantic evaluation takes advantage of the parse tree to translate the DNA sequence into a different representation such as a chemical reaction network. The translation process requires attributes and semantic actions. Attributes are properties of individual genetic parts or combinations of parts. Semantic actions are associated with the grammar production rules. They specify how attributes are computed. Specifically, the translation process relies on the semantic actions associated with parse tree nodes to synthesize the attributes of the construct from the attributes of its child nodes, or to inherit the attributes from its parental node.

[Figure 1 and Figure 2 from the paper]
The proposed formalism can be quite useful for understanding how a set of genetic components relates to a function, with the potential to assemble new biological systems of a desired functionality or phenotype using BioBricks standard biological parts. It will be implemented in GenoCAD, a web-based tool used for the genetic engineering of cells.
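To give a feel for the compilation analogy in the quoted passage, here is a small illustrative Python sketch of the lexer/parser/semantic-action pipeline. It is my own simplification, not the paper's formalism or GenoCAD code, and every part name and attribute in it is hypothetical.

```python
# Toy part library: each part name maps to (category, attribute).
# Here the attribute is a made-up relative expression strength.
PART_LIBRARY = {
    "pLac":  ("promoter",   1.0),
    "rbs34": ("rbs",        0.8),
    "gfp":   ("gene",       1.0),
    "term1": ("terminator", 1.0),
}

# A very small "grammar": a valid cassette is promoter, rbs, gene, terminator.
CASSETTE_SYNTAX = ["promoter", "rbs", "gene", "terminator"]

def lex(design):
    """Lexer: turn a list of part names into (category, attribute) tokens."""
    try:
        return [PART_LIBRARY[name] for name in design]
    except KeyError as missing:
        raise ValueError(f"unknown part {missing}") from None

def parse(tokens):
    """Parser: check the token categories against the cassette syntax."""
    categories = [category for category, _ in tokens]
    if categories != CASSETTE_SYNTAX:
        raise ValueError(f"invalid cassette structure: {categories}")
    return tokens

def evaluate(tokens):
    """Semantic action: synthesize a construct-level attribute (here, a
    crude expression score) from the attributes of the child parts."""
    score = 1.0
    for _, attribute in tokens:
        score *= attribute
    return score

if __name__ == "__main__":
    design = ["pLac", "rbs34", "gfp", "term1"]
    tokens = parse(lex(design))
    print("predicted relative expression:", evaluate(tokens))
```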

