Gladwell's rules as guidelines for better omics data management

The universal application of high-throughput omics technologies has enabled scientists to measure tens of thousands of data points in a single experiment. As a result, the scientific world has become deluged with data, with great implications for the way science will be done in the coming years. There is general accord that science has increasingly turned into a data management problem. Put the technical aspects of scientific data management aside and ask: can we draw useful and practically relevant conclusions from our past experiences in scientific data management, and what makes some data management strategies/efforts more popular than others? I am talking about appropriate philosophical foundations for scientific data and knowledge management. Although there is no universal agreement on the best way to manage heterogeneous scientific data that keeps evolving over time, simplicity and abstraction have always appealed to data pundits. Generally, scientific data management strategies have not been conceived from a social perspective; they are almost always technology driven. You may call it by any name, Big Data, Open Data, Linked Data, or you may be in love with XML or RDF, but I am not convinced any of these will be the ultimate solution to this problem.

So is there any social perspective on scientific data management at all? Well, I guess so. Masanori Arita has recently published an interesting review article on what metabolomics can learn from genomics and proteomics. What I really liked about this article is Masanori's analogy between omics data management and Gladwell's rules. Many of us will be aware of Malcolm Gladwell's highly praised book The Tipping Point: How Little Things Can Make a Big Difference, where, based on sociological observation, Gladwell describes the three rules of social epidemics: the law of the few, the stickiness factor, and the power of context.
A tipping point is the critical point at which the momentum for change becomes unstoppable, or viral. From an omics data management perspective, let us consider the first and most important rule, the law of the few, which states that a small number of influential people, mavens (information specialists), salesmen (charismatic persuaders), and connectors (people with a truly extraordinary knack for connecting with others), play a crucial role in staging the tipping point. What Masanori really wanted to stage was viral success for metabolomics as a fundamental data-driven science like its counterparts, genomics and proteomics. But there is a big message here for every other omics field; Masanori draws three simple conclusions:
  1. Mavens: large public databases with focus on information quantity
  2. Salesmen: data appeal by simple formats/standards
  3. Connectors: wiki-based community/knowledge portals
The first point is large public databases with an early focus on information quantity. Data must be readily and freely accessible through large, stable public repositories such as GenBank and Swiss-Prot. Information quantity is no longer an issue in the genomics and proteomics communities, although they suffer from several other problems, such as a high proportion of unannotated data (out of some 4000 bacterial genome projects, only a handful are publicly well annotated), high error rates (20 to 45% for high-throughput protein–protein interaction data), and low information content. On a smaller note, stability needs long-term funding, and I guess most large public repositories are quite comfortable on that front.
The second point is data appeal through simple formats/standards. Data formats/standards have played an important role in the success and popularity of certain research areas. Masanori notes that
In biology, the readability of raw data affects popularity. In fact, metabolism, the primary research topic in metabolomics, is notorious for its incomprehensibility and many researchers stayed away from metabolic networks containing lengthy structural and stoichiometric information. The KEGG database gained popularity for its oversimplified representation of metabolic networks: each metabolite is represented as a node without structure, and each reaction as a binary relationship without stoichiometry. Although its oversimplification resulted in considerable misunderstandings, the KEGG database boosted the graph-oriented analysis of metabolic pathways, and consequently, it awakened the interest of the research community in metabolism. Many popular databases containing gene expression or protein–protein interaction data also use simple notations.
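The graph-oriented analysis that this simplification enabled is easy to sketch. In KEGG-style form, a pathway reduces to a plain adjacency list over which standard graph algorithms run directly; the toy glycolysis fragment below uses illustrative metabolite names and is not taken from KEGG.

```python
# Minimal sketch of KEGG-style simplification: each metabolite is a bare
# node, each reaction a directed edge; chemical structures and
# stoichiometry are deliberately dropped.
from collections import deque

# Toy pathway fragment (illustrative only; real pathways are much larger).
edges = [
    ("glucose", "glucose-6-phosphate"),
    ("glucose-6-phosphate", "fructose-6-phosphate"),
    ("fructose-6-phosphate", "fructose-1,6-bisphosphate"),
    ("fructose-1,6-bisphosphate", "pyruvate"),
]

# Build an adjacency list: metabolite -> set of products reachable in one reaction.
graph = {}
for substrate, product in edges:
    graph.setdefault(substrate, set()).add(product)
    graph.setdefault(product, set())

def path_length(graph, start, goal):
    """Breadth-first search: minimum number of reaction steps from start to goal."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None  # goal not reachable from start

print(path_length(graph, "glucose", "pyruvate"))  # 4 reaction steps
```

The point of the simplification is exactly this: once structure and stoichiometry are gone, decades of off-the-shelf graph algorithms become applicable to metabolism with almost no modelling effort.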
The third point is the use of wikis as a major platform for hosting biological information. As a matter of fact, major biology databases are in the process of migrating to wiki-based sites, and the use of wikis is gaining momentum. Further,
We, as scientists, should pay more attention to the evolution of web information because wiki embodies the quintessence of all sciences: the acquisition of knowledge through open discussion.
Openness is not the only reason to favor wikis over traditional databases. The situation is quite complicated for curated and annotated non-wiki databases, where the evolution of data remains intractable. Whether a database entry was updated, why it was updated, and whether the update was discussed in an appropriate forum beforehand all remain gray areas for non-wiki sites. For instance, take the example of the BioModels database, which exposes systems biology models as releases: each release contains a few new models, but it also includes old models that may or may not have been modified since the previous release. In the current BioModels implementation there is no clear mechanism to track the changes in the evolution of a given model. From a user perspective, tracking revisions and edits is really important.

The other issue is whether or not curation projects have a backup mechanism in place. I am raising these issues because funding sources for biological databases are quite limited, which means sooner or later some or many projects will run out of funds (a recent example is the Arabidopsis resource, TAIR). Rather than asking funding agencies for a sustainable funding model for biological databases, I would suggest that project managers be asked to include additional details in their project proposals, such as how they will keep the data stream alive if funding runs out in the first place. I think there are several options to keep the data stream alive for short-lived projects: just dump the whole database in SourceForge or any other repository. The best option is to make Wikipedia your new home. In all fairness, I am not against long-term funding of database projects, but this should be a goal-oriented, merit-based decision, and even then very few will succeed.
Not everything is well with the wiki option either, such as the absence of incentives for participating in crowdsourcing efforts. But there is a better chance, and hope.
