Friday, December 20, 2013

Data loss?

A recent report in Current Biology indicates that a large percentage of the data supporting scientific publications are lost, owing to lack of access to the original authors and to obsolete data storage techniques. The data underlying scientific publications are apparently lost at an astonishing rate of 17% per year. Current expectations of the life of data are significantly different from what was considered acceptable 20 years ago.
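
To get a feel for what a 17% annual loss adds up to, here is a minimal sketch in Python. It assumes, purely for illustration, that the same fraction of the surviving data is lost every year; the function names and the constant-rate model are my own assumptions, not something taken from the report.

```python
import math


def surviving_fraction(annual_loss_rate, years):
    """Fraction of data still available after `years`.

    Assumption: a constant fraction of the remaining data is lost each year.
    """
    return (1.0 - annual_loss_rate) ** years


def half_life_years(annual_loss_rate):
    """Years until half of the data are gone, under the same assumption."""
    return math.log(0.5) / math.log(1.0 - annual_loss_rate)


if __name__ == "__main__":
    rate = 0.17  # the 17% per year figure quoted above
    for years in (1, 5, 10, 20):
        remaining = surviving_fraction(rate, years)
        print(f"after {years:2d} years: {remaining:.1%} of the data remain")
    print(f"half-life: {half_life_years(rate):.1f} years")
```

Under that assumption, less than 3% of the data would still be around after 20 years, and half would be gone in under four.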

What is an acceptable life span for data today? Do data remain relevant forever? Should data have an expiry date? Data, as we all know, are the raw material for generating insight, but that does not necessarily mean that they should exist forever. In a world of exponentially increasing data, the challenge is to extract whatever information they contain quickly and discard the rest. Storing data forever is likely to create problems in many different dimensions.

The basic notion that more is better is not at all true for data. Scientific experiments such as the LHC create data at such a rate that they are virtually indistinguishable from random noise, unless one is looking for something specific. Large companies create so much data that many are coming to a grinding halt. The star of the data revolution has been googling its way into such endeavors as creating a human brain through artificial neural nets and curing death, on the premise that there is nothing one cannot accomplish if enough data are available. Based on the artificial brain’s proclivity to seek cat videos on the internet, they may be right on one count but not on the others.

Data are very close to random noise. More of them are unlikely to solve problems.