Having from the beginning been overblown, Big Data has now become oversold. Viktor Mayer-Schönberger and Kenneth Cukier entitled their recent book Big Data: A Revolution That Will Transform How We Live, Work, and Think. No very modest prospect is in view. It would help, of course, were there a precise definition of Big Data. An article published by Forbes in 2012 is illustrative of the conceptual confusion: “12 Big Data Definitions: What’s Yours?”1 In 2001, long before the term itself had entered the lexicon, Doug Laney described the challenges presented by large datasets as arising from the simultaneous expansion of three properties: volume, velocity, and variety.2 To this 3V’s model, IBM added veracity.3 Not content with 4V’s, others have expanded the framework still further, showing little regard for Laney’s original conception.4 Oracle defined Big Data as “the derivation of value from traditional relational database driven business decision making, augmented with new sources of unstructured data.”5 In a 2012 study, Intel marked the boundary of Big Data with reference to “organizations generating a median of 300 terabytes (TB) of data weekly.”6 Microsoft, on the other hand, defined Big Data as a group of techniques for analyzing large datasets.7 The data may be massive; it is the analysis that counts.
These are not so much definitions as a series of remarks. Mix and match: Big Data designates the storage and analysis of datasets at a scale where analysis makes new insights possible.
There is, no doubt, such a scale; and data analysis does make for new insights; but whether the insights are of any great value is another matter.
The bigness of Big Data is new in the history of science, but not the underlying idea.
What is empiricism if not the notion that science is in essence the collection and analysis of data? The business of science is observation. This is a thesis that may be traced back to the British empiricists: John Locke, Bishop Berkeley, David Hume.8 Empiricism is, as Hume understood, a surprisingly stern master. There is no necessary connection between causes and their effects, Hume argued. For all we know, the next man to swallow a mouthful of strychnine may find it nourishing. If causation is not necessary, neither is it discernible in experience. There is the swallowing of poison, and there is the dying that follows. The cause between them remains hidden.
It is the constant conjunction between events, Hume concluded, that provides the empiricist with all that he can know of their causal structure.
This is not a conclusion that invests the scientific agenda with optimism. Hume observed:
It must certainly be allowed, that nature has kept us at a great distance from all her secrets, and has afforded us only the knowledge of a few superficial qualities of objects; while she conceals from us those powers and principles on which the influence of those objects entirely depends.9
Although long dead, the British empiricists returned to life in the first half of the twentieth century. The logical positivists argued that what we have in the sciences is what Hume said we had: the manifold of experience, and, in retrospect, its record in data. A number of philosophers of science recognized the sterility of the ensuing discussion. Even simple phenomena, Carl Hempel argued, could not be explained in observational terms.10 Wood floats on water, but not iron. What could be simpler? But wood sometimes sinks and iron sometimes floats. It is only the concept of specific gravity that allows physicists to explain why a water-logged coffin will sink and a hollow metal sphere will float. The specific gravity of an object is like the cause between events: Neither can be grasped directly; both are behind the veil; and each is important in explanation.
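The missing theoretical term admits a compact statement. An object's specific gravity is the ratio of its average density to the density of water:

$$
s = \frac{\rho_{\text{object}}}{\rho_{\text{water}}}, \qquad s < 1 \;\Rightarrow\; \text{float}, \qquad s > 1 \;\Rightarrow\; \text{sink}.
$$

A hollow metal sphere encloses air, so its average density, and with it $s$, drops below one; a water-logged coffin's rises above one. Nothing in the bare pairing of materials and outcomes reveals this ratio.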
Modern neuroscience depends on computational models and simulations. Impressive computational resources are widely available. Such research environments are frequently referred to as brain observatories. The Allen Institute for Brain Science (AIBS) in Seattle is an example.11 Mindscope, a current AIBS project, aims to build realistic neuron-based models of animal and human brains.12 Brain observatories have made possible brain maps, or atlases. These are large, computationally generated models. No brain atlas would be complete without a connectome, a network diagram representing connections between neurons, neuron circuits, and brain regions.
The Brain Activity Map (BAM), sponsored by The Kavli Foundation in 2011, is an effort to develop dynamic maps of brain activity in animals such as mice, monkeys, and humans.13 This has been subsumed under the US Government’s Brain Research through Advancing Innovative Neurotechnologies (BRAIN) Initiative, a multi-year effort that began in 2014 and is expected to cost three billion dollars over the next ten years.14 The AIBS is also completing work on the Allen Mouse Brain Connectivity Atlas.15 Harvard Medical School has a Whole Brain Atlas, and the University of Michigan, a Human Brain Atlas.16
Researchers often refer proudly to the size of their datasets. Big Data has now become very big. The Human Connectome Project is anticipated to yield more than a petabyte (PB) of relatively high-quality imaging data.17 The AIBS has amassed 1.8 PB of data in researching and mapping the mouse brain’s 75 million neurons.18 Although impressive in scale, these datasets are dwarfed by the 30 PB of data generated annually by the Large Hadron Collider.19
A number of simulations have received significant media attention. Perhaps the most widely known is Henry Markram’s Blue Brain Project, named after the IBM Blue Gene supercomputer used to simulate the cortical columns of the rat brain.20 This project, although widely regarded as telling us little to nothing about how the brain causes behavior, has inspired a follow-up project known as the Human Brain Project (HBP).
Other notable large-scale brain simulations include Dharmendra Modha’s work under the auspices of the Systems of Neuromorphic Adaptive Plastic Scalable Electronics Project (SyNAPSE).21 Modha, manager of IBM’s Cognitive Computing Group, describes his simulation work as “Cognitive Computing via Synaptronics and Supercomputing.”22 The goal is to build a brain as cheaply and quickly as possible.23 Modha claims to have reverse-engineered the entire cortex of a cat, using an IBM Blue Gene/P supercomputer equipped with 147,456 CPUs and 144 TB of memory.24
As one might expect, Markram has been critical of these claims, arguing that Modha’s simulation is not biologically realistic.25 Fair is fair. While far more complex than Modha’s, even Markram’s work simplifies the behavior of actual neurons.
Launched in 2013 at the École polytechnique fédérale de Lausanne, the HBP has been approved for over a billion euros in funding over ten years by the European Union. The HBP is committed to a data-driven vision of neuroscience. Intelligence, the HBP asserts, emerges from the complex interaction of the brain’s neurons. No more is needed for the emergence of intelligence than a scheme in which neurons, as well as the synaptic connections between them, are simulated. Separate efforts aimed at discovering missing pieces of our knowledge are unnecessary, and in fact tend to fragment neuroscience and ultimately impede research efforts. In 2010, more than 60,000 papers with the word “brain” in the title were published.26
Neuroscience has become, in Markram’s view, hopelessly fragmented.27
The history of science is now moved to reclaim its own.
The statement that wood floats but that iron sinks says little about flotation, and the little that it says is wrong. Statements about neurons—are they any different? If so, how? If specific gravity is needed to explain flotation, what theoretical concepts might be needed to explain the brain? The correct answer is: Who knows? Nearly every neuroscientist, including Markram himself, admits that we do not yet have the theoretical framework required to make the volume of data meaningful.
What, in fact, is the multivariate neural code that the brain uses to generate complex cognitive, perceptual, and motor behavior in the first place? This code is related, somehow, to the behavior of neuron spiking, but we do not yet know how.
Neuron spikes, Markram suggests, are explained primarily by the behavior of the membrane as ion charge differences develop across it; temporal sequencing thus matters more than higher-order patterns such as oscillations. Once data have been generated, they are used to scale up a multivariate simulation. To simulate biologically realistic neurons and their connections, a model of neuron behavior is needed. Markram’s theories are based on the Hodgkin–Huxley equations, a set of non-linear differential equations describing the initiation and propagation of action potentials.28 Alan Lloyd Hodgkin and Andrew Huxley presented their model in the 1950s, using the common electrical circuit as a framework. The NEURON software developed during the 1990s at Yale, and used today in a range of neuroscience projects (including the HBP), is based on the H–H model.
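For reference, the model treats the membrane as a simple circuit: a capacitance in parallel with sodium, potassium, and leak conductances. In its standard form,

$$
C_m \frac{dV}{dt} = I_{\text{ext}} - \bar{g}_{\text{Na}}\, m^3 h \,(V - E_{\text{Na}}) - \bar{g}_{\text{K}}\, n^4 \,(V - E_{\text{K}}) - \bar{g}_{L}\,(V - E_{L}),
$$

where $V$ is the membrane potential, the $\bar{g}$ terms are maximal conductances, the $E$ terms are reversal potentials, and the gating variables $m$, $h$, and $n$ evolve according to first-order kinetics with voltage-dependent rates.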
The H–H model is not a complete explanation of neuron spike behavior, or even of ion channel behavior. Still, some advances have been made. Hodgkin and Huxley could not determine the temporal activation sequence of ion conduction. They could only approximate it. Research in the 1960s and 1970s was able partially to solve the problem by using the pore theory of membrane-spanning proteins. Aside from such relatively minor tweaks, however, it is the basic H–H model that remains dominant in neuroscience today.
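What the basic model amounts to in practice can be seen in a minimal sketch: a single-compartment Hodgkin–Huxley neuron integrated with forward Euler. The parameters and rate functions are the standard textbook values; the step current and crude spike count are illustrative choices, not anything drawn from NEURON or the HBP.

```python
import numpy as np

# Minimal single-compartment Hodgkin-Huxley neuron, integrated with forward Euler.
# Standard textbook parameters and rate functions (modern sign convention).

C_m = 1.0                               # membrane capacitance, uF/cm^2
g_Na, g_K, g_L = 120.0, 36.0, 0.3       # maximal conductances, mS/cm^2
E_Na, E_K, E_L = 50.0, -77.0, -54.387   # reversal potentials, mV

def alpha_m(V): return 0.1 * (V + 40.0) / (1.0 - np.exp(-(V + 40.0) / 10.0))
def beta_m(V):  return 4.0 * np.exp(-(V + 65.0) / 18.0)
def alpha_h(V): return 0.07 * np.exp(-(V + 65.0) / 20.0)
def beta_h(V):  return 1.0 / (1.0 + np.exp(-(V + 35.0) / 10.0))
def alpha_n(V): return 0.01 * (V + 55.0) / (1.0 - np.exp(-(V + 55.0) / 10.0))
def beta_n(V):  return 0.125 * np.exp(-(V + 65.0) / 80.0)

dt, T = 0.01, 50.0                      # time step and duration, ms
V, m, h, n = -65.0, 0.05, 0.6, 0.32     # approximate resting-state values
spikes = 0

for i in range(int(T / dt)):
    I_ext = 10.0 if 5.0 <= i * dt <= 45.0 else 0.0   # step current, uA/cm^2
    I_Na = g_Na * m**3 * h * (V - E_Na)
    I_K  = g_K * n**4 * (V - E_K)
    I_L  = g_L * (V - E_L)
    dV = (I_ext - I_Na - I_K - I_L) / C_m
    m += dt * (alpha_m(V) * (1 - m) - beta_m(V) * m)
    h += dt * (alpha_h(V) * (1 - h) - beta_h(V) * h)
    n += dt * (alpha_n(V) * (1 - n) - beta_n(V) * n)
    V_new = V + dt * dV
    if V < 0.0 <= V_new:                # count upward crossings of 0 mV as spikes
        spikes += 1
    V = V_new

print(f"spikes during a {T} ms run: {spikes}")
```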
In their initial research, Hodgkin and Huxley introduced first-order rate equations describing the probability that an ion gate is in an open state. Their equations depended on a number of parametric fudges. How they came to their specific fudges remained unexplained, and Hodgkin and Huxley acknowledged as much.
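The fudges in question are the voltage-dependent rate functions themselves. Each gating variable obeys first-order kinetics, with rates fitted to squid-axon voltage-clamp data rather than derived from any molecular theory. In one common modern parameterization, the potassium gating variable $n$ satisfies

$$
\frac{dn}{dt} = \alpha_n(V)\,(1 - n) - \beta_n(V)\,n, \qquad \alpha_n(V) = \frac{0.01\,(V + 55)}{1 - e^{-(V+55)/10}}, \qquad \beta_n(V) = 0.125\, e^{-(V+65)/80}.
$$

The exponents and constants were chosen because they reproduced the recordings, not because anything in the underlying physiology demanded them.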
A typical HBP protocol thus has the following form:
- A research paper is scanned for its numerical parameters: applied stimulus protocols, reversal potentials of ion channels, and inactivation kinetics.
- Since the equations required to model H–H from empirically discovered parameters are often missing from the literature, curve digitization is used to recreate them. This is a technique for converting graphical images back into numerical form. Given a standard activation curve plotted in a Cartesian coordinate system, curve digitization extracts the data points, from which a function re-creating the curve can be fitted (a sketch of this fitting step follows the list). An open-source package, Engauge Digitizer, is currently used for this purpose.
- After digitization and curve fitting, another software package, GenericFit, is used to simulate the H–H model.
- Since the initial simulation is usually wrong, parameters must be readjusted. The computer model is made to fit the experimental results by tweaking the numbers extracted during parameter identification, a process known as double fudging.
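A minimal sketch of the fitting step referenced above, assuming hypothetical digitized points and a Boltzmann-shaped activation curve; the packages named in the protocol are as reported, but the code below uses only generic scientific-Python tools.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical points recovered by digitizing a published steady-state
# activation curve (membrane potential in mV, activation between 0 and 1).
V_pts   = np.array([-80.0, -60.0, -40.0, -20.0, 0.0, 20.0])
act_pts = np.array([0.02, 0.10, 0.38, 0.75, 0.93, 0.98])

# Fit a Boltzmann function, a common functional form for activation curves.
def boltzmann(V, V_half, k):
    return 1.0 / (1.0 + np.exp(-(V - V_half) / k))

(V_half, k), _ = curve_fit(boltzmann, V_pts, act_pts, p0=(-30.0, 10.0))
print(f"fitted V_half = {V_half:.1f} mV, slope factor k = {k:.1f} mV")

# The fitted parameters feed the gating model; if the resulting simulation
# misses the published traces, they are nudged again by hand -- the
# "double fudging" described above.
```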
This is not a scheme calculated to inspire confidence, if only because errors introduced at the level of individual neurons are propagated to groups and circuits of neurons in the downstream simulation.
Then there is Predictive Neuroscience (PN), an approach used by researchers working with the HBP to simulate connections between neurons.29 Unknown synaptic links are inferred from known links using inductive machine-learning techniques. Both traditional neural networks and the more powerful convolutional neural networks are used.
Markram has shown that machine-learning functions can correctly predict previously unknown connections in the cortical columns of the rat brain. An analysis using the standard F-measure yields a score of nearly 80 percent for this approach.30 While it represents an advance in applying machine learning to biological datasets, the approach has an average error rate of two in ten. This has ominous implications for any strategy for reverse-engineering the human brain. But this concern aside, it is the inductive assumptions inherent in such approaches that are of interest here.
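The F-measure in question is the harmonic mean of precision ($P$) and recall ($R$):

$$
F_1 = \frac{2\,P\,R}{P + R}.
$$

With precision and recall of comparable size, a score near 0.8 means that roughly two of every ten predicted or actual connections are wrong or missed, which is the error rate cited above.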
Machine-learning approaches begin by providing the model with a number of known training instances. Each experimentally verified neuron might constitute an instance. Certain parameters are specified, and the model then learns the connection points between neurons by means of its training instances. Numerical optimization techniques force the model to converge on parametric values that best fit the training instances. Once trained, the model can be run on previously unseen examples. Its predictions of connection points on the test, or production, dataset are then evaluated against experimentally verified connections held out from training.
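A schematic version of this workflow, with synthetic features standing in for whatever the real pipeline measures about a pair of neurons; the classifier, feature count, and planted labeling rule below are illustrative assumptions, not the HBP's actual setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Synthetic stand-in for a connectomics training set: each row describes a
# pair of neurons with hypothetical features (cell types, laminar positions,
# distance, ...); the label records whether a connection was observed.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 8))                             # 2000 pairs, 8 features
w = rng.normal(size=8)                                     # planted, unknown-to-us rule
y = (X @ w + 0.5 * rng.normal(size=2000) > 0).astype(int)  # noisy binary labels

# Hold out pairs the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Numerical optimization drives the model toward parameters that best
# reproduce the known (training) connections.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

# Score predicted connections on the held-out pairs with the F-measure.
print(f"F1 on held-out pairs: {f1_score(y_test, clf.predict(X_test)):.2f}")
```

Everything the model can say about the held-out pairs is interpolated from the labeled ones; that is the inductive assumption at issue in what follows.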
Machine-learning techniques are inductive. Pure induction is conservative in the sense that it applies what we already know to new instances. But when the principles underlying a set of observations are incomplete or largely unknown, the method fails.
The HBP’s embrace of PN is particularly susceptible to these dangers. We lack a complete, or even adequate, understanding of the underlying distribution of the data generated about the brain. If this picture seems troubling, a broader view of large-scale brain simulations and their reliance on data-driven methods is more so.
Consider the concept of emergence—the popular explanation for the claim that data analysis at the level of neurons is sufficient to explain cognition, or perception, or awareness, or consciousness.
Given enough data, something new is bound to appear.
Neuroscientists have embraced simulations that, as Markram has noted, scale up, hoping that the scaling up will by itself yield insights. And if scaling the data up is not likely to yield anything of interest, then why not scale up the number of researchers? The key thing, evidently, is to scale up. Both Markram and Sean Hill, the director of the International Neuroinformatics Coordinating Facility, are proponents of swarm science. Thus Hill:
One goal of the Human Brain Project is to trigger and facilitate a new wave of global collaboration in neuroscience. … If successful in engaging the community, the aim is to have swarms of scientists attacking the major challenges of understanding the brain and its disorders together—in an environment where every individual will receive credit for his or her contribution.31
The swarm metaphor is a logical endpoint, of sorts, for Big Data. Data collection and computer analysis assume a central role. Research and experimentation recede from view.
Nothing in this is new. Similar notions about the wisdom of crowds have been the hallmarks of passing trends such as Web 2.0.32 Group-centered approaches are valuable if optimizing known quantities is paramount, but they are singularly ineffective otherwise.
Surely this is something that we knew or should have known.
The broader neuroscience community, particularly in Europe, has been sharply critical of the HBP. In July 2014, a petition with more than 800 signatures expressed concern about the course of the project, noting that a second round of funding, “unfortunately, reflected an even further narrowing of goals and funding allocation, including the removal of an entire neuroscience subproject and the consequent deletion of 18 additional laboratories.”33 Should a formal review of the HBP prove ineffective, the signatories called for:
[T]he European Commission and Member States to reallocate the funding currently allocated to the HBP core and partnering projects to broad neuroscience-directed funding [emphasis added] to meet the original goals of the HBP—understanding brain function and its effect on society.34
The promise of neuroscience, and indeed of science itself, rests on advances in our tools and instruments, including our computing resources. Such instruments are handmaidens, not headmasters.
Big Data in science is something already well understood. It is data. Data is observation. Observation is experience. And experience without theory, as Immanuel Kant said, is blind.