Web 2.0 tools, open access, and CC licenses are helping to accelerate scientific discovery.
It was one of those embarrassing episodes in science: Two sets of researchers published papers in a German organic chemistry journal, Angewandte Chemie, announcing that they had synthesized a strange new substance with “12-membered rings.” Then, as blogger and chemist Derek Lowe tells the story, “Professor Manfred Cristl of Wurzburg, who apparently knows his pyridinium chemistry pretty well, recognized this as an old way to make further pyridinium salts, not funky twelve-membered rings. He recounts how over the last couple of months he exchanged awkward emails with the two sets of authors, pointing out that they seem to have rediscovered a 100-year-old reaction. . . .”
In the Internet age, people generally assume that these kinds of things can’t happen. All you have to do is run a Web search for “pyridinium,” right? But as scientists in every field are discovering, the existence of some shard of highly specialized knowledge does not necessarily mean that it can be located or understood. After all, a Google search for “pyridinium” turns up 393,000 results. And even peer reviewers for journals (who may have been partly at fault in this instance) have the same problem as any researcher: the unfathomable vastness of the scientific and technical literature makes it difficult to know what humankind has already discovered.
Paradoxically, even though academic science played the central role in incubating the Internet (in conjunction with the military), it has not fared very well in developing it to advance research. Most search engines are too crude. Journal articles can be expensive and inaccessible. They do not link to relevant Web resources or invite reader comment. Nor do they contain metadata to facilitate computer-based searches, collaborative filtering, and text mining. Scientific databases are plentiful but often incompatible with one another, preventing researchers from exploring new lines of inquiry. Lab researchers who need to share physical specimens still have to shuffle papers through a bureaucratic maze and negotiate with lawyers, without the help of eBay- or Craigslist-like intermediaries.
“The World Wide Web was designed in a scientific laboratory to facilitate access to scientific knowledge,” observed Duke law professor James Boyle in 2007. “In every other area of life — commercial, social networking, pornography — it has been a smashing success. But in the world of science itself? With the virtues of the open Web all around us, we have proceeded to build an endless set of walled gardens, something that looks a lot like Compuserv or Minitel and very little like a world wide web for science.”
Therein lies a fascinating, complicated story. To be sure, various scientific bodies have made great progress in recent years in adapting the principles of free software, free culture, and Web 2.0 applications to their research. Open-access journals, institutional repositories, specialty wikis, new platforms for collaborative research, new metatagging systems: all are moving forward in different, fitful ways. Yet, for a field of inquiry that has long honored the ethic of sharing and “standing on the shoulders of giants,” academic science has lagged behind most other sectors.
Part of the problem is the very nature of scientific knowledge. While the conventional Web works fairly well for simple kinds of commerce and social purposes, the Research Web for science requires a more fine-grained, deliberately crafted structure.
Much as scientists would like to build new types of Internet-based commons, they have quickly run up against a thicket of interrelated problems: overly broad copyright and patent limitations; access and usage restrictions by commercial journal publishers and database owners; and university rules that limit how cell lines, test animals, bioassays, and other research tools may be shared. In a sense, scientists and universities face a classic collective-action problem. Everyone would clearly be better off if a more efficient infrastructure and enlightened social ethic could be adopted — but few single players have the resources, incentive, or stature to buck the prevailing order. There is no critical mass for instigating a new platform for scientific inquiry and “knowledge management.”
Like so many other sectors confronting the Great Value Shift, science in the late 1990s found itself caught in a riptide. The proprietarian ethic of copyright and patent law was intensifying (as we saw in chapter 2), spurring scientists and universities to claim private ownership in knowledge that was previously treated as a shared resource.
Perhaps the most salient example of the power of open science was the Human Genome Project (HGP), a publicly funded research project to map the 3 billion base pairs of the human genome. Many other scientific projects have been attracted by the stunning efficacy and efficiency of the open research model. For example, the HapMap project is a government-supported research effort to map variations in the human genome that occur in certain clusters, or haplotypes. There is also the SNP Consortium, a public-private partnership seeking to identify single-nucleotide polymorphisms (SNPs) that may be used to identify genetic sources of disease. Both projects use licenses that put the genomic data into the public domain.
A 2008 report by the Committee for Economic Development identified a number of other notable open research projects.
There has even been the emergence of open-source biotechnology, which is applying the principles of free software development to agricultural biotech and pharmaceutical development.
Sociologist Robert Merton is often credited with identifying the social values and norms that make science such a creative, productive enterprise. In a notable 1942 essay, Merton described scientific knowledge as “common property” that depends critically upon an open, ethical, peer-driven process.
Although scientific knowledge eventually becomes publicly available, it usually flows in semi-restricted ways, at least initially, because scientists usually like to claim personal credit for their discoveries. They may refuse to share their latest research lest a rival team of scientists gain a competitive advantage. They may wish to claim patent rights in their discoveries.
So scientific knowledge is not born into the public sphere, but there is a strong presumption that it ought to be treated as a shared resource as quickly as possible. As law scholar Robert Merges noted in 1996, “Science is not so much given freely to the public as shared under a largely implicit code of conduct among a more or less well identified circle of similarly situated scientists. In other words . . . science is more like a limited-access commons than a truly open public domain.”
As Web 2.0 innovations have demonstrated the power of the Great Value Shift, the convergence of open source, open access, and open science has steadily gained momentum.
But despite its early interest in making the Web more research-friendly, Creative Commons realized that science is a special culture unto itself, one that has so many major players and niche variations that it would be foolhardy for an upstart nonprofit to try to engage with it. So in 2002 Creative Commons shelved its ambitions to grapple with science as a commons, and focused instead on artistic and cultural sectors. By January 2005, however, the success of the CC licenses emboldened the organization to revisit its initial idea. As a result of deep personal engagement by several Creative Commons board members — computer scientist Hal Abelson, law professors James Boyle and Michael Carroll, and film producer Eric Saltzman — Creative Commons decided to launch a spin-off project, Science Commons. The new initiative would work closely with scientific disciplines and organizations to try to build what it now calls “the Research Web.”
Science Commons aims to redesign the “information space” — the technologies, legal rules, institutional practices, and social norms — so that researchers can more easily share their articles, datasets, and other resources. The idea is to reimagine and reinvent the “cognitive infrastructures” that are so critical to scientific inquiry. Dismayed by the pressures exerted by commercial journal publishers, open-access publishing advocate Jean-Claude Guédon has called on librarians to become “epistemological engineers.”
If transaction costs could be overcome, scientists could vastly accelerate their research cycles. They could seek answers in unfamiliar bodies of research literature. They could avoid duplicating other people’s flawed research strategies. They could formulate more imaginative hypotheses and test them more rapidly. They could benefit from a broader, more robust conversation (as in free software — “with enough eyes, all bugs are shallow”) and use computer networks to augment and accelerate the entire scientific process.
That is the vision of open science that Science Commons wanted to address in 2005. It recognized that science is a large, sprawling world of many institutional stakeholders controlling vast sums of money driving incommensurate agendas. In such a milieu, it is not easy to redesign some of the most basic processes and norms for conducting research. Science Commons nonetheless believed it could play a constructive role as a catalyst.
It was fortunate to have some deep expertise not just from its board members, but from two Nobel Prize winners on its scientific advisory panel (Sir John Sulston and Joshua Lederberg) and several noted scholars (patent scholar Arti Rai, innovation economist Paul David, and open-access publishing expert Michael B. Eisen). The director of Science Commons, John Wilbanks, brought a rare mix of talents and connections. He was once a software engineer at the World Wide Web Consortium, specializing in the Semantic Web; he had founded and run a company dealing in bioinformatics and artificial intelligence; he had worked for a member of Congress; and he was formerly assistant director of the Berkman Center at Harvard Law School.
After obtaining free office space at MIT, Wilbanks set off to instigate change within the scientific world — and then get out of the way. “We’re designing Science Commons to outstrip ourselves,” Wilbanks told me. “We don’t want to control any of this; we’re designing it to be decentralized. If we try to control it, we’ll fail.”
With a staff of seven and a budget of only $800,000 in 2008, Science Commons is not an ocean liner like the National Academy of Sciences or the National Science Foundation; it’s more of a tugboat. Its strategic interventions try to nudge the big players into new trajectories. It is unencumbered by bureaucracy and entrenched stakeholders, yet it has the expertise, via Creative Commons, to develop standard licensing agreements for disparate communities. It knows how to craft legal solutions that can work with technology and be understood by nonlawyers.
In 2006, Science Commons embarked upon three “proof of concept” projects that it hopes will be models for other scientific fields. The first initiative, the Scholar’s Copyright Project, aspires to give scientists the “freedom to archive and reuse scholarly works on the Internet.” It is also seeking to make the vast quantities of data on computerized databases more accessible and interoperable, as a way to advance scientific discovery and innovation.
A second project, the Neurocommons, is a bold experiment that aims to use the Semantic Web to make a sprawling body of neurological research on the Web more accessible. The project is developing a new kind of Internet platform so that researchers will be able to do sophisticated searches of neuroscience-related journal articles and explore datasets across multiple databases.
Finally, Science Commons is trying to make it cheaper and easier for researchers to share physical materials such as genes, proteins, chemicals, tissues, model animals, and reagents, which is currently a cumbersome process. The Biological Materials Transfer Project resembles an attempt to convert the pony express into a kind of Federal Express, so that researchers can use an integrated electronic data system to obtain lab materials with a minimum of legal complications and logistical delays.
In many instances, Science Commons has been a newcomer to reform initiatives already under way to build open repositories of scientific literature or data. One of the most significant is the open-access publishing movement, which has been a diverse, flourishing effort in academic circles since the 1990s. It is useful to review the history of the open access (OA) movement because it has been an important pacesetter and inspiration for the open-science ethic.
The open-access movement has a fairly simple goal: to get the scientific record online and available to everyone. It regards this task as one of the most fundamental challenges in science. Open-access publishing generally consists of two modes of digital access — open-access archives (or “repositories”) and open-access journals. In both instances, the publisher or host institution pays the upfront costs of putting material on the Web so that Internet users can access the literature at no charge.~[* “Open access” can be a confusing term. In the context of a rivalrous, depletable natural resource like timber or grazing land, an open-access regime means that anyone can use and appropriate the resource, resulting in its overexploitation and ruin. An open-access regime is not the same as a commons, however, because a commons does have rules, boundaries, sanctions against free riders, etc., to govern the resource. However, in the context of an infinite, nonrivalrous resource like information, which can be copied and distributed at virtually no cost, an open-access regime does not result in overexploitation of the resource. For this reason, open access in an Internet context is often conflated with the commons — even though “open access,” in a natural resource context, tends to produce very different outcomes.]~
The appeal of OA publishing stems from the Great Value Shift described in chapter 5. “OA owes its origin and part of its deep appeal to the fact that publishing to the Internet permits both wider dissemination and lower costs than any previous form of publishing,” writes Peter Suber, author of Open Access News and a leading champion of OA.
Just as free software and music downloads have disrupted their respective industries, so OA publishing has not been a welcome development among large academic publishers such as Elsevier, Springer, Kluwer, and Wiley. Online publishing usually costs much less than traditional print publishing, and it allows authors to retain control over their copyrights. Both are strong incentives for disciplines and universities to start up their own OA journals. In addition, OA publishing makes it easier for research to circulate, and for authors to reach larger readerships. This not only augments the practical goals of science, it bolsters the reputation system and open ethic that science depends upon.
Commercial publishers have historically emphasized their shared interests with scholars and scientists, and the system was amicable and symbiotic. Academics would produce new work, validate its quality through peer review, and then, in most cases, give the work to publishers at no charge. Publishers shouldered the expense of editorial production, distribution, and marketing and reaped the bulk of revenues generated. The arrangement worked fairly well for everyone until journal prices began to rise in the early 1970s. Then, as subscription rates continued to soar, placing unbearable burdens on university libraries in the 1990s, the Internet facilitated an extremely attractive alternative: open-access journals. Suddenly, conventional business models for scholarly publishing had a serious rival, one that shifts the balance of power back to scientists and their professional communities.
Publishers have long insisted upon acquiring the copyright of journal articles and treating them as “works for hire.” This transfer of ownership enables the publisher, not the author, to determine how a work may circulate. Access to an article can then be limited by the subscription price for a journal, the licensing fees for online access, and pay-per-view fees for viewing an individual article. Publishers may also limit the reuse, republication, and general circulation of an article by charging high subscription or licensing fees, or by using digital rights management. If a university cannot afford the journal, or if a scholar cannot afford to buy individual articles, research into a given topic is effectively stymied.
Open-access champion John Willinsky notes, “The publishing economy of scholarly journals is dominated by a rather perverse property relation, in which the last investor in the research production chain — consisting of university, researcher, funding agency and publisher — owns the resulting work outright through a very small investment in relation to the work’s overall cost and value.”
Not surprisingly, many commercial publishers regard OA publishing as a disruptive threat. It can, after all, subvert existing revenue models for scholarly publishing. This does not mean that OA publishing cannot support a viable business model. Much of OA publishing is sustained through “author-side payments” to publishers. In certain fields that are funded by research grants, such as biomedicine, grant makers fold publishing payments into their grants so that the research can be made permanently available in open-access journals. A leading commercial publisher, BioMed Central, now publishes over 140 OA journals in this manner. Hindawi Publishing Corporation, based in Cairo, Egypt, publishes more than one hundred OA journals and turns a profit. And Medknow Publications, based in Mumbai, India, is also profitable as a publisher of more than forty OA journals.
It remains an open question whether the OA business model will work in fields where little research is directly funded (and thus upfront payments are not easily made). As Suber reports, “There are hundreds of OA journals in the humanities, but very, very few of them charge a fee on the author’s side; most of them have institutional subsidies from a university say, or a learned society.”
The tension between commercial publishers and academic authors has intensified over the past decade, fueling interest in OA alternatives. The most salient point of tension is the so-called “serials crisis.” From 1986 to 2006, libraries that belong to the Association of Research Libraries saw the cost of serial journals rise 321 percent, or about 7.5 percent a year for twenty consecutive years.
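The annual figure follows from compounding: a 321 percent rise means prices ended at 4.21 times their starting level, and the twentieth root of 4.21 is roughly 1.075. A minimal sketch of that check, using only the two numbers quoted above (the function name is illustrative, not drawn from any source):

```python
def annual_growth_rate(total_rise_percent: float, years: int) -> float:
    """Return the compound annual growth rate, in percent, implied by a total rise."""
    # A rise of 321 percent means the final price is 4.21 times the initial price.
    growth_factor = 1 + total_rise_percent / 100
    # Spread that factor evenly over the period by taking the years-th root.
    return (growth_factor ** (1 / years) - 1) * 100


rate = annual_growth_rate(321, 20)
print(f"{rate:.1f} percent a year")  # prints "7.5 percent a year"
```

Dividing 321 by 20 would suggest about 16 percent a year; the compounded rate is much lower because each year's increase builds on the previous year's price.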
As journal prices have risen, the appeal of OA publishing has only intensified. Unfortunately, migrating to OA journals is not simply an economic issue. Within academia, the reputation of a journal is deeply entwined with promotion and tenure decisions. A scientist who publishes an article in Cell or Nature earns far more prestige than she might for publishing in a little-known OA journal.
So while publishing in OA journals may be economically attractive, it flouts the institutional traditions and social habits that scientists have come to rely on for evaluating scientific achievement. The OA movement’s challenge has been to document how OA models can help a university, and so it has collaborated with university administrators to showcase exemplary successes and work out new revenue models. It is urging promotion and tenure committees, for example, to modify their criteria to stop discriminating against new journals just because they are new, and hence to stop discriminating against OA journals (which are all new). Much of this work has fallen to key OA leaders like the Open Society Institute, the Hewlett Foundation, Mellon Foundation and the library-oriented SPARC (Scholarly Publishing and Academic Resources Coalition) as well as individuals such as John Willinsky, Jean-Claude Guédon, Stevan Harnad, and Peter Suber.
One of the first major salvos of the movement came in 2000, when biomedical scientists Harold E. Varmus, Patrick O. Brown, and Michael B. Eisen called on scientific publishers to make their literature available through free online public archives such as the U.S. National Library of Medicine’s PubMed Central. Despite garnering support from nearly 34,000 scientists in 180 countries, the measure did not stimulate the change sought. It did alert the scientific world, governments, and publishers about the virtues of OA publishing, however, and galvanized scientists to explore next steps.
At the time, a number of free, online peer-reviewed journals and free online archives were under way.
Creative Commons licenses have been critical tools in the evolution of OA publishing because they enable scientists and scholars to authorize in advance the sharing, copying, and reuse of their work, in a manner compatible with the BBB (Budapest, Bethesda, Berlin) definition of open access. The Attribution (BY) and Attribution-Non-Commercial (BY-NC) licenses are frequently used; many OA advocates regard the Attribution license as the preferred choice. The protocols for “metadata harvesting” issued by the Open Archives Initiative are another useful set of tools in OA publishing. When adopted by an OA journal, these standardized protocols help users more easily find research materials without knowing in advance which archives they reside in, or what they contain.
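To make the idea of metadata harvesting concrete, here is a minimal sketch in Python of what a harvester does under the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH): it sends the same `ListRecords` request to every archive and reads Dublin Core fields from the XML that comes back. The repository address and the sample response are invented for illustration, and the sketch parses a canned response rather than contacting a real archive.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical repository endpoint, used only for illustration.
BASE_URL = "https://repository.example.org/oai"

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented, heavily trimmed response in the shape an archive would return.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample article</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build a ListRecords request; the same verb works against any compliant archive."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # optional: only records changed since this date
    return f"{base_url}?{urlencode(params)}"


def extract_titles(response_xml):
    """Pull every Dublin Core title out of a ListRecords response."""
    root = ET.fromstring(response_xml.strip())
    return [element.text for element in root.iterfind(".//dc:title", NS)]


print(list_records_url(BASE_URL))
print(extract_titles(SAMPLE_RESPONSE))
```

Because the request and the record format are standardized, a search service can loop over many archives with the same two functions, which is why a reader need not know in advance where an article is deposited.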
There is no question that OA is transforming the market for scholarly publishing, especially as pioneering models develop. The Public Library of Science announced its first two open-access journals in December 2002. The journals represented a bold, high-profile challenge by highly respected scientists to the subscription-based model that has long dominated scientific publishing. Although Elsevier and other publishers scoffed at the economic model, the project has expanded and now publishes seven OA journals, for biology, computational biology, genetics, pathogens, and neglected tropical diseases, among others.
OA received another big boost in 2004 when the National Institutes of Health proposed that all NIH-funded research be made available for free one year after its publication in a commercial journal. The $28 billion that the NIH spends on research each year (more than the domestic budget of 142 nations!) results in about 65,000 peer-reviewed articles, or 178 every day. Unfortunately, commercial journal publishers succeeded in making the proposed OA policy voluntary. The battle continued in Congress, but it became clear that the voluntary approach was not working. Only 4 percent of researchers published their work under OA standards, largely because busy, working scientists did not consider it a priority and their publishers were not especially eager to help. So Congress in December 2007 required NIH to mandate open access for its research within a year of publication.
What may sound like an arcane policy battle in fact has serious implications for ordinary Americans. The breast cancer patient seeking the best peer-reviewed articles online, or the family of a person with Huntington’s disease, can clearly benefit if they can acquire, for free, the latest medical research. Scientists, journalists, health-care workers, physicians, patients, and many others cannot access the vast literature of publicly funded scientific knowledge because of high subscription rates or per-article fees. A freely available body of online literature is the best, most efficient way to help science generate more reliable answers, new discoveries, and commercial innovations.
While large publishers continue to dominate the journal market, OA publishing has made significant advances in recent years. In June 2008, the Directory of Open Access Journals listed more than 3,400 open-access journals containing 188,803 articles. In some fields such as biology and bioinformatics, OA journals are among the top-cited journals. In fact, this is one of the great advantages of OA literature. In the networked environment, articles published in OA journals are more likely to be discovered by others and cited, which enhances the so-called impact of an article and the reputation of an author.
Although journals may or may not choose to honor OA principles, any scientist, as the copyright holder of his articles, can choose to “self-archive” his work under open-access terms. But commercial publishers generally don’t like to cede certain rights, and authors usually don’t know what rights to ask for, how to assert them in legal language, and how to negotiate with publishers. So it is difficult for most academics to assert their real preferences for open access. To help make things simpler, SPARC and MIT developed what is called an “author’s addendum.” It is a standard legal contract that authors can attach to their publishing contracts, in which they reserve certain key rights to publish their works in OA-compliant ways.
In an attempt to help the open-access movement, Science Commons in 2007 developed its own suite of amendments to publishing contracts. The goal has been to ensure that “at a minimum, scholarly authors retain enough rights to archive their work on the Web. Every Science Commons Addendum ensures the freedom to use scholarly articles for educational purposes, conference presentations, in other scholarly works or in professional activities.”
To make the whole process easier for scientists, Science Commons developed the Scholar’s Copyright Addendum Engine. This point-and-click Web-based tool lets authors publish in traditional, subscription-based journals while retaining their rights to post copies on the Internet for download, without most copyright and financial restrictions. There are also options for “drag and drop” self-archiving to repositories such as MIT’s DSpace and the National Library of Medicine’s PubMed Central. Besides making self-archiving easier and more prevalent, Science Commons hopes to standardize the legal terms and procedures for self-archiving to avoid a proliferation of incompatible rights regimes and document formats. “The engine seems to be generating a dialogue between authors and publishers that never existed,” said John Wilbanks. “It’s not being rejected out of hand, which is really cool. To the extent that the addendum becomes a norm, it will start to open up the [contractual] limitations on self-archiving.”
Harvard University gave self-archiving a big boost in February 2008 when its faculty unanimously voted to require all faculty to distribute their scholarship through an online, open-access repository operated by the Harvard library unless a professor chooses to “opt out” and publish exclusively with a commercial journal. Robert Darnton, director of the Harvard library, said, “In place of a closed, privileged and costly system, [the open-access rule] will help open up the world of learning to everyone who wants to learn.”
By far, the more ambitious aspect of the Scholar’s Copyright project is the attempt to free databases from a confusing tangle of copyright claims. In every imaginable field of science — from anthropology and marine biology to chemistry and genetics — databases are vital tools for organizing and manipulating vast collections of empirical data. The flood of data has vastly increased as computers have become ubiquitous research tools and as new technologies are deployed to generate entirely new sorts of digital data streams — measurements from remote sensors, data streams from space, and much more. But the incompatibility of databases — chiefly for technical and copyright reasons — is needlessly Balkanizing research to the detriment of scientific progress. “There is plenty of data out there,” says Richard Wallis of Talis, a company that has built a Semantic Web technology platform for open data, “but it is often trapped in silos or hidden behind logins, subscriptions or just plain difficult to get hold of.” He added that there is a lot of data that is “just out there,” but the terms of access may be dubious.
Questions immediately arise: Can a database be legally used? Who owns it? Will the database continue to be accessible? Will access require payment later on? Since data now reside anywhere in the world, any potential user of data also has to consider the wide variations of copyright protection for databases around the world.
The question of how data shall be owned, controlled, and shared is a profoundly perplexing one. History has shown the virtue of sharing scientific data — yet individual scientists, universities, and corporations frequently have their own interests in limiting how databases may be used. Scientists want to ensure the integrity of the data and any additions to it; they may want to ensure preferential access to key researchers; companies may consider the data a lucrative asset to be privately exploited. Indeed, if there is not some mechanism of control, database producers worry that free riders will simply appropriate useful compilations and perhaps sell them or use them for their own competitive advantage. Or the free riders may fail to properly credit the scientists who compiled the data in the first place. Inadequate database protection could discourage people from creating new databases in the future.
A National Research Council report in 1999 described the problem this way: “Currently many for-profit and not-for-profit database producers are concerned about the possibility that significant portions of their databases will be copied or used in substantial part by others to create ‘new’ derivative databases. If an identical or substantially similar database is then either re-disseminated broadly or sold and used in direct competition with the original rights holder’s database, the rights holder’s revenues will be undermined, or in extreme cases, the rights holder will be put out of business.”
In the late 1990s, when the Human Genome Project and a private company, Celera, were competing to map the human genome, the publicly funded researchers were eager to publish the genome sequencing data as quickly as possible in order to prevent Celera or any other company from claiming exclusive control over the information. They wanted the data to be treated as “the common heritage of humanity” so that it would remain openly accessible to everyone, including commercial researchers. When Sir John Sulston of the Human Genome Project broached the idea of putting his team’s research under a GPL-like license, it provoked objections that ownership of the data would set a worrisome precedent. A GPL for data amounts to a “reach-through” requirement on how data may be used in the future. This might not only imply that data can be owned — flouting the legal tradition that facts cannot be owned — it might discourage future data producers from depositing their data into public databases.
The International HapMap Project attempted such a copyleft strategy with its database of genotypes; its goal is to compare the genetic sequences of different individuals to identify chromosomal regions where genetic variants are shared.
The basic problem with applying copyright law to databases is how to draw the line between what is private property and what remains in the commons. “If you try to impose a Creative Commons license or free-software-style licensing regime on a database of uncopyrightable facts,” explained John Wilbanks, “you create an enormous amount of confusion in the user about where the rights start and stop.”
For two years, Science Commons wrestled with the challenge of applying the CC licenses to databases. Ultimately, the project came to the conclusion that “copyright licenses and contractual restrictions are simply the wrong tool, even if those licenses are used with the best of intentions.” There is just too much uncertainty about the scope and applicability of copyright — and thus questions about any licenses based on it. For example, it is not entirely clear what constitutes a “derivative work” in the context of databases. If one were to query hundreds of databases using the Semantic Web, would the federated results be considered a derivative work that requires copyright permissions from each database owner? There is also the problem of “attribution stacking,” in which a query made to multiple databases might require giving credit to scores of databases. Different CC licenses for different databases could also create legal incompatibilities among data. Data licensed under a CC ShareAlike license, for example, cannot be legally combined with data licensed under a different license. Segregating data into different “legal boxes” could turn out to impede, not advance, the freedom to integrate data on the Web.
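The two problems named above, attribution stacking and license incompatibility, can be sketched with a toy model in Python. Everything in it is an assumption made for illustration: the database names are invented, the license labels are plain strings, and the compatibility rule is the simplified one stated in the paragraph (ShareAlike data cannot be combined with data under a different license), with public-domain data assumed to combine with anything. It is not a statement of how any license operates in law.

```python
from dataclasses import dataclass
from itertools import combinations

SHARE_ALIKE = "CC BY-SA"
# Labels treated here as imposing no conditions at all (an assumption).
NO_CONDITIONS = {"public domain"}


@dataclass(frozen=True)
class Database:
    name: str     # hypothetical database name
    license: str  # illustrative license label


def attributions_required(sources):
    """Return the databases that must be credited in a federated result."""
    return [db.name for db in sources if db.license not in NO_CONDITIONS]


def share_alike_conflicts(sources):
    """Return pairs that cannot be merged under the simplified rule in the text."""
    conflicts = []
    for a, b in combinations(sources, 2):
        if NO_CONDITIONS & {a.license, b.license}:
            continue  # unencumbered data is assumed to combine with anything
        if a.license != b.license and SHARE_ALIKE in (a.license, b.license):
            conflicts.append((a.name, b.name))
    return conflicts


# One federated query touching four invented databases.
sources = [
    Database("GenesDB", "CC BY"),
    Database("ProteinsDB", "CC BY-SA"),
    Database("TissuesDB", "CC BY-NC"),
    Database("SensorsDB", "public domain"),
]

print(attributions_required(sources))   # three credits for a single query
print(share_alike_conflicts(sources))   # two pairs that cannot be combined
```

Even with four sources the query accumulates three credit lines and two blocked combinations; with hundreds of databases both lists grow quickly, which is the practical burden the paragraph describes.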
After meeting with a variety of experts in scientific databases, particularly in the life sciences, biodiversity, and geospatial research, Science Commons came up with an ingenious solution to these gnarly difficulties. Instead of relying on either copyright law or licenses, Science Commons in late 2007 announced a new legal tool, CC0 (CC Zero), which creates a legal and technical platform for a scientific community to develop its own reputation system for sharing data.
CC0 is not a license but a set of protocols. The protocols require that a database producer waive all rights to the data based on intellectual property law — copyrights, patents, unfair competition claims, unfair infringement rights — a “quitclaim” that covers everything. Then it requires that the database producer affirmatively declare that it is not using contracts to encumber future uses of the data. Once a database is certified as complying with the protocols, as determined by Science Commons, it is entitled to use a Science Commons trademark, “Open Access Data,” and CC0 metadata. The trademark signals to other scientists that the database meets certain basic standards of interoperability, legal certainty, ease of use, and low transaction costs. The metadata is a functional software tool that enables different databases to share their data.
“What we are doing,” said John Wilbanks, “is reconstructing, contractually, the public domain. The idea is that with any conforming implementation — any licensed database — you have complete freedom to integrate with anything else. It creates a zone of certainty for data integration.”
To develop this scheme, Science Commons’s attorney Thinh Nguyen worked closely with Talis, a company that has built a Semantic Web technology platform for open data and developed its own open database license. Nguyen also worked with the company’s legal team, Jordan Hatcher and Charlotte Waelde, and with the Open Knowledge Foundation, which has developed the Open Knowledge Definition.
The CC0 approach to data represents something of a breakthrough because it avoids rigid, prescriptive legal standards for a type of content (data) that is highly variable and governed by different community norms. CC0 abandons the vision of crafting a single, all-purpose copyright license or contract for thousands of different databases in different legal jurisdictions. Instead it tries to create a legal framework that can honor a range of variable social norms that converge on the public domain. Each research community can determine for itself how to meet the CC0 protocols, based on its own distinctive research needs and traditions. Different norms can agree to an equivalency of public-domain standards without any one discipline constraining the behaviors of another.
The system is clever because it provides legal reliability without being overly prescriptive. It is simple to use but still able to accommodate complex variations among disciplines. And it has low transaction costs for both producers and users of data. Over time, the databases that comply with the CC0 protocols are likely to grow into a large universe of interoperable open data.
It is still too early to judge how well the CC0 program is working, but initial reactions have been positive. “The solution is at once obvious and radical,” said Glyn Moody, a British journalist who writes about open-source software. “It is this pragmatism, rooted in how science actually works, that makes the current protocol particularly important.” Deepak Singh, the co-founder of Bioscreencast, a free online video tutorial library for the scientific community, said, “I consider just the announcement to be a monumental moment.”
Every day there is so much new scientific literature generated that it would take a single person 106 years to read it all.
This visionary project, the so-called Semantic Web, aspires to develop a framework for integrating a variety of systems, so they can communicate with one another, machine to machine. The goal is to enable computers to identify and capture information from anywhere on the Web, and then organize the results in sophisticated and customized ways. “If you search for ‘signal transduction genes in parameter neurons,’ ” said John Wilbanks of Science Commons, “Google sucks. It will get you 190,000 Web pages.” The goal of the Semantic Web is to deliver a far more targeted and useful body of specialized information.
A key tool is the Uniform Resource Identifier, or URI, which is analogous to the Uniform Resource Locator, or URL, used by the Web. Affix a URI to any bit of information on the Web, and the Semantic Web will (so it is hoped) let you mix and match information tagged with that URI with countless other bits of information tagged with other URIs. It would not matter if the bit of information resides in a journal article, database, clinical image, statistical analysis, or video; the point is that the URI would identify a precise bit of information. By enabling cross-linking among different types of information, the idea is that scientists will be able to arrive at all sorts of unexpected and serendipitous insights.
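A minimal sketch of the idea, assuming facts are stored as simple (subject, predicate, object) tuples: because three unrelated sources use the same URI for the same gene, their facts can be merged without any further coordination. The URIs, predicates, and records here are hypothetical placeholders, not real identifiers.

```python
from collections import defaultdict

# Hypothetical URI and records, invented for illustration.
HTT = "http://example.org/gene/HTT"

journal_facts = [(HTT, "mentioned_in", "http://example.org/article/123")]
database_facts = [(HTT, "encodes", "http://example.org/protein/huntingtin")]
image_facts = [(HTT, "depicted_in", "http://example.org/image/42")]

def merge_by_uri(*sources):
    """Group (subject, predicate, object) facts from many sources by subject URI."""
    merged = defaultdict(list)
    for source in sources:
        for subject, predicate, obj in source:
            merged[subject].append((predicate, obj))
    return merged

merged = merge_by_uri(journal_facts, database_facts, image_facts)
for predicate, obj in merged[HTT]:
    print(predicate, "->", obj)
```

The shared identifier does the work here: nothing in the merge step needs to know whether a fact came from an article, a database, or an image archive.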
For example, geneticists studying Huntington’s disease, a rare neurodegenerative disorder, and experts studying Alzheimer’s disease are both exploring many of the same genes and proteins of the brain. But because of the specialization of their disciplines, the chances are good that they read entirely different scientific journals and attend different conferences. There is no easy or systematic way for scientists in one specialty to explore the knowledge that has developed in another specialty. The Semantic Web could probably help.
Unfortunately, for a grand dream that has been touted since the 1990s, very little has developed. The W3C has been embroiled in the design challenges of the Semantic Web for so long that many companies and computer experts now scoff at the whole idea of the Semantic Web. There have been so many arcane, inconclusive debates about computer syntax, ontology language, and philosophical design choices that no one is holding their breath anymore, waiting for the Semantic Web to arrive. (Wikipedia defines a computer ontology as “a data model that represents a set of concepts within a domain and the relationships between those concepts. It is used to reason about the objects within that domain.”) The vision of the Semantic Web may have the potential to revolutionize science, but few people have seen much practical value in it over the near term, and so it has garnered little support.
Wilbanks, who once worked at the W3C, was frustrated by this state of affairs. Although he has long believed in the promise of the Semantic Web, he also realized that it is not enough to extol its virtues. One must demonstrate its practicality. “The way to herd cats is not to herd cats,” he said, citing a colleague, “but to put a bowl of cream on your back stoop and run like hell.” For Wilbanks, the bowl of cream is the Neurocommons knowledge base, a project that seeks to integrate a huge amount of neuroscientific research using Semantic Web protocols while remaining easy to use.
“The way to overcome the inertia that the Semantic Web critics rightly point out, is not to sit down and argue about ontologies,” said Wilbanks. “It’s to release something that’s useful enough that it’s worth wiring your database into the commons system. If I want to get precise answers to complicated questions that might be found in my own database, among others, now I can do that. I simply have to wire it into the Neurocommons. You don’t need to come to some magical agreement about ontology; you just need to spend a couple of days converting your database to RDF [Resource Description Framework, a set of Semantic Web specifications], and then— boom! — I’ve got all of the other databases integrated with mine.” By getting the ball rolling, Science Commons is betting that enough neuroscience fields will integrate their literature to the Neurocommons protocols and make the new commons a lively, sustainable, and growing organism of knowledge.
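To give a rough sense of what "converting your database to RDF" can involve, the sketch below turns table rows into N-Triples, one of the plain-text RDF serializations. The base URI, column names, and rows are hypothetical, and this is not the Neurocommons's actual conversion process; a real conversion would reuse shared vocabularies rather than minting a predicate per column.

```python
# Hypothetical base URI and table, invented for illustration.
BASE = "http://example.org/neuro/"

rows = [
    {"id": "gene1", "symbol": "HTT", "disease": "Huntington's disease"},
    {"id": "gene2", "symbol": "APP", "disease": "Alzheimer's disease"},
]

def escape_literal(value):
    """Escape backslashes and double quotes for an N-Triples string literal.
    (Newlines and other control characters are not handled in this sketch.)"""
    return value.replace("\\", "\\\\").replace('"', '\\"')

def row_to_ntriples(row, key="id"):
    """Emit one N-Triples line per non-key column of a table row."""
    subject = f"<{BASE}{row[key]}>"
    lines = []
    for column, value in row.items():
        if column == key:
            continue
        lines.append(f'{subject} <{BASE}{column}> "{escape_literal(value)}" .')
    return lines

triples = [line for row in rows for line in row_to_ntriples(row)]
print("\n".join(triples))
```

Each row becomes a handful of self-describing statements, which is what lets a database be "wired in" alongside others without first agreeing on a shared schema.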
Using the “open wiring” of the Semantic Web, the Neurocommons has already integrated information from fifteen of the top twenty databases in the life sciences and neuroscience. The data have been reformatted to conform to Semantic Web protocols and the scientific literature, where possible, has been tagged so that it can be “text-mined” (searched for specific information via URI tags). “We have put all this stuff into a database that we give away,” said Wilbanks. “It’s already been mirrored in Ireland, and more mirrors are going up. It’s sort of like a ‘knowledge server,’ instead of a Web server.”
Commercial journal publishers already recognize the potential power of owning and controlling metadata in scientific literature and datasets. To leverage this control many are starting to make copyright claims in certain kinds of metadata, and to amend their contracts with libraries in order to limit how they may retrieve electronic information. “There is a lot at stake here,” says Villanova law professor Michael Carroll. “What Science Commons wants to do is make sure that metadata is an open resource.”
Wilbanks has high hopes that the Neurocommons project, by providing a useful demonstration of Semantic Web tools, will hasten the interoperability of specialized knowledge that is currently isolated from related fields. It comes down to how to motivate a convergence of knowledge. Instead of arguing about which discipline’s ontology of specialized knowledge is superior to another’s — and making little headway toward a consensus — Wilbanks has a strategy to build a knowledge tool that is useful. Period. His bet is that a useful “knowledge server” of integrated neuroscientific information will be a powerful incentive for adjacent disciplines to adapt their own literature and databases to be compatible. The point is to get the commons going — while allowing the freedom for it to evolve. Then, if people have disagreements or quibbles, they will be free to change the ontologies as they see fit. “The version [of the Neurocommons] that we are building is useful and it is free,” Wilbanks said. “That means that if you want to integrate with it, you can. It means that if you want to redo our work your way, you can— as long as you use the right technical formats. You can reuse all of our software.”
The problem with a field like neuroscience, which has so many exploding frontiers, is that no single company or proprietary software platform can adequately manage the knowledge. The information is simply too copious and complex. Like so many other fields of knowledge that are large and complicated, it appears that only an open-source model can successfully curate the relevant information sources. A Web-based commons can be remarkably efficient, effective, and scalable. This has been the lesson of free and open-source software, wikis, and the Web itself. Although it is too early to tell how the Neurocommons project will evolve, the initial signs are promising. A number of foundations that support research for specific diseases — Alzheimer’s disease, Parkinson’s, autism, epilepsy, Huntington’s disease — have already expressed interest in the Neurocommons as a potential model for advancing research in their respective fields.
Science is not just about text and data, of course. It also involves lots of tangible stuff needed to conduct experiments. Typical materials include cell lines, monoclonal antibodies, reagents, animal models, synthetic materials, nano-materials, clones, laboratory equipment, and much else. Here, too, sharing and collaboration are important to the advance of science. But unlike digital bits, which are highly malleable, the physical materials needed for experiments have to be located, approved for use, and shipped. Therein lies another tale of high transaction costs impeding the progress of science. As Thinh Nguyen, counsel for Science Commons, describes the problem:
The ability to locate materials based on their descriptions in journal articles is often limited by lack of sufficient information about origin and availability, and there is no standard citation for such materials. In addition, the process of legal negotiation that may follow can be lengthy and unpredictable. This can have important implications for science policy, especially when delays or inability to obtain research materials result in lost time, productivity and research opportunities.
To the nonscientist, this transactional subculture is largely invisible. But to scientists whose lab work requires access to certain physical materials, the uncertainties, variations, and delays can be crippling. Normally, the transfer of materials from one scientist to another occurs through a Material Transfer Agreement, or MTA. The technology transfer office at one research university will grant, or not grant, an MTA so that a cell line or tissue specimen can be shipped to a researcher at another university. Typically, permission must be granted for the researcher to publish, disseminate, or use research results, and to license their use for commercialization.
While certain types of transactions involve material that could conceivably generate high royalty revenues, a great many transactions are fairly low-value, routine transfers of material for basic research. Paradoxically, that can make it all the harder to obtain the material because consummating an MTA is not a high priority for the tech transfer office. In other cases, sharing the material is subject to special agreements whose terms are not known in advance.
Corporations sometimes have MTAs with onerous terms that prevent academic researchers from using a reagent or research tool. Individual scientists sometimes balk at sharing a substance because of the time and effort needed to ship it. Or they may wish to prevent another scientist from being the first to publish research results. Whatever the motivation, MTAs can act as a serious impediment to verification of scientific findings. They can also prevent new types of exploratory research and innovation.
Wilbanks describes the existing system as an inefficient, artisanal one that needs to become more of a streamlined industrial system. Just as Creative Commons sought to lower the transaction costs for sharing creative works, through the use of standard public licenses, so Science Commons is now trying to standardize the process for sharing research materials. The idea is to reduce the transaction costs and legal risks by, in Nguyen’s words, “creating a voluntary and scalable infrastructure for rights representation and contracting.”
There are already some successful systems in place for sharing research materials, most notably the Uniform Biological Material Transfer Agreement (UBMTA), which some 320 institutions have accepted, as well as a Simple Letter Agreement developed by the National Institutes of Health. The problem with these systems is that they cannot be used for transfers of materials between academic and for-profit researchers. In addition, there are many instances in which UBMTA signatories can opt out of the system to make modifications to the UBMTA on a case-by-case basis.
To help standardize and streamline the whole system for sharing research materials, Science Commons is working with a consortium of ten research universities, the iBridge Network, to develop a prototype system. The hope is that by introducing metadata to the system, and linking that information to standard contracts and human-readable deeds, scientists will be able to acquire research materials much more rapidly by avoiding bureaucratic and legal hassles. Just as eBay, Amazon, and Federal Express use metadata to allow customers to track the status of their orders, so the Science Commons MTA project wants to develop a system that will allow searching, tracking, and indexing of specific shipments. It is also hoped that metadata links will be inserted into journal articles, enabling scientists to click on a given research material in order to determine the legal and logistical terms for obtaining the material.
Wilbanks envisions a new market of third-party intermediaries to facilitate materials transfers: “There’s an emerging network of third parties — think of them as ‘biology greenhouses’ — who are funded to take in copies of research materials and manufacture them on demand — to grow a quantity and mail them out. What Science Commons is trying to do with the Materials Transfer Project is to put together a functional system where materials can go to greenhouses under standard contracts, with digital identifiers, so that the materials can be cross-linked into the digital information commons. Anytime you see a list of genes, for example, you will be able to right-click and see the stuff that’s available from the greenhouses under standard contract, and the cost of manufacture and delivery in order to access the tool. Research materials need to be available under a standard contract, discoverable with a digital identifier, and fulfillable by a third party. And there needs to be some sort of acknowledgment, like a citation system.”
At one level, it is ironic that one of the oldest commons-based communities, academic science, has taken so long to reengineer its digital infrastructure to take advantage of the Internet and open digital systems. Yet academic disciplines have always clung tightly to their special ways of knowing and organizing themselves. The arrival of the Internet has been disruptive to this tradition by blurring academic boundaries and inviting new types of cross-boundary research and conversation. If only to improve the conversation, more scientists are discovering the value of establishing working protocols to let the diverse tribes of science communicate with one another more easily. Now that the examples of networked collaboration are proliferating, demonstrating the enormous power that can be unleashed through sharing and openness, the momentum for change is only going to intensify. The resulting explosion of knowledge and innovation should be quite a spectacle.
Copyright: © 2008 by David Bollier All rights reserved. No part of this book may be reproduced, in any form, without written permission from the publisher. The author has made an online version of the book available under a Creative Commons Attribution-NonCommercial license. It can be accessed at http://www.viralspiral.cc and http://www.onthecommons.org. Requests for permission to reproduce selections from this book should be mailed to "Permissions Department, The New Press, 38 Greene Street, New York, NY 10013". Published in the United States by The New Press, New York, 2008 Distributed by W. W. Norton & Company, Inc., New York ISBN 978-1-59558-396-3 (hc.) CIP data available The New Press was established in 1990 as a not-for-profit alternative to the large, commercial publishing houses currently dominating the book publishing industry. The New Press operates in the public interest rather than for private gain, and is committed to publishing, in innovative ways, works of educational, cultural, and community value that are often deemed insufficiently profitable. www.thenewpress.com A Caravan book. For more information, visit www.caravanbooks.org.