Tuesday, June 7, 2011

eScience and Librarianship: An exciting new opportunity

I've been on the conference circuit for the last couple of months, sometimes speaking and sometimes listening. Certainly one of the topics being much discussed by librarians is that of eScience.

It represents a very interesting opportunity for librarianship. I've summarized below what I've heard and some thoughts on the subject that I'm sharing with others when asked.

It is my opinion that eScience is a major new opportunity for librarianship. It is one we should eagerly embrace, help define and ultimately lead in many respects. It's important for us to realize that eScience is not a new field of science. Instead, it is a set of methodologies, processes and procedures intended to empower scientists to do their research faster, better and in different ways. It takes science and makes it a communal and participatory process.

eScience has emerged because research is increasingly computational in nature. The result is massive stores of digital data, along with the software and software infrastructure needed to manage, analyze, and produce that data.

While data was previously tied to a specific hypothesis, now it is subject to reuse, re-testing and recombination with other data. Furthermore, whereas in the past we needed to bring the data to the computer, now we need to centrally locate the data and take the computation to the data. The power of networks, cloud computing and numerous other variables are bringing about important changes with regard to handling eScience data.

The intersection of eScience & librarianship ends up largely being about data: networked data that drives science. We need to store it, find it, retrieve it and make sure it is reusable and, ultimately, can be combined and analyzed with other eScience data.

This intersection brings about many new and emerging roles for librarianship in dealing with that e-data. As librarians we need to address these needs by understanding the data, curating it, helping create relationships between data sets that researchers don't see on their own, and creating the metadata to assist in finding and reusing it. We also need to understand that librarians must become researchers themselves, so they can learn and see how to add value to data sets.

This is not to say that eScience doesn’t come with major challenges, because it does. However, therein lies the chance for us as librarians to apply our skills and to help bring together the answers needed.

Here are some of the challenges we’re facing.

1. Right now, we have vast amounts of data being generated without organization, description, preservation, or curation. As a result it is very easily lost. Cloud computing and digital preservation systems bring us some answers. Cloud computing brings along economy of scale and will help to make overall prices far more affordable for computing, storage, network, preservation and overall administration. Yet, we have challenges in how to move the data into the clouds because this poses major issues for our networks. Digital preservation systems now exist to ensure future access to data, but these systems bring along with them a whole host of new requirements for librarianship.

2. For instance, eScience data requires some subject expertise in order to understand it and to make it useful. As librarians, we’ll need to understand the standards, practices, values, norms and culture of the research field. We’ll need to teach users the basic concepts of databases, how to query them and the file formats involved. We’ll need to understand which formats and data types are appropriate for the different research questions the researchers are posing.

3. We also have to worry about data ownership and rights. In order to do that we’ll need to understand the ethics of the field. We’ll have to work with people and train them to acknowledge the data they reuse. Certainly we’ll be able to apply our knowledge of intellectual property, privacy and confidentiality to the eScience field, but it won’t be easy. eScience will bring new players and problems to the copyright discussions.

4. eScience data needs to be opened up. We need a strong push for data not to be stored behind firewalls so it can more readily be data mined, analyzed and reused. Having the data be open will help ensure it is reusable and not tied to any specific software silos. Data sets tied to proprietary research software, as many are today, are increasingly becoming a problem for researchers. Much of science is still done using proprietary software and this will continue to be the case. However, the problem is that the resulting data sets also become proprietary. When this happens, making the data reusable means we either need to map it to an open format, which might strip out some of the usefulness of the data, or we need to also store a copy of the software, which brings along a whole host of issues concerning licenses and use of the software by others, not to mention the need to create emulation environments in order to reuse that data. There are some initial efforts underway here with regard to data. For instance:

a. The Open Data Protocol website says: “The Open Data Protocol (OData) is a Web protocol for querying and updating data that provides a way to unlock your data and free it from silos that exist in applications today. OData is being used to expose and access information from a variety of sources including, but not limited to, relational databases, file systems, content management systems and traditional Web sites. It makes a deep commitment to URIs for resource identification and commits to an HTTP-based, uniform interface for interacting with those resources (just like the Web). This commitment to core Web principles allows OData to enable a new level of data integration and interoperability across a broad range of clients, servers, services, and tools. OData is released under the Open Specification Promise to allow anyone to freely interoperate with OData implementations.”

b. DataCite is another such effort. Their website says: "The aim of DataCite is: i.) establish easier access to scientific research data on the Internet, ii.) increase acceptance of research data as legitimate, citable contributions to the scientific record, iii.) support data archiving that will permit results to be verified and re-purposed for future study."

c. Finally, because we frequently find research by the researcher, another important effort is ORCID. Name ambiguity and attribution are persistent, critical problems embedded in the scholarly research ecosystem. Per their website: “The ORCID Initiative represents a community effort to establish an open, independent registry that is adopted and embraced as the industry’s de facto standard. The goal is to resolve the systemic name ambiguity, by means of assigning unique identifiers linkable to an individual's research output, to enhance the scientific discovery process and improve the efficiency of funding and collaboration.”

5. Right now, we have way too many publishers and institutions hurting science because they're hiding content behind firewalls. In order for us to fully exploit the data and to build new knowledge, we need the ability to do data mining across publisher and institution content. Progress is needed here.

6. Discovery and Acquisition of Data. Today, and until we can move to centralized repositories for data, we need to locate disciplinary data repositories. We need to assist in importing data and converting it when necessary so it can be used by a downstream process. And we need to work with researchers to have their data sets stored in central repositories that will ultimately provide for the preservation of that data as well.

7. Data management and organization. We’ll need to understand the lifecycle of data resulting from research. As part of this we’ll need to help outline and develop data management plans and keep track of the relation of subsets or processed data to the original data set. In North America, the call by the National Science Foundation (NSF) for data management plans as part of grant applications has created a flurry of activity and work in this area. But note, these are only plans at this point; there is no actual call yet to implement those plans. Yet, we as librarians will need to work with researchers to create a standard operating procedure for data management and documentation.

8. Data Conversion and Interoperability. While the immediate pressure here is to keep the data usable short term, this is also a huge ongoing issue in terms of data preservation. So, as librarians, we’ll need to become proficient in migrating data from one format to another or knowing about the tools that will do this. We’ll need to understand the risk of losing or corrupting information when changing data formats. Furthermore, we’ll need to be able to explain those risks and benefits to researchers so they understand and support making data available in standard formats for use downstream.

9. Metadata. Here we must apply our existing skills to these new data sets. We’ll combine our skills at creating metadata with newly acquired subject expertise so that we can proficiently annotate and describe data so it can be understood and used by other workgroups and external users. To do that, however, we really need to understand the structure and purpose of the data, which underscores our need for subject expertise.

10. Quality Assurance of data. This is a huge challenge, but once we accept the data into our repositories and have committed to preserving it, it will be too late to blame problems down the road on the quality of the data we accepted. So, we’ll need to resolve any apparent artifacts and any incomplete or corrupted data sets on the front end. The good news here is that there are digital preservation systems today that will support these kinds of tools being used.

11. Data Preservation and Re-Use. We need to educate researchers to recognize that data may have value beyond the original purpose. It might be used to validate research or for later reuse. Digital preservation is complex and can be costly, but the technology now exists. We’ll need to help researchers understand the benefits and costs, and we’ll need to be able to recommend best practices. This is vital to support community driven e-research. Bottom line, as data is created, it must be prepared for its eventual reuse.

12. Data Analysis/Visualization. This is an emerging field for librarianship, but it is rapidly becoming a vitally important one for science. These tools allow data mining and detailed analysis as well as taking that data and creating visual maps of it so as to increase our knowledge of what it is capable of telling us. Again, to do this we’ll need to learn and know basic analysis and visualization tools that can be used with data.
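Two of the challenges above, keeping track of how a subset relates to its original data set (7) and catching corrupted data on the front end (10), can be made a little more concrete with a small sketch. This is only an illustration under my own assumptions, not how any particular preservation system works: the use of SHA-256 checksums, the record fields, the function names and the sample file names are all hypothetical.

```python
import hashlib
import json


def sha256_of(data):
    """Return the SHA-256 hex digest of a data set's bytes."""
    return hashlib.sha256(data).hexdigest()


def make_record(name, data, derived_from=None):
    """Build a minimal, hypothetical description of a data set.

    derived_from holds the checksum of the parent data set, if any,
    so a subset can always be traced back to its original.
    """
    return {
        "name": name,
        "sha256": sha256_of(data),
        "size_bytes": len(data),
        "derived_from": derived_from,
    }


def is_intact(record, data):
    """Check that the bytes still match the checksum taken at ingest."""
    return record["sha256"] == sha256_of(data)


# Made-up sample data: an original data set and a subset taken from it.
raw = b"station,height_m\nA,1.2\nB,3.4\n"
subset = b"station,height_m\nB,3.4\n"

raw_record = make_record("tide-gauge-raw.csv", raw)
subset_record = make_record(
    "tide-gauge-subset.csv", subset, derived_from=raw_record["sha256"]
)

print(json.dumps(subset_record, indent=2))
print("raw copy intact:", is_intact(raw_record, raw))
print("damaged copy intact:", is_intact(raw_record, raw + b"corrupted"))
```

The idea is simply that a checksum recorded when the data is accepted gives us something to test later copies against, and that storing the parent's checksum in the subset's record keeps the relationship between the two explicit.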

To summarize: eScience creates all kinds of new roles for librarians. Data scientist, subject expert, librarian, creator, curator, translator and manager are all possible roles. There are also times we’ll need to serve as a broker, negotiating things between researchers. We’ll be building new collections as part of this: major research data repositories that also provide data preservation capabilities. Again, this will allow us to apply skills we have in some new and exciting ways.

That’s a lot to do and much of it will require a great deal of work. But the most critical need today is to be sure you have a seat at the table anytime and anywhere a discussion is started that will result in the creation of digital research data that needs to be maintained and managed. Don’t wait until data is at the output stage. That is too late. You will need to start by working with researchers to assemble those data management plans, including what they consist of and how they’re done.

You may want to start implementing those plans, but I would heartily recommend that you work across your organizations to build support for data management before you implement a plan. The goal of this early process is to produce a data management plan and share it with the researcher or researchers.

For that to be successful, you’ll likely need to educate administrators about the costs and staffing required, the increasing legal issues involved in these plans and the training and tools your staff will need to implement those plans. It is equally important that, in taking those steps, you explain to them how librarianship is adding value to the data along the way. Librarians need to be thought of as researchers and an essential part of any research project. We want researchers to automatically think of librarians as members of their research team.

I’ll close with a story that Jaron Lanier, the author of “You Are Not a Gadget”, told at ACRL. He pointed out that in Japan, as you would expect, there had been a lot of research into tsunamis by hydrologists. Yet, all the models they developed were never coordinated with geologists’ data concerning earthquakes. He noted that researchers are understandably very focused people. Yet there is an opportunity here and an important one that he felt was missed. He suggested that if we, as librarians, had pointed out the relationship between those two sets of data and helped bring those groups of researchers together, think of what they would have known about what could (and did) happen. Think of what we could have helped prevent, if we’d just helped those researchers connect their data sets.

Jaron Lanier was right - it shows us the possibilities and why we should embrace this opportunity.