ARCHIVED: Project: HathiTrust (Shared Digital Repository)

This content has been archived, and is no longer maintained by Indiana University. Information here may no longer be accurate, and links may no longer be available or reliable.

Primary UITS contact: Robert McDonald

Last update: February 4, 2010

Description: The HathiTrust leverages the tradition of leadership in collaboration among the institutions of the Committee on Institutional Cooperation (CIC). The HathiTrust operates under the leadership of the Repository Administrators (Indiana University and the University of Michigan), which also provide a large part of the funding. Additional governance and financial support are provided by the charter participating libraries of the CIC, and by other libraries and library consortia wishing to archive digital content.

Outcome: The HathiTrust offers persistent and high-availability storage for digitized book and journal content, beginning with the Google content from the CIC members and later extending to other digitized content. It will leverage technology investments and developments at the University of Michigan to build (through IU/UM collaboration) more generalized versions of Michigan's services and gain efficiencies from Michigan's investments.

HathiTrust governance: The executive management committee of the HathiTrust meets monthly and continues to work on a variety of issues ranging from HathiTrust finances to development priorities. The first meeting of the Operational Advisory Board took place in June. The agenda focused on a review of the CIC Steering Committee's Short- and Long-Term Functional Objectives and, where appropriate, status reports. It was agreed that some of these items would be best addressed by CIC collaborations, while others are the responsibility of the centrally funded effort. The CIC will soon convene a committee to help better define the objective of creating a public interface for the HathiTrust.

We continue to have productive conversations with several other institutions about possible participation in the HathiTrust, and hope to provide information on our progress in this regard in future updates.

News:

General news

Data sets: Sample data sets containing the OCR of volumes in HathiTrust are now available. These data sets are provided in the same directory structure and format as they are stored in the repository. They are intended to give researchers the opportunity to develop routines that can be run later on larger portions of the corpus. Interested parties should contact hathitrust-datasets@umich.edu with a description of the research they intend to conduct. More information is available at HathiTrust Datasets.
Coordination between UM and UC staff: Collaboration between teams at the University of Michigan and the University of California ramped up significantly in March, in preparation for ingest of content from the University of California. Weekly conference calls sped the teams' progress through a checklist of ingest items, including coordination of bibliographic information, inclusion of coordinate data for OCR files, and reporting on ingested volumes.
Ingest from Indiana University: Bibliographic metadata from Indiana University has been received at the University of Michigan, and is being loaded into local systems. Once the metadata is loaded, ingest of content will begin.
HathiTrust growth: Ingest rates decreased in March, with just under 130,000 volumes entering the repository. As in previous months, this decrease reflects the fact that ingest rates are matching the output of digital content from the University of Michigan and the University of Wisconsin. When ingest of content from the University of California and Indiana University begins (projected for April), ingest rates will rise closer to our planned capacity of 500,000 volumes per month.

Deployment status

Establishing Indiana mirror site: Deployment of indexing and access systems on the Indiana University repository instance was completed in March. The repository is now a fully functioning mirror of the site at the University of Michigan with load balancing and fail-over.

Development update

Storage: The partners purchased additional storage for the Michigan and Indiana sites in March. The new storage will be installed in April and May, respectively, bringing both environments to approximately 320TB of capacity.
Large-scale search: We are using the results of large-scale search testing done so far to develop a hardware configuration for production Solr infrastructure. Investigations continue into software solutions for improving response times for slow queries.
Data API: The first draft of a functional specification for the HathiTrust Data API is complete and has been made available publicly on HathiTrust Data API for feedback. Work on the implementation of this specification is underway and will continue in parallel as feedback is received.
Public discovery interface: Initial development of the temporary beta catalog for HathiTrust is nearly complete, and the catalog will be released within the next several weeks. It will provide bibliographic search and faceted browse of all volumes in HathiTrust, integrating with the HathiTrust Page Turner to provide access to individual items. Integration with the Collection Builder application will be completed in a second phase of development.

Growth

129,819 new volumes were added in March 2009.
As of April 1, 2009, the repository contained a total of 2,780,007 volumes.
30,758 public domain volumes were added in March, bringing the total number of public domain volumes to 433,641 (15% of the total content).
Ingest of Wisconsin materials continued. As of April 1, 2009, HathiTrust contained 168,098 Wisconsin volumes.
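
The growth figures above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch that uses only the numbers quoted in this section; the variable and function names are illustrative and are not part of any HathiTrust tool. Note that the public domain share works out to about 15.6%, which the report states as 15%.

```python
# Figures quoted in the Growth section (state as of April 1, 2009).
total_volumes = 2_780_007        # total volumes in the repository
new_volumes = 129_819            # volumes added in March 2009
public_domain_total = 433_641    # total public domain volumes
public_domain_new = 30_758       # public domain volumes added in March
wisconsin_volumes = 168_098      # Wisconsin volumes in the repository


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Repository size at the start of March, derived from the quoted figures.
previous_total = total_volumes - new_volumes  # 2,650,188

pd_share = share(public_domain_total, total_volumes)
pd_new_share = share(public_domain_new, new_volumes)
wisconsin_share = share(wisconsin_volumes, total_volumes)

print(f"Volumes before March ingest: {previous_total:,}")
print(f"Public domain share of repository: {pd_share:.1f}%")
print(f"Public domain share of March ingest: {pd_new_share:.1f}%")
print(f"Wisconsin share of repository: {wisconsin_share:.1f}%")
```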

Forecast for April development

Continue to investigate ways to improve performance for slow queries in large-scale search.
Continue work on the HathiTrust Data API specification and gather input from a broader audience.
Continue coding the initial Data API implementation.
Complete initial development of the temporary public beta catalog for HathiTrust.

Outages:

PLEASE NOTE: To receive information about unscheduled outages, contact Chris Butchart-Bailey (chrisbu at umich.edu) with the email addresses of individuals or groups that should be added to our system outage mailing list.

We schedule system maintenance work that requires a system outage during time windows (in Eastern time) when academic user activity is generally lowest:

For major work, Friday evenings (8pm-1am) and Sunday mornings (5am-10am);
For minor work, weekdays from 6:30am-8am.
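
The maintenance windows above can be expressed as a small check. This is only a sketch of the schedule as stated here, not a tool used by HathiTrust: the function name is hypothetical, the input is assumed to be a naive datetime already expressed in Eastern time, and each window is treated as including its start time and excluding its end time.

```python
from datetime import datetime, time


def maintenance_window(when):
    """Classify a naive Eastern-time datetime as "major", "minor", or None."""
    day = when.weekday()  # Monday == 0 ... Sunday == 6
    t = when.time()

    # Major work: Friday 8pm through 1am, which runs past midnight into Saturday.
    if day == 4 and t >= time(20, 0):
        return "major"
    if day == 5 and t < time(1, 0):
        return "major"

    # Major work: Sunday 5am-10am.
    if day == 6 and time(5, 0) <= t < time(10, 0):
        return "major"

    # Minor work: weekdays 6:30am-8am.
    if day <= 4 and time(6, 30) <= t < time(8, 0):
        return "minor"

    return None


# A Tuesday at 7am Eastern falls in the minor-work window.
print(maintenance_window(datetime(2009, 3, 3, 7, 0)))
```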

Notice of scheduled outages is given on business days, at least 24 hours in advance. Notice of unscheduled outages is given upon discovery, and additional updates are given as appropriate.

Outages in March: HathiTrust was unavailable on Tuesday, March 3 from 7-8am EST and on Thursday, March 5 from 7-7:45am EST for operating system and database software upgrades.
Outages planned for April/May: No outages are planned at this time.

HathiTrust Short- and Long-Term functional objectives: April 10, 2009

Short-term functional objectives:

Page turner mechanism: A page turner has been deployed for all content in HathiTrust. We hope to report soon on a strategy to re-engineer the current page turner application so that it provides access to materials in HathiTrust through an API. The intention is to provide a wider variety of functions or modes of access to the collections than are currently available. A first draft of the functional specification for the data API was completed in January. Following internal discussion and revision, it was released in April on the HathiTrust website for broader comment. It is now available at HathiTrust Data API. Feedback on the specification is requested and should be sent to hathitrust-info@umich.edu. (Note that this API is separate from the API for extracting metadata from HathiTrust described below).
Branding (overall initiative; individual libraries): After consultation with our partners, we released several new elements that provide support for branding in the HathiTrust repository. These elements include:
- The page turner now prominently identifies the HathiTrust initiative.
- A watermark on every page identifies the digitizing agent.
- A watermark on every page identifies the source library of the print material.
- The source of the print material is included in our feed of bibliographic identifiers so that institutions can import or update records with this information.
- Finally, we will soon be adding an element identifying the relevant partner institution for patrons of that institution.
Format validation, migration, and error-checking: Format validation and error-checking are currently performed for all content that enters HathiTrust. Although, to date, no migration of content has been necessary, we believe that we have mitigated this need by choosing rich, flexible, standards-based formats. We have performed the work required to store a variety of technical and digital preservation metadata along with each object in order to aid in migration should it become necessary. Finally, the Isilon storage automatically conducts periodic parity and media checks in the background, a relatively uncommon feature in storage systems and one of the reasons this storage system was seen as an appropriate match for the project.
Development of APIs that will allow partner libraries to access information and integrate it into local systems individually: The HathiTrust partners identified the need for a mechanism by which a bibliographic identifier (e.g., an ISBN or OCLC number) can be submitted to a HathiTrust API and resolved as a persistent URL with information about levels of access (e.g., full text or search only). A preliminary version of such an API has been released, and is being implemented in the online catalogs of several partners. For more information, see HathiTrust Rights API.
A second API, known as the HathiTrust Data API, is available to provide secure access to HathiTrust data and metadata resources. Making these resources available to client applications (examples of current applications are the HathiTrust Collection Builder and Page Turner) will enable the creation of additional services and uses of repository materials. The specification for the Data API is available at HathiTrust Data API.

Other similar APIs will be developed as needed in the future.
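
To illustrate the kind of lookup described for the first API (a bibliographic identifier goes in; a persistent URL and an access level come out), here is a minimal in-memory sketch. It is not the HathiTrust API: the class, the function, the record table, and the example.org URLs are all hypothetical placeholders, and the identifier values are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Resolution:
    """Result of resolving a bibliographic identifier (hypothetical shape)."""
    persistent_url: str
    access: str  # "full text" or "search only", the two levels named above


# Hypothetical lookup table keyed by (identifier type, value).
# Every identifier and URL here is a placeholder, not real data.
_RECORDS = {
    ("oclc", "0000001"): Resolution("https://example.org/volume/1", "full text"),
    ("isbn", "0000000000"): Resolution("https://example.org/volume/2", "search only"),
}


def resolve(id_type: str, value: str) -> Optional[Resolution]:
    """Return the resolution for an identifier, or None if it is unknown."""
    return _RECORDS.get((id_type.lower(), value.strip()))


result = resolve("OCLC", "0000001")
if result is not None:
    print(result.persistent_url, "-", result.access)
```

A catalog integration would presumably call a remote service rather than a local table; the sketch only shows the shape of the request and response that the paragraph describes.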
Access mechanisms for persons with disabilities: HathiTrust has deployed an interface for visually impaired users (optimized for use with JAWS and other screen readers). This interface presents to the user the entire text version, with navigation, on one screen. Staff members at the University of Michigan are currently working with UM School of Information interns to optimize this interface for use with screen readers and to improve the general accessibility of the Page Turner. For in-copyright resources, access is currently limited to authorized users at the University of Michigan. We plan to add Shibboleth support to the HathiTrust repository so that resources such as access mechanisms for persons with disabilities can tie into the authentication environments of our partner institutions.
Public 'Discovery' Interface for HathiTrust: HathiTrust has initiated a multi-stage strategy to create a "public interface" mechanism, an interface with which digital books and journals in the HathiTrust repository can be discovered and accessed.
- The first phase of this effort is the creation of a temporary public beta of a comprehensive bibliographic search, to be made available in April 2009. In the temporary public beta, we will provide bibliographic search and faceted browse of all content in HathiTrust, with the ability to restrict to all public domain resources or volumes digitized from a specific institution's collection. This public beta will also serve as a real-world proof of concept for the second phase.
- A second phase has begun and involves active planning discussions with OCLC on the creation of a "catalog" for HathiTrust. Chaired by Lee Konrad (Wisconsin) and John Butler (Minnesota), this group will create specifications for adaptation of WorldCat Local (WCL) for HathiTrust. The deployment of the HathiTrust WCL interface is scheduled for early 2010, with work ongoing throughout 2009.
- Subsequently, we will work to integrate this bibliographic discovery mechanism with full text searching. As this work progresses, we will provide updates in this space.
Ability to publish virtual collections: Vast bodies of digital content benefit from methods to gather together subsets into "collections" that can be searched and browsed. HathiTrust has created an early release of a Collection Builder that permits individuals to create public (i.e., shared) and private collections. We will turn our attention to creating mechanisms by which persons such as bibliographers can create and share collections with a more formal identity (imagine, for example, having full text resources associated with classic bibliographies such as the Wing or Pollard and Redgrave short title lists). We are now performing intensive usability review on the Collection Builder. Although the Collection Builder's authentication and authorization currently rely on the University of Michigan "friend account" guest login system (see How to Set Up a Friend Account for Guest Access to U-M Computing Resources), we will work to add Shibboleth support to the HathiTrust repository so that resources like the Collection Builder can tie into different authentication environments.
Mechanism for direct ingest of non-Google content: We are polling partner institutions for candidate digital book and journal collections that might be used for the creation of an ingest mechanism for content not digitized by Google.

Long-term functional objectives:

Compliance with required elements in the Trustworthy Repositories Audit and Certification (TRAC) criteria and checklist: HathiTrust has addressed most of the minimum required elements in the TRAC criteria and checklist. All of the required elements will receive ongoing attention, with incomplete items being assigned the highest priority. In addition, the Center for Research Libraries and HathiTrust have made plans for an independent assessment of the HathiTrust repository, based largely on the TRAC criteria. The assessment will take place during the summer of 2009.
Robust discovery mechanisms like full-text cross-repository searching: In January and February, our experiments in large-scale search shifted from exploring different hardware configurations to load testing. In March, we began using the results of large-scale search testing done so far to develop a hardware configuration for production Solr infrastructure. Investigations also continued into software solutions for improving response times for slow queries. Summaries of monthly progress in search benchmarking are available at Large-Scale Search. We continue to work toward a goal of being able to specify the hardware and software required to support full text searching (with Solr) of all volumes projected to be in the repository.
Development of an open service definition to make it possible for partner libraries to develop other secure access mechanisms and discovery tools: We believe that the great wealth of resources that HathiTrust now makes available can only be effectively exploited through the creation of an open service definition that makes it possible for others to create new tools and approaches to access. As a first step, we intend to create a parallel production system that does not compromise the content in the repository, and gives developers access to the functions of the HathiTrust repository system. We hope that the availability of this development sandbox will make it possible for partner institutions to collaborate in creating new services through, for example, new or expanded APIs. The HathiTrust Data API is an example of this. A draft functional specification of the Data API has been completed, and is available now for public comment at HathiTrust Data API (more information on the Data API is available above in the Short-term Objectives). Future strategies may also include the implementation of Fedora as part of the repository management infrastructure. Updates on our progress will appear in this report.
Support for formats beyond books and journals: Our first "content" priority is support for digitized books and journals, but we believe that HathiTrust must expand its support to other formats (particularly born-digital publications) and materials. This is an area of future work.
Development of data mining tools for HathiTrust and use by HathiTrust of other analysis tools from other sources: Because of the vast bodies of content held by HathiTrust, an important function of the HathiTrust repository will be to support data mining and other forms of large-scale analysis. As a first step toward this goal, HathiTrust has made sample data sets of two different sizes available to researchers for computational processing and analysis. The first sample is available to all researchers through an application process. The second sample will be available to participants in the Digging Into Data Challenge. The samples are described below:
- Sample 1: The first sample is composed of 5,000 texts, which may be requested in one of three bundles. Texts in all bundles are pre-1923 (pre-1869 for works published outside of the United States). The bundles are as follows:
  - A random sample representing four character sets and five languages (Arabic, English, French, Japanese, and Russian)
  - A random sample of English language literary and historical texts
  - A random sample of Classics texts, including original language texts and translations.
- Sample 2 - Digging Into Data: A second sample of 50,000 texts will be made available for participants in the Digging into Data Challenge. The corpus represents a mix of dates (as above, all pre-1923, and pre-1869 for materials published outside the United States), countries of origin, languages, character sets, and formats (i.e., some serial literature in a body of mostly monographic literature). More information about these data sets, as well as specifications of file formats and modes of access, will be posted soon on HathiTrust.org.
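
The date rule shared by both samples (pre-1923, or pre-1869 for works published outside the United States) can be written as a one-line check. This is a sketch of the rule as stated above, not code from HathiTrust; the function and parameter names are illustrative assumptions.

```python
def in_sample_date_range(pub_year: int, published_in_us: bool) -> bool:
    """Return True if a work's publication year meets the stated sample cutoff."""
    # Pre-1923 for works published in the United States, pre-1869 otherwise.
    cutoff = 1923 if published_in_us else 1869
    return pub_year < cutoff


# A US work from 1922 qualifies; a non-US work from 1900 does not.
print(in_sample_date_range(1922, published_in_us=True))
print(in_sample_date_range(1900, published_in_us=False))
```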

More information is available at HathiTrust Datasets.