SIAD proposal -- Discover Life

Building a bioinformatics infrastructure
to understand and monitor
species interactions and associations

Proposal to Atlas of Living Australia
Draft -- 3 June, 2010

Gerry Cassis, University of New South Wales
John Pickering, University of Georgia & Polistes Foundation

OVERVIEW

We propose to build a public domain, open source infrastructure that will enable web users to gather and share information on species interactions and associations. We envision a system that will include (1) a Schema and associated controlled vocabulary to manage such data, (2) mirrored Data warehouses of publicly accessible files, a searchable index, and associated metadata, and (3) Software tools to upload, download, and check the integrity of data records. To test and refine the system's functions, we propose (4) to populate it with Data sets that include information on Australian species. Thus, while we will design the infrastructure to meet the needs of the global biodiversity community at large, this project will serve the Atlas of Living Australia by showcasing species interactions and associations across Australia. We propose to have a draft schema and demonstration versions ready in time to gather and incorporate feedback from the global community at (5) Workshops at the 2010 and 2011 TDWG meetings. Additional considerations in this proposal include (6) assigning permanent, globally Unique identifiers to data records to link them to their sources and additional information, (7) resolving and standardizing Taxonomic names, (8) constructing lists of species for higher taxa in alternative Phylogenies, (9) the role of potential Partners, contributors, and end users, (10) Governance, and (11) Intellectual property and terms of use.

DETAILS

Schema
We propose to develop, test, and refine a data schema to structure information on species interactions and associations writ large. The schema will provide the foundation to link taxa, through a hierarchy of specified relationships, to other taxa, substrates, habitats, and other entities. We will develop a controlled vocabulary to assert biological and environmental relationships. We will test the schema with data sets that document plant-insect, host-parasite, and predator-prey relationships in Australia and its surrounding oceans. With an extended list of relationships, the schema could be used more broadly. For example, it could eventually describe the structure of data associating taxa with particular soil types, climate zones, or other biotic and abiotic factors.
The essence of the schema will be assertions documented by the following four fields:
- (a) a taxonomic name,
- (b) its relationship to another taxon or entity,
- (c) the name of the other taxon or entity, and
- (d) a reference to the source of the assertion.
Thus, for example, a record's assertion might be
species-a *** IS A HERBIVORE OF *** species-b *** http://...
or, at the specimen rather than species level,
specimen-1 *** WAS REARED FROM *** specimen-2 *** http://...
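
As a minimal sketch of how such a four-field assertion might be represented and parsed, the Python example below assumes one tab-delimited line per record. The names `Assertion` and `parse_assertion`, the field names, and the example.org URL are hypothetical illustrations and not part of the proposed schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    """One record holding the four core fields (a)-(d) described above."""
    subject: str       # (a) taxonomic name or specimen
    relationship: str  # (b) relationship to another taxon or entity
    object: str        # (c) name of the other taxon or entity
    source: str        # (d) reference to the source of the assertion


def parse_assertion(line: str) -> Assertion:
    """Parse one tab-delimited line (an assumed layout) into an Assertion."""
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}")
    if any(not f for f in fields):
        raise ValueError("empty field in assertion")
    return Assertion(*fields)


# Hypothetical record; the URL is a placeholder, not a real source.
record = parse_assertion(
    "species-a\tIS A HERBIVORE OF\tspecies-b\thttp://example.org/source/1"
)
print(record.relationship)
```

Keeping the record immutable (`frozen=True`) reflects the idea that an assertion, once tied to its source, should be replaced rather than silently altered.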

The schema will include additional fields to speed data management functions and improve quality control. Such fields will include information on the basis of assertions, such as whether they are based on preserved specimens, photographs, observations, or published material. They will include information and links specifying geographic locations, dates, life stages, strength of relationships, etc. We will build the schema to deal with complex biological life cycles, such as host-parasite systems that include both definitive and intermediate hosts.
Data warehouses
We propose to establish a network of data warehouses to gather, store, and serve data sets documenting species interactions through the proposed schema. Data sets will be stored in several text and zipped file formats that can be readily downloaded over the Internet by human users and automated programs using HTTP and SCP protocols. Initially, we propose to establish mirrored warehouses on ALA and Discover Life servers. Eventually, we hope that GBIF, EOL, and other potential partners will establish additional data warehouses.
Assuming that data providers will allow their information to be made publicly available to all users, we intend to make their information world readable. In short, users will be able to download individual records or a dump of one, several, or all of the data sets. Data will be updated by providers and then distributed to mirrors each night by automated scripts.
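
One way the nightly distribution might decide which files to copy to a mirror is by comparing checksums. The sketch below is an assumption for illustration only: the proposal does not specify a checksum method, and the function names `manifest` and `files_to_sync` are hypothetical. File contents are passed in memory to keep the example self-contained.

```python
import hashlib


def manifest(files: dict[str, bytes]) -> dict[str, str]:
    """Map each file name to the SHA-256 digest of its contents."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}


def files_to_sync(primary: dict[str, str], mirror: dict[str, str]) -> list[str]:
    """Names that are missing from the mirror or differ from the primary."""
    return sorted(name for name, digest in primary.items()
                  if mirror.get(name) != digest)


# Hypothetical file names and contents.
primary = manifest({"hemiptera.txt": b"v2", "readme.txt": b"schema docs"})
mirror = manifest({"hemiptera.txt": b"v1", "readme.txt": b"schema docs"})
print(files_to_sync(primary, mirror))
```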
We will establish security procedures to assure that only authorized data providers can upload and modify files in the warehouse. We will also establish a feedback mechanism that will enable any web user to send comments on individual records to data providers. This feedback mechanism will rely on the system maintaining a permanent, globally unique identifier for each record.
The data warehouses will store only information on species interactions and associations. They will not store locality information on rare and endangered species or other sensitive information that should not be made publicly available. We envision that access to such geographic information will be controlled by data providers such as GBIF or other modules within the ALA. Thus, the warehouses will provide information about species names, their interactions, and associates, but not all associated data, such as those contained in specimen-based Darwin Core records.
The warehouses will serve up-to-date documentation of the schema and of the proposed project.
Software tools
We propose to develop software tools and make them available to the servers managing the data warehouses. These tools will upload files from providers, check that the files and the records within them conform to the schema, report detected errors, and process raw data into various output formats that will be served from the warehouses. For maximum compatibility across the platforms and software packages that providers, users, and additional warehouses may have, we will make the programs and file formats as simple as possible. At the most basic level, we will enable end users to import data about species interactions rapidly into word processors, spreadsheets, and other software by downloading tab-delimited text files.
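
The conformance check described above could be sketched as follows. This is an illustration under stated assumptions, not the proposed tool: it assumes four tab-delimited fields per record, and the `RELATIONSHIPS` set stands in for a controlled vocabulary that has yet to be developed. The function name `check_records` is hypothetical.

```python
import csv
import io

# Hypothetical stand-in for the controlled vocabulary of relationships.
RELATIONSHIPS = {"IS A HERBIVORE OF", "WAS REARED FROM"}


def check_records(text: str) -> list[str]:
    """Return one error message per problem found in a tab-delimited data set."""
    errors = []
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for line_no, row in enumerate(reader, start=1):
        if len(row) != 4:
            errors.append(f"line {line_no}: expected 4 fields, got {len(row)}")
            continue
        subject, relationship, other, source = (f.strip() for f in row)
        if not subject or not other:
            errors.append(f"line {line_no}: missing name")
        if relationship not in RELATIONSHIPS:
            errors.append(f"line {line_no}: unknown relationship {relationship!r}")
        if not source:
            errors.append(f"line {line_no}: missing source reference")
    return errors


# Hypothetical upload with two faulty records (placeholder URLs).
sample = (
    "species-a\tIS A HERBIVORE OF\tspecies-b\thttp://example.org/1\n"
    "species-c\tEATS\tspecies-d\thttp://example.org/2\n"
    "specimen-1\tWAS REARED FROM\tspecimen-2\n"
)
for message in check_records(sample):
    print(message)
```

Reporting every problem with its line number, rather than stopping at the first, lets a provider correct a whole file in one pass.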
We envision building and distributing a core set of software tools that the warehouses will need to upload, manage, and share data. These tools will be written as shell scripts and Perl programs to work across Unix and Linux operating systems. They will not depend on any particular database such as MySQL or Oracle. Instead, we envision providing data files that can be easily imported into mapping software and other database applications.
Data sets
As a test of the system, we propose to import and structure several Australian insect-plant and host-parasite datasets. The Australian Faunal Directory has 36 catalogues that are potential candidates for testing the system. For example, assuming that we can obtain permission to use such data, we propose to import information on species of the order Hemiptera (Aphids, Bugs, Cicadas, Leafhoppers, Scale Insects) so as to understand their interactions with Australian plant species.
Workshops
We propose to hold workshops at the TDWG 2010 and 2011 meetings to review and modify drafts of our proposed schema and seek input from others working in systematics, ecology, evolutionary biology, and informatics.
Unique identifiers
Critical to the structure of the proposed system is the concept of permanent, unique identifiers that we will use to manage data records. We propose to work with Discover Life, GBIF and others in assigning and maintaining such identifiers for all records submitted. We intend to assign identifiers to records that are submitted without them.
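
A minimal sketch of assigning identifiers only to records submitted without them is given below. The identifier scheme is an assumption: the proposal does not name one, and random UUIDs are used here purely for illustration. The `id` key and the function name `assign_identifiers` are likewise hypothetical.

```python
import uuid


def assign_identifiers(records: list[dict]) -> list[dict]:
    """Return copies of the records, adding an 'id' only where one is missing."""
    result = []
    for record in records:
        updated = dict(record)  # copy, so the submitted record is left untouched
        if not updated.get("id"):
            # Assumed scheme: a random UUID expressed as a URN.
            updated["id"] = f"urn:uuid:{uuid.uuid4()}"
        result.append(updated)
    return result


# Hypothetical submissions: one with a provider-supplied identifier, one without.
submitted = [
    {"id": "provider-123", "subject": "species-a"},
    {"subject": "specimen-1"},
]
for record in assign_identifiers(submitted):
    print(record["id"])
```

Identifiers supplied by providers are preserved unchanged, so existing links to their sources keep working.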
Taxonomic names
We propose to interface the proposed system with other ALA modules and use standardized taxonomic names across all modules. Using web-based tools, we will compare all names in the proposed test data sets against names in ALA (Australian Plant Names Index, APNI; Australian Faunal Directory, AFD), GBIF, Catalogue of Life, Discover Life, and other sources of authoritative names. We will normalize the names in the species-interactions data and allow users to see both the source's original names and the interpreted normalized ones.
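
The idea of keeping the original name alongside its interpreted form can be sketched as below. This is an illustration only: the `ACCEPTED` table is a hypothetical stand-in for lookups against authoritative sources, the placeholder binomials are not real taxa, and the normalization rules (collapse whitespace, capitalize the genus, lowercase the epithet) are assumptions.

```python
# Hypothetical synonym table mapping a cleaned name to its accepted name.
ACCEPTED = {"Aus bus": "Aus cus"}


def normalize_name(original: str) -> dict:
    """Return the original name together with an interpreted, normalized one."""
    parts = original.split()  # also collapses runs of whitespace
    if parts:
        parts[0] = parts[0].capitalize()  # genus: initial capital only
    if len(parts) > 1:
        parts[1] = parts[1].lower()       # specific epithet: lower case
    cleaned = " ".join(parts)
    return {
        "original": original,                       # exactly as the source gave it
        "interpreted": ACCEPTED.get(cleaned, cleaned),
        "synonym_resolved": cleaned in ACCEPTED,
    }


print(normalize_name("  aus   BUS "))
```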
Phylogenies
So that users can search for species within higher taxa, in addition to searching for records on individual species, we propose to develop the means to list all the members of a specified higher taxon, such as a family or tribe. We intend to allow users to define alternative phylogenies by submitting customized lists of taxa that they want included in a search.
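
Listing the members of a higher taxon under alternative phylogenies might be sketched as follows. The representation is an assumption: each classification is a simple child-to-parent mapping, and the taxon names and the function `members` are hypothetical placeholders.

```python
def members(taxon: str, parent_of: dict[str, str]) -> list[str]:
    """List the terminal taxa (those with no children) found under `taxon`."""
    children: dict[str, list[str]] = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, []).append(child)
    found, stack, seen = [], [taxon], set()
    while stack:
        current = stack.pop()
        if current in seen:  # guard against accidental cycles in user input
            continue
        seen.add(current)
        kids = children.get(current, [])
        if not kids and current != taxon:
            found.append(current)
        stack.extend(kids)
    return sorted(found)


# Two hypothetical classifications that disagree on where species-c belongs.
classification_1 = {"species-a": "tribe-x", "species-b": "tribe-x",
                    "tribe-x": "family-1", "species-c": "family-1"}
classification_2 = {**classification_1, "species-c": "family-2"}

print(members("family-1", classification_1))
print(members("family-1", classification_2))
```

Because each classification is just a separate mapping, a user-submitted list can be searched without altering any other phylogeny.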
Partners
A component of the proposed project will be to build partnerships with data providers. Gerry Cassis will be responsible for identifying potential datasets for inclusion in the system and for negotiating terms for their use. While a large amount of geospatial data will ultimately be used in understanding species interactions, we will leave the management of such data to the mapping module of ALA, GBIF, Discover Life, and other potential users. Instead, we envision that the proposed system will provide the taxonomic glue for mapping applications, so that they know which species to map in order to understand their interactions and associations.
Governance
After we have developed the schema, we hope that an established organization such as TDWG or GBIF will adopt it and manage it after this project ends.
Intellectual property
The schema and new programs to manage species interactions and associations in the data warehouses will be placed in the public domain so that they can be maintained and improved by the informatics community. We may use existing code on Discover Life in importing data, management, and error checking. This code is not in the public domain and is owned by the Polistes Corporation, which retains exclusive copyright. Use of Discover Life's on-line tools is guaranteed to ALA in perpetuity at no cost, however. A cooperative agreement between the Polistes Foundation and the United States Geological Survey's National Biological Information Infrastructure (NBII) legally guarantees that if Polistes is unable to continue Discover Life's services, then the services will be transferred to a non-profit or government agency. Thus, the tools' functionality is legally guaranteed.

DELIVERABLES & WORK SCHEDULE

The following outlines what we propose to deliver to the ALA by quarter, assuming a start date of 1 July, 2010.

30 September, 2010
- Draft schema
- TDWG meeting
31 December, 2010
- Taxonomic Names
- Machines to UNSW
31 March, 2011
- Test dataset
30 June, 2011
- Unique identifiers
- 36 databases
30 September, 2011
- Schema, v2
- Data Warehouse
- Software tools: upgrade tools, search tools, download tools
31 December, 2011
- ALA interface
