#
Data Management and Access Plan
Below we describe how we will assemble, manage, analyze and disseminate information generated through the network.
An asterisk* after an individual's name indicates that there is a corresponding letter of collaboration.
-
Computer network and technical support
The cooperative agreement between the USGS and Polistes Foundation
will provide the computer infrastructure and technical support that we will need.
This infrastructure includes a network of powerful Sun Solaris and x86 servers
at the University of Georgia and an off-site location (Simpson*).
-
Long-term data preservation
Should Polistes not be able to maintain Discover Life in the future,
it is legally obligated by this cooperative agreement to transfer its tools and data to either another NGO or
to a governmental agency. In addition, we will make all specimen data widely available through
regular uploads to GBIF and other interested organizations.
-
Nightly backups
To ensure data integrity, Discover Life maintains at least three copies of all files
and keeps time-stamped copies of file updates and web transactions.
Each night a 'mirror' utility automatically copies new and edited files from
'original' servers to a 'production' server and an off-site archive.
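
As an illustration only, the nightly mirroring step could be sketched in Python as follows. This is not Discover Life's mirror utility: the function name `mirror`, the directory layout, and the rule "copy when the file is missing or the source is newer" are our assumptions.

```python
import shutil
from pathlib import Path


def mirror(original: Path, target: Path) -> list[Path]:
    """Copy files that are new or more recently modified in `original` into `target`.

    Returns the relative paths that were copied (hypothetical helper).
    """
    copied = []
    for src in sorted(original.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(original)
        dst = target / rel
        # Copy when the target is missing or the source was modified more recently.
        if not dst.exists() or src.stat().st_mtime > dst.stat().st_mtime:
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)  # copy2 also carries over the modification time
            copied.append(rel)
    return copied
```

Calling the function once per destination (for example, a production directory and an off-site archive directory) would yield the additional copies described above.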
-
Database structure
Discover Life does not use a database application, per se.
Instead it uses custom software to manage millions
of files that are stored and accessed via the operating system through Apache, mod_perl and other means.
All original files are archived, converted to formats that can be shared rapidly and securely via HTTP, HTTPS and SCP,
depending on users' requirements, and then indexed for rapid retrieval and analysis. Each night
automated programs re-index datasets that have changed and build a master index across all datasets.
While unconventional, this approach has major advantages over conventional databases: it is faster, requires
less storage space, and scales well in integrating very large datasets. The mapper, for example, plots 22,000
points per second and can display maps with over 3 million points.
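
To make the idea of indexing plain files concrete, the sketch below builds a per-dataset index of byte offsets for tab-delimited files and merges these into a master index. It is a minimal illustration under our own assumptions (one record per line, the lookup key in the first column, hypothetical function names) and does not describe Discover Life's custom software.

```python
from pathlib import Path


def build_index(path: Path, key_column: int = 0) -> dict[str, list[int]]:
    """Map each key to the byte offsets of the lines on which it occurs."""
    index: dict[str, list[int]] = {}
    offset = 0
    with path.open("rb") as fh:
        for line in fh:
            key = line.rstrip(b"\r\n").split(b"\t")[key_column].decode("utf-8")
            index.setdefault(key, []).append(offset)
            offset += len(line)
    return index


def fetch(path: Path, offsets: list[int]) -> list[list[str]]:
    """Read only the indexed lines by seeking to their byte offsets."""
    rows = []
    with path.open("rb") as fh:
        for off in offsets:
            fh.seek(off)
            rows.append(fh.readline().decode("utf-8").rstrip("\r\n").split("\t"))
    return rows


def build_master_index(paths: list[Path]) -> dict[str, list[tuple[str, list[int]]]]:
    """Combine per-dataset indexes into one lookup across all datasets."""
    master: dict[str, list[tuple[str, list[int]]]] = {}
    for path in paths:
        for key, offsets in build_index(path).items():
            master.setdefault(key, []).append((path.name, offsets))
    return master
```

A nightly job could call `build_index` only for files whose modification time has changed and then rebuild the master index from the stored per-dataset results.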
-
Data import
Over the past two decades, the PI has developed tools to support the import
of many file formats, data schemas, and uncontrolled vocabularies.
Rather than requiring contributors to submit their information
in a standard format such as Darwin Core, we encourage them to upload the data in their existing formats. We use
a variety of tools to convert contributions and integrate them across datasets.
Beyond Darwin Core, we will integrate data in non-standard
formats from spreadsheets and databases. For example, in addition to the
decimal latitude and longitude degrees required by Darwin Core, Discover Life supports
over 10 other ways of documenting latitude-longitude and UTM coordinates.
The software assigns permanent, globally unique identifiers to images and records as they are uploaded and
processed. These unique identifiers reflect contributors' specimen identifiers wherever possible.
They facilitate user feedback and error correction.
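
The following Python sketch illustrates the kind of coordinate normalization involved. It handles only three notations (signed decimal degrees, degrees with decimal minutes, and degrees-minutes-seconds with a hemisphere letter) and omits UTM. The function name and parsing rules are our assumptions for illustration, not Discover Life's import code.

```python
import re


def to_decimal_degrees(text: str) -> float:
    """Convert one latitude or longitude string to signed decimal degrees."""
    # A hemisphere letter must stand alone, so words such as "degrees" are ignored.
    hemi = re.search(r"(?<![A-Za-z])[NSEWnsew](?![A-Za-z])", text)
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]
    if not 1 <= len(numbers) <= 3:
        raise ValueError(f"cannot parse coordinate: {text!r}")
    # Pad missing minutes and seconds with zeros.
    degrees, minutes, seconds = numbers + [0.0] * (3 - len(numbers))
    if minutes >= 60 or seconds >= 60:
        raise ValueError(f"minutes or seconds out of range: {text!r}")
    value = degrees + minutes / 60 + seconds / 3600
    # South and west, or a leading minus sign, give negative values.
    negative = text.strip().startswith("-") or (
        hemi is not None and hemi.group().upper() in "SW"
    )
    return -value if negative else value
```

For example, `to_decimal_degrees("83°22'30\"W")` and `to_decimal_degrees("-83.375")` both yield a value of about -83.375.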
-
Data export
Discover Life's existing tools will enable the project to export data in different formats,
including tab-delimited text files, XML, and HTML. We will serve images as JPEGs at 5 resolutions,
sound recordings as MP3 files, and videos at 3 resolutions in both Flash and QuickTime formats.
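
As a minimal sketch of two of these export formats, the Python functions below write the same records as tab-delimited text and as XML using only the standard library. The field names (`id`, `name`, `lat`) and the element names `records` and `record` are hypothetical and do not reflect Discover Life's actual export schema.

```python
import csv
import io
import xml.etree.ElementTree as ET


def to_tsv(records: list[dict], fields: list[str]) -> str:
    """Serialize records as tab-delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fields,
        delimiter="\t",
        lineterminator="\n",
        extrasaction="ignore",  # silently skip keys that are not exported
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_xml(records: list[dict], fields: list[str]) -> str:
    """Serialize records as a simple XML document, one element per field."""
    root = ET.Element("records")
    for record in records:
        node = ET.SubElement(root, "record")
        for field in fields:
            ET.SubElement(node, field).text = str(record.get(field, ""))
    return ET.tostring(root, encoding="unicode")
```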
-
Software tools
Discover Life's integrated tools enable users to upload, manage, analyze and download information.
The tools include online albums of images and associated data (20p), a global mapper (20m),
and identification guides and checklists (20q). These three tools work together so that
records in individual albums contribute to maps and checklists, which customize identification
guides by location, which in turn simplify the determination of species in the albums.
-
Web services
The 20p, 20m and 20q software can be called as services by other websites and software programs.
When customized, these services return deep links to other websites. For example, Brown* and Hogue*
call 20m with HTML iframe tags to map points that their Hover Fly Survey of LA County project has uploaded to Discover Life.
-
Documentation
We document web services and other features of Discover Life at www.discoverlife.org/help.
We include information on metadata, ownership, terms of use, and credits with pages,
images and individual records as appropriate or required by contributors.
-
Automated data processing
We will run automated programs using 'cron' to process information, update indices, output reports, and ready
files for download by other websites and individual users.
-
Error checking and correction
We will use automated programs, gazetteers, taxonomic authority lists, natural language processors and
feedback from users to detect and correct errors.
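
One simple form of such a check is to compare submitted names against a taxonomic authority list and suggest the closest match for likely misspellings. The sketch below uses approximate string matching from Python's standard `difflib` module. The function name, the status labels, and the 0.85 similarity cutoff are our assumptions for illustration.

```python
import difflib


def check_name(name: str, authority: list[str], cutoff: float = 0.85):
    """Classify a submitted name as 'ok', 'suggest' (with a correction), or 'unknown'."""
    cleaned = " ".join(name.split())  # collapse stray whitespace
    if cleaned in authority:
        return ("ok", cleaned)
    # Closest authority name whose similarity ratio is at least the cutoff.
    matches = difflib.get_close_matches(cleaned, authority, n=1, cutoff=cutoff)
    if matches:
        return ("suggest", matches[0])
    return ("unknown", None)
```

Names flagged as 'suggest' or 'unknown' would go to a person for review rather than being corrected automatically.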
-
Security
All datasets on Discover Life require a password or specific IP address to change them.
As required by contributors, we restrict access to certain information or to the means by which it is presented.
While we will make the data collected by our high school and undergraduate teams globally available,
we will continue to abide by the terms of use of other contributors' information.
General web users will be able to contribute images and associated data.
However, their contributions will be restricted to 'purgatory' and not visible
until reviewed by project staff to make sure that they are appropriate for the site.
-
Restricted access to sensitive data
In certain cases, such as with the exact location of rare and endangered species,
we will not make all data publicly available but rather restrict them to researchers and land managers
on a need-to-know basis.
-
Terms of use
We will make the data collected by the students, in both raw and summarized forms, available
on the condition that users credit contributors and project funders,
in a manner similar to that used by GBIF. We will expect authors of published papers to cite this
NSF project and the USGS for their support.
-
DATA COLLECTION (see Project Management Plan15)
The undergraduate and high school teams will upload images and other data to their personal albums.
They will document information such as where, when and how data were collected.
-
IDENTIFICATION (see Project Management Plan16)
As described in Species Identification (4.1.2.2), we will identify the species associated with records,
or, when that is impossible, the genus or a higher taxon. For fungi that are not in our target 100 species,
we will make images and associated data available to the Mushroom Observer so that their members
can identify additional taxa (Hollinger*, N. Wilson*).
-
DATA INTEGRATION (see Project Management Plan17)
In addition to the photographs and other data that our teams will collect, we propose to
make NOAA, NASA, NEON (Gram*), GBIF, ITIS and other data available through Discover Life as permitted by
the terms of use of such contributors. We will integrate such data with our field data
to facilitate answering our research questions and also make them available in various forms for investigators
not directly associated with the project.
-
ANALYSIS (see Project Management Plan18)
Our Science and Analysis Sub-committee includes the PI, Co-PI LeBuhn, Hargrove*, Hubbell*, Kjar* and others.
These individuals, working with colleagues and students,
will analyze our contemporary biological data in conjunction
with abiotic data and historical biological information from collections, field sites and the literature.
Hargrove and the PI propose to fit day-degree models to the flight periods of moths and flowering
periods of plants to evaluate how the phenology of different species responds to temperature patterns.
Their modeling will complement the analyses they plan with MODIS satellite seasonal data around NOAA weather stations.
They will extend Hochberg, Pickering and Getz's (1986) methods to evaluate phenology models using field data.
They will use Monte Carlo methods to conduct a sensitivity analysis of their models' predictions.
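
To illustrate the general technique, the Python sketch below accumulates day-degrees above a base temperature until a threshold is reached, then repeats the prediction with randomly drawn parameters as a simple Monte Carlo sensitivity analysis. The uniform parameter ranges, function names, and use of daily mean temperature are our assumptions; this is not the model that Hargrove and the PI will fit.

```python
import random


def predicted_day(daily_mean_temps, base_temp, threshold):
    """Return the first day (1-based) on which accumulated day-degrees reach the threshold."""
    total = 0.0
    for day, temp in enumerate(daily_mean_temps, start=1):
        total += max(0.0, temp - base_temp)  # only warmth above the base counts
        if total >= threshold:
            return day
    return None  # threshold never reached in this temperature series


def monte_carlo_days(daily_mean_temps, base_range, threshold_range, draws=1000, seed=0):
    """Predict the event day under randomly drawn base temperatures and thresholds."""
    rng = random.Random(seed)  # fixed seed for reproducible draws
    days = []
    for _ in range(draws):
        base = rng.uniform(*base_range)
        threshold = rng.uniform(*threshold_range)
        day = predicted_day(daily_mean_temps, base, threshold)
        if day is not None:
            days.append(day)
    return days
```

The spread of the returned days (for instance, their range or standard deviation) indicates how sensitive the predicted flight or flowering date is to uncertainty in the two parameters.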
Hubbell, Hargrove, the PI and their colleagues will analyze alpha and beta diversity
at our field sites, both within and across NEON domains.
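
As a minimal example of these quantities, the sketch below computes two common alpha diversity measures per site (species richness and the Shannon index) and Whittaker's beta diversity across sites (total richness divided by mean site richness). The proposal does not commit to particular indices, so the choice of measures here is an assumption made for illustration.

```python
import math


def richness(counts: dict[str, int]) -> int:
    """Alpha diversity as the number of species observed at a site."""
    return sum(1 for n in counts.values() if n > 0)


def shannon(counts: dict[str, int]) -> float:
    """Alpha diversity as the Shannon index H' = -sum(p * ln p)."""
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values() if n > 0)


def whittaker_beta(sites: dict[str, dict[str, int]]) -> float:
    """Beta diversity as gamma richness divided by mean alpha richness."""
    observed = set()
    for counts in sites.values():
        observed.update(sp for sp, n in counts.items() if n > 0)
    mean_alpha = sum(richness(c) for c in sites.values()) / len(sites)
    return len(observed) / mean_alpha
```

Grouping sites by NEON domain before calling `whittaker_beta` would give within-domain values, while passing all sites together would give the across-domain value.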
LeBuhn and her student will analyze how temperature affects plant-pollinator interactions and
seed set. Co-PI Stephenson will work with our modelers to analyze the impact of abiotic factors
on the growth rates of slime molds, fungi, and lichens.
Our analysis will include a hierarchical model framework that will allow us to separate the data acquisition model
from the ecological process model. Our initial approach will be to define an observation process model
(relating to data acquisition) and a process model (relating to the question being addressed).
Our simplest models will examine the occurrence of species and their phenology.
They will incorporate recent statistical methods to account for imperfect detection.
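
One standard way to write such a two-level model is the site-occupancy formulation sketched below. We give it only as an example of the structure described above; the symbols are our notation, and the final model may differ. Let $z_i$ indicate whether the species is truly present at site $i$, $\psi_i$ the probability of presence, $y_{ij}$ whether it was detected on visit $j$, and $p_{ij}$ the probability of detection given presence.

```latex
\begin{align}
  z_i &\sim \mathrm{Bernoulli}(\psi_i)
      && \text{(process model: occurrence at site } i\text{)} \\
  y_{ij} \mid z_i &\sim \mathrm{Bernoulli}(z_i \, p_{ij})
      && \text{(observation model: detection on visit } j\text{)}
\end{align}
```

With constant $\psi$ and $p$ and $J$ visits per site, the likelihood contribution of site $i$ is

```latex
\begin{equation}
  L_i(\psi, p) = \psi \prod_{j=1}^{J} p^{\,y_{ij}} (1 - p)^{1 - y_{ij}}
               + (1 - \psi)\, \mathbb{1}\!\left[\sum_{j=1}^{J} y_{ij} = 0\right]
\end{equation}
```

The second term shows how imperfect detection is handled: a site with no detections may be either unoccupied or occupied but missed on every visit.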
-
DISSEMINATION (see Project Management Plan19)
Through Discover Life we will make raw data and summary reports readily available to web users
as we process and update them nightly. We will work with partner websites and other projects to make
the information available through other outlets (NPN, Crimmins*; NEON, Gram*;
USDA Forest Service, Hargrove*; Mushroom Observer, Hollinger*; EOL, Parr* and N. Wilson*;
Floral Report Card, Schwarz*).
-
Physical samples and collections
Although the vast majority of the data that we propose to collect will be digital,
we envision collecting some physical samples to supplement this information for species and groups
that cannot be identified by digital means alone. In these cases, specimens will be deposited
in public collections.
-
Integrating new technology and software tools
We will enable other researchers and web developers to extend our existing
web services and integrate new statistical, graphical and other functions
to process and disseminate the data.