Abstract:
This paper documents a metadata schema, implementation, and associated vocabularies developed for the Internet of Samples (iSamples) project to integrate geoscience, archaeology/anthropology, biology, and genomics sample descriptions in a single cross-domain catalog. To develop a sample description scheme that supports sample discovery across these disparate domains, we reviewed the metadata schema and example metadata from each project partner, as well as other existing schemes. Top-level classes in the schema include MaterialSampleRecord, Curation, SamplingEvent, SamplingSite, and Agent. By factoring sample type classification into material type, material sample object type, and sampled feature type, it has been possible to classify the approximately 6,000,000 samples in the combined corpus. Category vocabularies for these classifications were developed from unique-value summaries of related fields in the source sample metadata, and were tested with a card-sorting exercise and by developing code for automated mapping from the source metadata. Each vocabulary has on the order of 20 categories with some hierarchy; the category concepts are intended to cover the full range of samples, but may overlap. These vocabularies are implemented in SKOS and published through the ARDC Research Vocabularies Australia (RVA) vocabulary service. The metadata schema is defined in a LinkML YAML file and implemented as a JSON Schema used to validate instance documents. To support interoperability, mappings from the iSamples metadata schema to several other schemes are provided in the project GitHub repository.
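
As a rough illustration of the structure summarized above, the following Python sketch models the five top-level classes and the three classification facets. Only the class names are taken from the schema; every field name, the way classes link to one another, and all category labels are hypothetical placeholders rather than the published iSamples definitions. The sketch also assumes that each facet can hold more than one category, since the categories may overlap.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

# Hypothetical category labels for each facet; the published SKOS
# vocabularies define the real terms (on the order of 20 per facet).
VOCABULARIES: Dict[str, Set[str]] = {
    "material_type": {"rock", "organic material", "water"},
    "sample_object_type": {"whole organism", "artifact", "core section"},
    "sampled_feature_type": {"earth interior", "past human occupation site"},
}


@dataclass
class Agent:
    name: str
    role: Optional[str] = None


@dataclass
class SamplingSite:
    label: str
    latitude: Optional[float] = None
    longitude: Optional[float] = None


@dataclass
class SamplingEvent:
    site: SamplingSite
    result_time: Optional[str] = None
    responsibility: List[Agent] = field(default_factory=list)


@dataclass
class Curation:
    location: Optional[str] = None
    responsibility: List[Agent] = field(default_factory=list)


@dataclass
class MaterialSampleRecord:
    sample_identifier: str
    label: str
    # The three classification facets (field names are assumptions).
    material_type: List[str]
    sample_object_type: List[str]
    sampled_feature_type: List[str]
    produced_by: Optional[SamplingEvent] = None
    curation: Optional[Curation] = None


def unknown_categories(record: MaterialSampleRecord) -> List[str]:
    """Return 'facet: value' strings for values missing from the vocabularies."""
    problems = []
    for facet, allowed in VOCABULARIES.items():
        for value in getattr(record, facet):
            if value not in allowed:
                problems.append(f"{facet}: {value}")
    return problems


# Example record with made-up identifier and values.
record = MaterialSampleRecord(
    sample_identifier="example:0001",
    label="Basalt hand sample",
    material_type=["rock"],
    sample_object_type=["core section"],
    sampled_feature_type=["earth interior"],
    produced_by=SamplingEvent(site=SamplingSite(label="Example outcrop")),
)
```

Calling `unknown_categories(record)` on the example returns an empty list; a record carrying a value outside the placeholder vocabularies would be reported by facet. In the project itself this kind of check is performed by validating instance documents against the JSON Schema, not by hand-written code like this.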