Extracting and re-using research data from chemistry e-theses: the SPECTRa-T Project






Objective: Scientific e-theses are data-rich resources, but much of the information they contain is not readily accessible. For chemistry, the SPECTRa-T project [1] has addressed this problem by developing data-mining techniques that extract experimental data and express it as RDF (Resource Description Framework) triples, which can then be exposed to sophisticated Semantic Web searches.

Methods: We used Oscar3 [2], an Open Source chemistry text-mining tool, to parse and extract data from theses in PDF and from theses in the Office Open XML document format (.docx).

Results: Theses in PDF suffered data corruption and a loss of formatting that prevented the identification of chemical objects. Theses in .docx yielded semantically rich SciXML that enabled the additional extraction of associated data. Chemical objects were placed in a data repository, and RDF triples were deposited in a triplestore.

Conclusions: Data-mining from chemistry e-theses is both desirable and feasible, but the use of PDF, the de facto format standard for deposit in most repositories, prevents the optimal extraction of data for semantic querying. To facilitate such extraction, we recommend that universities also require deposit of chemistry e-theses in an XML document format. Further work is required to clarify the complex IPR issues and to ensure that they do not become an unwarranted barrier to data extraction and re-use.

References:
1. SPECTRa-T (Submission, Preservation and Exposure of Chemistry Teaching and Research Data from Theses): http://www.lib.cam.ac.uk/spectra-t/
2. Oscar3: Corbett P, Murray-Rust P: "High-Throughput Identification of Chemistry in Life Science Texts". Computational Life Sciences II, pp. 107-118 (Berlin: Springer, 2006)