Dissertations as Data

Content

Metadata

Title

Dissertations as Data

Description

Problem/goalThe paper provides an overview and empirical evidence on the usability of electronic theses and dissertations (ETDs) and related research data for text and data mining (TDM) techniques.Research method/procedureThe first part of the paper is a review of recent publications and projects on the potential and usefulness of ETDs for TDM, followed by a description of our own research projects in the field.Anticipated resultsUsually, research studies on dissertations and data address the handling and potential exploitation of dissertations as a “data vehicle”, where data are published together with the dissertation (e.g. as a kind of data appendix), or as a “gateway to data”, where the data are not published together with the text but are available on a distant server. Yet, often the data are not available; or data, methodology, tools, primary sources are mingled, not indexed, badly described, and unrelated with the text, unconnected with other files.Our paper will describe a different approach that may be helpful to cope with this problem, in particular (but not only) when it is impossible to distinguish between data and dissertations and thus to process the data appropriately (data repository etc.). Our approach is to consider the dissertation as a whole (text, metadata, data, numbers, facts, figures etc.), as “material” potentially exploitable by TDM tools (including natural language processing) designed for unstructured information, i.e. lacking a pre-defined data model or not organized in a pre-defined manner.These tools and techniques may be helpful to find patterns or other useful information but usually involve some kind of structuring the documents, e.g. through manual tagging with metadata. A quite different condition is the legal feasibility. While in some countries TDM for scientific purpose does not require copyright clearance because copyright exceptions recognize that it is legal to extract content for data analytics, in other countries like in France copyright-based legal barriers to TDM are still waiting for removal.Our paper will address these issues, in a general way but also with regards to recent research on content mining of UK dissertations in law and chemistry, to automatic processing of PhD metadata for innovation search and identification of scientific skills and to our own research projects on TDM of unstructured information in the fields of cultural and industrial heritage, geographical data and academic publishing. In particular, we will draw on preliminary results of our interdisciplinary research project TERRE-ISTEX (2016-2018) that will retrieve, organize and make accessible knowledge related to geographical territories from heterogeneous digital academic resources available on the ISTEX platform and in dissertations.Also, we will address the issue of retro-digitisation of older print dissertations and related material in order to make them usable for automatic content mining and to valorise these often hidden treasures of academic heritage.Practical implications/originalityThe paper will provide an up-date on an emerging and promising field of research and development. Our results will be useful for academic libraries and repositories, for the conception and creation of added value services for their ETDs.\n

Date

2016