Migrating from PDF to XML based ETDs: a report on a work in progress at the University of Maine






Abstract

What advantage does an institution gain by moving from PDF (Adobe Acrobat's Portable Document Format) to XML (the Extensible Markup Language) as the standard format for ETDs? Adopting XML may bring no immediate benefit, but over time it may offer the distinct advantage of making scholarly collections more visible and more accessible to a broader community of students and scholars. This paper reports on work in progress at the University of Maine Library to migrate its collection of PDF-formatted ETDs to an XML-based system using the ETD Document Type Definition (DTD). Although the library lacks broad campus buy-in for its ETD project, and only two departments require electronic as well as print submission, the library is using its ETD database of metadata with links to PDF document files (32 full-text documents) as one of several document collections that will be implemented under an XML-based software architecture. This change will advance the library's overarching digital initiative because of XML's powerful and flexible structuring capabilities, its ability to bridge different systems, and its capacity to capture and organize metadata. XML is a richer, more robust markup language that offers an effective means of making the intellectual content of collections available.

A brief overview of current practice frames questions about the need to use XML. It is followed by a more detailed look at aspects of XML that affect system design: DTDs and XML Schemas, database storage issues in moving from a file system to a relational database, and indexing issues using SQL Server 2000. The report concludes with a discussion of efforts to locate tools to convert existing PDF files into XML (software that allows direct, automated conversion of source files into XML-tagged output for chapters, sections, paragraphs, figures, and illustrations), and with thoughts on the impact that the change to an XML-based system has on current staffing and workflow.
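
To make the idea of XML-tagged output for chapters and paragraphs concrete, the following is a minimal sketch in Python using the standard library's xml.etree.ElementTree. The element names (etd, front, body, chapter, title, author, p), the attribute n, the helper build_etd, and the sample values are all hypothetical illustrations. They are not taken from the ETD DTD and do not represent the library's conversion tooling.

```python
import xml.etree.ElementTree as ET


def build_etd(title, author, chapters):
    """Build a small XML tree for a thesis.

    chapters is a list of (chapter_title, [paragraph_text, ...]) pairs.
    All element names here are hypothetical, not those of the ETD DTD.
    """
    etd = ET.Element("etd")

    # Descriptive metadata lives in a front-matter element.
    front = ET.SubElement(etd, "front")
    ET.SubElement(front, "title").text = title
    ET.SubElement(front, "author").text = author

    # Each chapter becomes a numbered element holding its paragraphs.
    body = ET.SubElement(etd, "body")
    for number, (chapter_title, paragraphs) in enumerate(chapters, start=1):
        chapter = ET.SubElement(body, "chapter", {"n": str(number)})
        ET.SubElement(chapter, "title").text = chapter_title
        for text in paragraphs:
            ET.SubElement(chapter, "p").text = text
    return etd


# Invented sample data, for illustration only.
doc = build_etd(
    "A Sample Thesis",
    "A. Student",
    [
        ("Introduction", ["First paragraph.", "Second paragraph."]),
        ("Methods", ["Only paragraph."]),
    ],
)

# Serialize the tree to a string of XML markup.
xml_text = ET.tostring(doc, encoding="unicode")
print(xml_text)
```

Because structure and metadata are explicit in the markup, a chapter title or an author name can be retrieved by path (for example, doc.find("./body/chapter[@n='2']/title")) rather than by searching page-oriented text, which is the kind of access a PDF file does not readily offer.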