Case Study of Analyzing the Variety of ETD Layouts


Automatically and accurately segmenting electronic theses and dissertations (ETDs) is challenging because their layouts vary across majors, disciplines, and universities. To segment ETDs and determine their chapter boundaries automatically, we need to understand how document templates vary across disciplines and universities. In this study, we performed a case study and a manual quantitative analysis of the variation in ETD layouts. We considered several factors likely to affect layout, including STEM/non-STEM field, university, department, major, and year of publication. We found that layouts tend to be similar within a university, with slight variation among departments, but vary significantly across universities. This is likely because each university library or graduate school typically provides its own ETD template. Our analysis of the numbering style of chapter and section headings shows that STEM fields (specifically physics) prefer style 3, whereas non-STEM areas, such as education and English, prefer style 1. We then performed a chi-square (χ²) test of independence to determine whether the numbering style depends on whether a field is STEM or non-STEM. The test yields a p-value below 0.001, indicating a statistically significant association between numbering style and STEM/non-STEM area. The findings of this study can inform further research in document object extraction and natural language processing for machine reading.
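As an illustration of the chi-square test of independence mentioned above, the following minimal sketch computes the statistic, degrees of freedom, and p-value for a contingency table of field type (STEM, non-STEM) by numbering style. The counts are hypothetical placeholders chosen only for demonstration and are not the data collected in this study; the function names `chi2_survival` and `chi_square_independence` are likewise illustrative. The sketch assumes every row and column total is positive.

```python
import math


def chi2_survival(x, dof):
    """Upper-tail probability P(X >= x) for a chi-square variable with integer dof."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if dof % 2 == 0:
        # Even dof: exp(-x/2) times a finite sum of (x/2)^i / i!.
        term, total = 1.0, 1.0
        for i in range(1, dof // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd dof: erfc term plus a finite sum over half-integer powers.
    total = 0.0
    for i in range(1, (dof - 1) // 2 + 1):
        total += half ** (i - 0.5) / math.gamma(i + 0.5)
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total


def chi_square_independence(observed):
    """Pearson chi-square test of independence on a table of counts.

    Returns (statistic, degrees_of_freedom, p_value).
    Assumes all row and column totals are positive.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected

    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof, chi2_survival(statistic, dof)


# Hypothetical counts (NOT the study's data).
# Rows: STEM, non-STEM. Columns: numbering style 1, style 2, style 3.
observed = [
    [5, 10, 35],
    [30, 12, 8],
]

stat, dof, p = chi_square_independence(observed)
print(f"chi2 = {stat:.2f}, dof = {dof}, p = {p:.2e}")
```

A p-value below the chosen significance level (for example 0.05) would lead to rejecting the hypothesis that numbering style is independent of field type. In practice, a statistics library routine could replace the hand-written tail probability used here.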