Information extraction from semistructured data such as HTML documents is gaining importance with the continuing growth of data stored on the Web. This paper proposes a structure-based method for extracting Web contents and their metadata from a set of HTML documents generated from a common template, as seen in university syllabus and staff pages. These HTML documents contain a number of HTML syntax errors, as well as redundant or missing fragments introduced by manual editing. The method first finds a canonical HTML document that complies with the common template. Next, an approximate matching algorithm identifies the correspondences between the data in the canonical document and the data in the other documents, and the data are aligned according to these correspondences. Experiments were conducted on university syllabus Web pages to extract attribute names for metadata construction and to align data records.
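The two steps described above can be illustrated with a minimal Python sketch. It is not the algorithm used in this paper; several parts are assumptions made only for illustration. Each document is reduced to a sequence of start-tag names plus a `#text` placeholder, the canonical document is taken to be the one with the highest total similarity to the others, and the approximate matching is delegated to the standard library's `difflib.SequenceMatcher`. The names `TokenCollector`, `tokenize`, `pick_canonical`, and `aligned_texts` are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Reduces a page to structural tokens; end tags are ignored so that
    missing or misplaced closing tags do not change the token sequence."""

    def __init__(self):
        super().__init__()
        self.tokens = []  # tag names, or "#text" for a non-empty text node
        self.texts = []   # text content parallel to tokens (None for tags)

    def handle_starttag(self, tag, attrs):
        self.tokens.append(tag)
        self.texts.append(None)

    def handle_data(self, data):
        if data.strip():
            self.tokens.append("#text")
            self.texts.append(data.strip())


def tokenize(html):
    """Return (tokens, texts) for one HTML document."""
    collector = TokenCollector()
    collector.feed(html)
    collector.close()
    return collector.tokens, collector.texts


def similarity(a, b):
    """Similarity of two token sequences in [0, 1]."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def pick_canonical(token_seqs):
    """Assumed criterion: the document most similar to all the others."""
    def total(i):
        return sum(similarity(token_seqs[i], seq)
                   for j, seq in enumerate(token_seqs) if j != i)
    return max(range(len(token_seqs)), key=total)


def aligned_texts(canonical_html, other_html):
    """Pair up the text nodes that occupy corresponding positions."""
    c_tokens, c_texts = tokenize(canonical_html)
    o_tokens, o_texts = tokenize(other_html)
    matcher = SequenceMatcher(None, c_tokens, o_tokens, autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            if c_tokens[block.a + k] == "#text":
                pairs.append((c_texts[block.a + k], o_texts[block.b + k]))
    return pairs


# Toy pages: the second has a redundant <br>, the third lacks closing tags.
docs = [
    "<table><tr><th>Course</th><td>Algebra</td></tr>"
    "<tr><th>Teacher</th><td>Sato</td></tr></table>",
    "<table><tr><th>Course</th><td>Logic</td></tr><br>"
    "<tr><th>Teacher</th><td>Ito</td></tr></table>",
    "<table><tr><th>Course</th><td>Physics</td>"
    "<tr><th>Teacher</th><td>Mori</td></table>",
]

canonical = pick_canonical([tokenize(d)[0] for d in docs])
pairs = aligned_texts(docs[canonical], docs[1])
# Heuristic (an assumption): text that is identical in both documents is
# treated as an attribute name, and text that differs as a data value.
attribute_names = [c for c, o in pairs if c == o]
```

In this sketch the redundant `<br>` and the missing closing tags do not disturb the alignment, because matching operates on the token sequence rather than on a strict parse tree. A tree-based matcher would be a natural alternative when nesting depth matters.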