Extracting structured data from web pages with maximum entropy segmental Markov models

dc.contributor.committeeChair: Mengel, Susan A.
dc.contributor.committeeMember: Shin, Michael
dc.contributor.committeeMember: Watson, Richard
dc.creator: Jing, Yaoqin
dc.date.available: 2011-02-18T23:13:09Z
dc.date.issued: 2007-12
dc.degree.department: Computer Science (en_US)
dc.description.abstract: Conventional ways of retrieving information from web pages are time-consuming. A possible solution is to integrate useful data from across the Internet under uniform schemas so that people can easily access and query the data with relational database techniques. Many approaches have been proposed to solve this problem. Based on the degree of user involvement, these approaches can be classified into three categories: manual, semi-automatic, and automatic. This dissertation proposes a novel semi-automatic approach, based on the maximum entropy segmental Markov model, to extract structured data from web pages. The main purpose of this approach is to overcome two shortcomings of current semi-automatic approaches: they require many training web pages, and the models (or templates) they learn are too general or too specific. This approach decreases the number of training web pages needed by modeling the sequences that embed structured data instead of their context. In addition, these sequences are modeled with segmental Markov models, each state of which corresponds to a subsequence embedding one data item. Finally, the maximum entropy principle is applied to learn the transition distributions, which prevents the training data from producing models that are too general or too specific. This approach can therefore reduce the labor users spend preparing training data while maintaining good performance. Experimental results on thirty web sites show that this approach performs better than Stalker, a semi-automatic approach known for good performance, when only one training web page is provided.
dc.format.mimetype: application/pdf
dc.identifier.uri: http://hdl.handle.net/2346/19617 (en_US)
dc.language.iso: eng
dc.publisher: Texas Tech University (en_US)
dc.rights.availability: Unrestricted.
dc.subject: Data extract (en_US)
dc.subject: Maximum entropy principle (en_US)
dc.subject: Markov model (en_US)
dc.title: Extracting structured data from web pages with maximum entropy segmental Markov models
dc.type: Dissertation
thesis.degree.department: Computer Science
thesis.degree.discipline: Computer Science
thesis.degree.grantor: Texas Tech University
thesis.degree.level: Doctoral
thesis.degree.name: Ph.D.

Files

Original bundle

Name: Jing_Yaoqin_Diss.pdf
Size: 1.77 MB
Format: Adobe Portable Document Format