Skip to main navigation Skip to search Skip to main content

Exploiting structured reference data for unsupervised text segmentation with conditional random fields

  • Stony Brook University

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

19 Scopus citations

Abstract

Text segmentation is the process of converting information in unstructured text into structured records. This is an important problem since structured data is amenable to efficient query processing. CRPs are a class of discriminative probabilistic models that are gaining acceptance as an effective computing machinery for text segmentation. An important aspect of CRFs is learning model parameters from labeled training data. Labeling can be a labor intensive process. One can avoid the labeling step by using structured reference tables whose data domains and that of the input text data given for segmentation, coincide. In other words the labels in the training data drawn from reference tables "come for free". Inspired by recent work on their use for training HMMs, we developed an unsupervised technique for text segmentation with CRFs using reference tables. Assuming text sequences to be segmented come in batches and sequences in a batch conform to the same attribute order, we build CRF models for each attribute in the reference table, use them to decide the attribute order of a batch of input sequences, derive labeled training data from the reference table according to that order, and train a global CRF model to segment the input sequences in the batch. Preliminary experimental results indicate that our technique works well in practice.

Original languageEnglish
Title of host publicationSociety for Industrial and Applied Mathematics - 8th SIAM International Conference on Data Mining 2008, Proceedings in Applied Mathematics 130
Pages420-431
Number of pages12
StatePublished - 2008
Event8th SIAM International Conference on Data Mining 2008, Applied Mathematics 130 - Atlanta, GA, United States
Duration: Apr 24 2008Apr 26 2008

Publication series

NameSociety for Industrial and Applied Mathematics - 8th SIAM International Conference on Data Mining 2008, Proceedings in Applied Mathematics 130
Volume1

Conference

Conference8th SIAM International Conference on Data Mining 2008, Applied Mathematics 130
Country/TerritoryUnited States
CityAtlanta, GA
Period04/24/0804/26/08

Fingerprint

Dive into the research topics of 'Exploiting structured reference data for unsupervised text segmentation with conditional random fields'. Together they form a unique fingerprint.

Cite this