Max-margin learning for structured outputs: Applications to NLP Sequential Segmenting Tasks

Master's thesis offer

General information

Thesis title: Max-margin learning for structured outputs: Applications to NLP Sequential Segmenting Tasks

Expiry date: 16/10/2009

Orientation: research

Thesis supervisor's department: LSI, UPC

Thesis supervisor: Lluís Màrquez

E-mail: lluism@lsi.upc.edu

Brief description:
The goal is to implement a new instantiation of the global margin maximization algorithm of SVMstruct (Tsochantaridis et al., 2004; http://svmlight.joachims.org/svm_struct.html) for sequential segmenting tasks and to test it on two significant Natural Language Processing problems (syntactic chunking and named entity recognition). The resulting tool is to be made public for research use and incorporated into the SVMstruct suite.

Further information:
In recent years, there has been great interest in extending traditional machine learning algorithms for classification to deal with structured outputs, and in providing efficient learning algorithms to train them (Lafferty et al., 2001‐ICML; Altun et al., 2002; Collins, 2002‐EMNLP; Taskar et al., 2003‐NIPS; Altun et al., 2003‐ICML; Tsochantaridis et al., 2004‐ICML; etc.). Natural Language Processing (NLP) is one of the fields in which these new algorithms are likely to play an important role in the near future, since NLP subtasks are represented as structured and relational models (sequences, trees, graphs, etc.).
One of the first global learning methods for structured outputs was based on a generalized version of the voted perceptron with kernels, and was applied successfully to POS tagging and Named Entity Recognition (Collins, 2002). Another popular global learning approach, for which a variety of training algorithms exist, is Conditional Random Fields (Lafferty et al., 2001; Sha and Pereira, 2003). More recently, the margin maximization principle (the one guiding Support Vector Machine learning) has also been applied to develop training methods for structured outputs. One example is Hidden Markov Support Vector Machines (Altun et al., 2003), which has been released as SVMstruct in the SVMlight machine learning suite (Tsochantaridis et al., 2004). Another is Max‐Margin Markov Networks (Taskar et al., 2003, extended version to appear in JMLR), which has been applied, for instance, to parsing, alignment and machine translation (Taskar et al., 2004‐EMNLP; Taskar et al., 2005‐EMNLP; Liang et al., 2006‐ACL) and which has led to more recent theoretical work (Bartlett et al., 2004; Taskar et al., 2005‐ICML).
Specifically, what we propose in this master's thesis is to:
1) Review the state of the art in learning for structured outputs;
2) Implement an efficient instantiation of SVMstruct to deal with the recognition of sequential segments (a setting that lies between sequential labeling and hierarchical bracketing);
3) Evaluate this implementation on a set of selected NLP problems, including syntactic chunking and named entity recognition;
4) Compare the implemented approach to the sequential labeling instantiation (SVM‐hmm) in terms of expressivity, accuracy and efficiency;
5) Make the software and tools available to the research community, releasing the package within the SVMstruct suite.
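
As a rough illustration of what "recognition of sequential segments" in item 2 involves at inference time, the following Python sketch finds the best-scoring segmentation of a sentence by dynamic programming over whole segments rather than over individual tokens. It is only a sketch under simplifying assumptions: each labeled segment is scored independently (no dependency between adjacent segment labels), segment length is bounded by max_len, and the scoring function is a hand-written toy table instead of a learned model. The names decode_segments, toy_score and max_len are hypothetical; they are not part of SVMstruct and do not describe the implementation to be developed in the thesis.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A segment is (start, end_exclusive, label) over token positions.
Segment = Tuple[int, int, str]


def decode_segments(
    n: int,
    labels: Sequence[str],
    max_len: int,
    score: Callable[[int, int, str], float],
) -> Tuple[float, List[Segment]]:
    """Return the highest-scoring segmentation of tokens 0..n-1.

    Assumption: the total score is the sum of independent segment scores.
    """
    # best[i] holds the best score of any segmentation of tokens[0:i].
    best: List[float] = [float("-inf")] * (n + 1)
    back: List[Optional[Tuple[int, str]]] = [None] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            for label in labels:
                candidate = best[start] + score(start, end, label)
                if candidate > best[end]:
                    best[end] = candidate
                    back[end] = (start, label)
    # Follow the back-pointers from the end of the sentence to recover segments.
    segments: List[Segment] = []
    end = n
    while end > 0:
        start, label = back[end]
        segments.append((start, end, label))
        end = start
    segments.reverse()
    return best[n], segments


def toy_score(start: int, end: int, label: str) -> float:
    """Hypothetical scores for the sentence: John Smith visited Paris."""
    table = {(0, 2, "PER"): 3.0, (3, 4, "LOC"): 2.0}
    if (start, end, label) in table:
        return table[(start, end, label)]
    if label == "O" and end - start == 1:
        return 0.5  # mild preference for single-token "outside" segments
    return -1.0


if __name__ == "__main__":
    total, segs = decode_segments(4, ["PER", "LOC", "O"], 2, toy_score)
    print(total, segs)
```

Scoring a whole segment at once lets features refer to the complete candidate chunk or entity (for example, its length or its full string), which a token-by-token labeling model cannot express directly. The price is a decoding cost that grows with max_len and with the number of labels.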

Minimum requirements and prior knowledge:
Courses from the NLP intensification of the AI Master Program: PLN‐CPM and PLNPMT (or equivalent).
Desirable knowledge of:
* Dynamic programming and search
* Statistical Machine Learning
* C++ and Python programming languages
* English language proficiency (esp. written technical language)