42031 - Natural Language Processing for Massive Textual Data Management (PLN-PMT) [UPC]

Type: S3 Course
Semester: Fall
Teaching Points: 15
Offer: Annual
Responsible Unit: CS-UPC
Responsible: Lluís Márquez
Language: English


The main goal of this course is to provide students with in-depth knowledge of the techniques, methods, and tools, both symbolic and empirical, of Natural Language Processing. The course focuses on systems dealing with human-machine communication. These systems generally manage linguistic knowledge explicitly.

The linguistic knowledge involved can be built from scratch for a specific application or taken from either domain-restricted or general-purpose linguistic repositories.

This course is closely coupled with the course on the processing of massive amounts of textual data (Natural Language Processing for massive textual data management). Together they provide students with a broad knowledge of the two basic paradigms in NLP in the context of the two most frequent scenarios.

Finally, the course will introduce students to the most active research areas related to the different course topics.


The content of the course is organized into three main blocks:

  • The most representative applications based on massive processing of textual data. These applications are currently used mainly in the context of Internet processing and the automatic organization of very large document databases, but they remain the focus of very active research. The applications covered in the course are: Document Categorization, Information Extraction, and Automatic Summarization.
  • Basic generic tasks, which can be very useful for the applications listed above (among others). We will cover only those that have not been introduced in previous mandatory courses of the Master. More concretely, the generic tasks studied will be: partial parsing, word sense disambiguation, and semantic role labeling.
  • The introduction of advanced Machine Learning techniques for Natural Language Processing. These algorithms and techniques are very useful for implementing most of the generic tasks described in the previous point. By "advanced" we mean that the ML topics covered extend the basic techniques already known to students from the previous mandatory courses of the Master.

The table of contents is structured in four themes. The numbers accompanying each title indicate roughly the percentage of the course devoted to the corresponding theme. As can be seen, the main focus is on the applications.

1. Introduction
  • The necessity of automatically processing massive quantities of textual data. Main applications in this domain.
2. Advanced Topics in Machine Learning
  • Statistical Methods: Maximum Entropy modeling
  • Discriminative Learning Methods: Boosting, Support Vector Machines
  • Semi-supervised Learning: Bootstrapping, co-training and variants
  • Learning & Inference for relational and structured domains
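As a flavour of the maximum-entropy modeling listed above, here is a minimal sketch (our own illustration, not course material): binary logistic regression, the two-class special case of a maximum-entropy model, trained by gradient ascent on hypothetical toy bag-of-words data.

```python
import math

def train_maxent(docs, labels, epochs=200, lr=0.5):
    """Train a binary maximum-entropy (logistic regression) model on
    bag-of-words features by gradient ascent on the log-likelihood."""
    vocab = sorted({w for d in docs for w in d})
    w = {word: 0.0 for word in vocab}
    b = 0.0
    for _ in range(epochs):
        for doc, y in zip(docs, labels):
            score = b + sum(w[t] for t in doc)
            p = 1.0 / (1.0 + math.exp(-score))  # P(y=1 | doc)
            g = y - p                           # gradient of the log-likelihood
            b += lr * g
            for t in doc:
                w[t] += lr * g
    return w, b

def classify(w, b, doc):
    score = b + sum(w.get(t, 0.0) for t in doc)
    return 1 if score > 0 else 0

# Hypothetical toy data: label 1 = sports, label 0 = finance
docs = [["goal", "match"], ["stock", "market"], ["match", "team"], ["market", "bank"]]
labels = [1, 0, 1, 0]
w, b = train_maxent(docs, labels)
print(classify(w, b, ["team", "goal"]))  # expected: 1
```

Real maximum-entropy taggers use many thousands of features and regularized optimizers, but the training objective is the same as in this sketch.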
3. Generic Tasks
  • Partial parsing: chunking, clause boundary detection
  • Word Sense Disambiguation
  • Semantic Role Labeling
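To make the partial-parsing (chunking) task concrete, here is a minimal sketch (our own illustration, not course material) of recovering chunks from the standard BIO tagging scheme, in which B-X begins a chunk of type X, I-X continues it, and O marks tokens outside any chunk:

```python
def bio_to_chunks(tokens, tags):
    """Group a BIO-tagged token sequence into typed chunks."""
    chunks, current, ctype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((ctype, current))
            current, ctype = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == ctype:
            current.append(tok)
        else:  # O tag, or an I- tag inconsistent with the open chunk
            if current:
                chunks.append((ctype, current))
            current, ctype = [], None
    if current:
        chunks.append((ctype, current))
    return chunks

tokens = ["He", "reckons", "the", "current", "account", "deficit", "will", "narrow"]
tags   = ["B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "I-NP", "B-VP", "I-VP"]
print(bio_to_chunks(tokens, tags))
# [('NP', ['He']), ('VP', ['reckons']),
#  ('NP', ['the', 'current', 'account', 'deficit']), ('VP', ['will', 'narrow'])]
```

The hard part of chunking, of course, is predicting the BIO tags themselves, which is where the machine learning techniques of the previous theme come in.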
4. Applications
  • Document Categorization: thematic classification, using hierarchies of concepts from the Web, subjective classification (intention, sentiment, etc.)
  • Information Extraction: typology, adaptability, multilinguality, evaluation
  • Automatic Summarization: single document, multi-document, multilingual
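As a taste of the extractive approach to single-document summarization, here is a minimal frequency-based sketch (our own illustration on hypothetical sentences, not course material): score each sentence by the average corpus-wide frequency of its words and keep the top-scoring ones in document order.

```python
from collections import Counter

def summarize(sentences, k=1):
    """Extractive summary: rank sentences by average word frequency
    and return the top k, preserving document order."""
    words = [s.lower().split() for s in sentences]
    freq = Counter(w for ws in words for w in ws)
    scores = [sum(freq[w] for w in ws) / len(ws) for ws in words]
    top = sorted(range(len(sentences)), key=lambda i: -scores[i])[:k]
    return [sentences[i] for i in sorted(top)]

doc = [
    "The parliament passed the new budget today",
    "Opposition parties criticised the budget vote",
    "Weather in the capital was mild",
]
print(summarize(doc, k=1))
# ['The parliament passed the new budget today']
```

Practical summarizers add stop-word filtering, redundancy control for the multi-document setting, and language-independent features for the multilingual one, but sentence extraction by salience scoring remains the core idea.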