Computer Science 696MA - IS- Manuscript PDF Parse/Label
Spring
2017
01
3.00
Andrew McCallum
1:00AM 1:00AM
UMass Amherst
21967
Oracle is a multinational corporation that develops products and builds tools in many different languages. An important practical problem is to make natural language processing (NLP) tools (document classification, named entity recognition, etc.) available in every such language. Traditionally, an NLP practitioner would collect training data in every language for every task and every domain, but such data collection is expensive and time-consuming. Further, many resources available in a language such as English are not available in languages with fewer speakers. In this project, we want to explore an approach to multilingual NLP that does not require labeling so much data. In particular, we would like to harness unlabeled multilingual data to learn a common representation in which structure is shared across languages. For example, in such a space, the vector for the English word "good" is close to the vector for the French word "bon." Then, by using the multilingual representation as features, we can train a classifier in one language and have it generalize to other languages without much additional labeled data. In this project we will (1) explore how to learn a good multilingual representation, (2) study how the number and class of languages affect the quality of the multilingual embedding space, and (3) study how well the multilingual representations allow us to transfer NLP models across languages.
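The transfer idea in the description can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the three-dimensional vectors are invented by hand rather than learned, the word list is hypothetical, and the nearest-centroid classifier merely stands in for whatever model the project would actually use.

```python
import numpy as np

# Hypothetical, hand-made vectors in a shared space (not learned embeddings).
EMB = {
    ("en", "good"): [0.9, 0.1, 0.0],
    ("en", "great"): [0.8, 0.2, 0.1],
    ("en", "bad"): [-0.9, 0.1, 0.0],
    ("en", "awful"): [-0.8, 0.2, 0.1],
    ("fr", "bon"): [0.85, 0.15, 0.05],
    ("fr", "mauvais"): [-0.85, 0.1, 0.05],
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def fit_centroids(labeled):
    """Average the vectors of each class; `labeled` holds ((lang, word), label) pairs."""
    by_label = {}
    for key, label in labeled:
        by_label.setdefault(label, []).append(EMB[key])
    return {label: np.mean(vecs, axis=0) for label, vecs in by_label.items()}


def predict(centroids, key):
    """Return the label whose centroid is most similar to the word's vector."""
    return max(centroids, key=lambda label: cosine(centroids[label], EMB[key]))


# Train on English labels only.
train = [
    (("en", "good"), "pos"),
    (("en", "great"), "pos"),
    (("en", "bad"), "neg"),
    (("en", "awful"), "neg"),
]
centroids = fit_centroids(train)

# Apply the English-trained classifier to French words it never saw.
for word in ("bon", "mauvais"):
    print(word, predict(centroids, ("fr", word)))
```

Because the French vectors sit near their English counterparts in the shared space, a classifier fit only on English examples can label them without any French training data. How well this holds with learned embeddings is the question the project sets out to study.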
Open to MS-COMPSCI students with a concentration in Data Science. INSTRUCTOR PERMISSION REQUIRED.