Computer Science 696MA - IS- Manuscript PDF Parse/Label

Description:

Oracle is a multinational corporation that develops products and builds tools in many different languages. An important practical problem is making natural language processing (NLP) tools (document classification, named entity recognition, etc.) available in every such language. Traditionally, an NLP practitioner would collect training data in every language for every task and every domain, but such data collection is expensive and time-consuming. Further, many resources available in a language such as English are not available in languages with fewer speakers. In this project, we want to explore an approach to multilingual NLP that does not require labeling so much data. In particular, we would like to harness unlabeled multilingual data to learn a common representation under which structure is shared across different languages. For example, in such a space, the vector for the English word "good" is close to the vector for the French word "bon." Then, by using the multilingual representation as features, we can train a classifier in one language and have it generalize to other languages without much additional labeled data. In this project we will (1) explore how to learn a good multilingual representation, (2) study how the number and class of languages affect the quality of the multilingual embedding space, and (3) study how well the multilingual representations allow us to transfer NLP models across different languages.

Comments:

Open to MS-COMPSCI students with a concentration in Data Science. INSTRUCTOR PERMISSION REQUIRED.

Instructor Permission:

Permission is required for interchange registration during all registration periods.
