Computers may understand you better thanks to new MIT database
Computers are already getting pretty good at deciphering human speech thanks to advancements in natural language processing (NLP), but so far most of these programs have been trained to understand native speakers talking in their own languages. Researchers at the Massachusetts Institute of Technology (MIT) want to change that, and today they announced that they have just completed the first major database of non-native English.
“English is the most used language on the Internet, with over 1 billion speakers,” said Yevgeni Berzak, an MIT graduate student who headed up the project. “Most of the people who speak English in the world or produce English text are non-native speakers. This characteristic is often overlooked when we study English scientifically or when we do natural-language processing for English.”
People make grammatical mistakes all of the time, especially in speech, but good NLP programs are able to navigate those mistakes to understand what users mean rather than what they actually say. This process is more difficult with non-native speakers, who often make unusual mistakes that native speakers would not. An NLP program trained only on repositories of native-speaker data will have trouble understanding input from non-native speakers.
The data in MIT’s new project comes from 5,124 sentences taken from essays written by English-as-a-second-language (ESL) students who are native speakers of 10 languages that together are spoken by roughly 40 percent of the world’s population. The sentences have all been annotated for grammatical features ranging from basic parts of speech, such as verbs and nouns, to more complicated concepts, including plurality, verb tense, adjectives, and more.
In addition, the researchers used a recently developed annotation scheme called Universal Dependencies (UD), which offers a deeper analysis of the relationships between words in a sentence, such as which words function as direct or indirect objects, which nouns are modified by adjectives, and so on. This allows the sentences to be annotated not only for structure but also for meaning.
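To give a sense of what UD-style annotation looks like, here is a minimal sketch in Python. The sentence and its annotation are illustrative, not taken from the MIT dataset, and the format is simplified to five of the ten columns used in real CoNLL-U treebank files: token index, word, part of speech, the index of the word’s syntactic head, and the dependency relation.

```python
# Simplified CoNLL-U-style annotation (illustrative example, not from
# the MIT dataset): index, word, part of speech, head index, relation.
CONLLU = """\
1\tShe\tPRON\t2\tnsubj
2\twrote\tVERB\t0\troot
3\ta\tDET\t5\tdet
4\tlong\tADJ\t5\tamod
5\tessay\tNOUN\t2\tobj
"""

def parse(block):
    """Parse the tab-separated lines into token dictionaries."""
    tokens = []
    for line in block.strip().splitlines():
        idx, word, pos, head, rel = line.split("\t")
        tokens.append({"id": int(idx), "word": word, "pos": pos,
                       "head": int(head), "rel": rel})
    return tokens

tokens = parse(CONLLU)
words = {t["id"]: t["word"] for t in tokens}

# The relation labels let us answer structural questions directly:
# "amod" marks an adjective modifying a noun, "obj" a direct object.
for t in tokens:
    if t["rel"] == "amod":
        print(f'"{t["word"]}" modifies "{words[t["head"]]}"')
    if t["rel"] == "obj":
        print(f'"{t["word"]}" is the direct object of "{words[t["head"]]}"')
```

Because every UD treebank uses the same relation labels, the same few lines of lookup code work whether the annotated sentences come from native English, ESL writing or another language entirely, which is what makes the cross-corpus comparisons described below possible.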
“What I find most interesting about the ESL [dataset] is that the use of UD opens up a lot of possibilities for systematically comparing the ESL data not only to native English but also to other languages that have corpora annotated using UD,” said Joakim Nivre, an expert on computational linguistics and one of the creators of Universal Dependencies. “Hopefully, other ESL researchers will follow their example, which will enable further comparisons along several dimensions, ESL to ESL, ESL to native, et cetera.”