Nerdy Linguistic Stuff

September 18, 2013

For the next three months, I will be taking a class online through the Graduate Institute of Applied Linguistics in Dallas TX. Hopefully this will be the second-to-last class I need to take towards my MA in Applied Linguistics, which I hope to finish when we’re back in the US next fall.

The course I’m now taking is an introduction to a computer program called The Bible Translator’s Assistant. TBTA is a “Natural Language Generator” – a computer program that creates a translation of a text into another language, a translation which (hopefully) is accurate to the source text and sounds natural in the target language.

Anyone who has ever used Google Translate knows how bad “machine translations” can be. I have many stories of hapless students who tried to do their Spanish homework for me using an internet translator. The results were usually humorous, often hysterical, and always unfit for publication. Just type a paragraph from a foreign-language novel or storybook into an internet translator and try to make sense of the English “translation” you get. For extra laughs, type a paragraph from an English novel into the computer, translate it to the language of your choice, and then translate that back into English. Or try this paragraph from a Korean children’s story.

So, why do people think that computers can realistically help translate the Bible? Won’t you spend more time cleaning up a poor machine translation than you would just translating manually from scratch? And, I didn’t see Nsenga on the list of options on the internet translator…

Well, there’s a big difference between a Natural Language Generator and a program like Google Translate. Very basically, there are two steps to the translation process: (1) Analyze the source text to determine its meaning, and (2) Reconstruct that meaning using the vocabulary and syntax of the target language. It turns out that computers are pretty good at step two; where they drop the ball is in step one.

The main problem is dealing with source language ambiguities. Is the word “anger” a noun or a verb? Is “read” present tense or past tense? Is the word “may” asking permission, or is it a month of the year? What sense of the word “key” is in focus here, the shiny metal thing or the answers to the test? (This is why a student’s machine-assisted attempt to translate “Can I go to the bathroom?” started with the Spanish word for a tin can…)
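For the programmers in the audience, here is a minimal sketch of why the "tin can" mistake happens. It is a toy, not how Google Translate or any real system works: the dictionary, the function names, and the "always pick the first sense" behavior are all my own assumptions for illustration.

```python
# Toy English-to-Spanish dictionary (illustrative only, not a real system's data).
# Each English word maps to a list of possible Spanish glosses, one per sense.
TOY_DICTIONARY = {
    "can": ["lata", "poder"],   # tin can (noun) / to be able (verb)
    "key": ["llave", "clave"],  # metal key / answer key
    "may": ["mayo", "poder"],   # the month / permission
}


def possible_senses(word):
    """Return every gloss the toy dictionary lists for a word."""
    return TOY_DICTIONARY.get(word.lower(), [word])


def naive_translate(word):
    """Pick the first listed sense, ignoring context entirely."""
    return possible_senses(word)[0]


# The first word of "Can I go to the bathroom?" comes out as a tin can.
print(naive_translate("Can"))
print(possible_senses("key"))
```

Nothing in the lookup knows which sense the sentence needs, so the output is only right when the first sense happens to be the intended one.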

A Natural Language Generator like TBTA is fundamentally different from something like Google Translate because with an NLG, the source text is “pre-analyzed” to remove those ambiguities. Instead of an English Bible translation, TBTA starts with a linguistically-coded semantic representation of the source text, specifically designed to remove ambiguities, so that the computer can clearly apply the “rules” of the target language without danger of misunderstanding the source text.

The TBTA semantic representation for “I [a person named John] should finish reading these books.”

So, TBTA already contains an unambiguous semantic representation of the Biblical text to be translated. A linguist (that’s me) “teaches” the computer the vocabulary and grammatical rules of the target language (in this case, Nsenga). Then the computer applies the rules in a step-by-step way to the semantic representation to produce a translation in the target language.
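The workflow just described can be sketched in a few lines of code. This is only a toy under stated assumptions: the sense tags, the role labels, the tiny Spanish lexicon, and the function names are all hypothetical and are not TBTA's actual coding scheme or rule format. I use Spanish as the target language here simply because I can vouch for the words.

```python
# Hypothetical sense-tagged semantic representation. Every concept already
# carries its sense, role, and grammatical features, so nothing is ambiguous.
SEMANTIC_REPRESENTATION = [
    {"concept": "John", "role": "agent"},
    {"concept": "read", "sense": "A", "role": "action", "tense": "past"},
    {"concept": "book", "sense": "A", "role": "patient", "number": "plural"},
]

# Toy target-language lexicon "taught" by the linguist (Spanish, illustrative).
LEXICON = {
    ("John", None): {"stem": "John"},
    ("read", "A"): {"present": "lee", "past": "leyó"},
    ("book", "A"): {"singular": "libro", "plural": "libros"},
}

# Toy word-order rule for the target language.
WORD_ORDER = ["agent", "action", "patient"]


def realize(node):
    """Choose the surface form for one concept from its features."""
    entry = LEXICON[(node["concept"], node.get("sense"))]
    if node["role"] == "action":
        return entry[node["tense"]]
    if "number" in node:
        return entry[node["number"]]
    return entry["stem"]


def generate(representation):
    """Apply the lexicon and word-order rule to produce a sentence."""
    by_role = {node["role"]: node for node in representation}
    return " ".join(realize(by_role[role]) for role in WORD_ORDER if role in by_role)


print(generate(SEMANTIC_REPRESENTATION))  # John leyó libros
```

Because "read" arrives already marked as past tense, the generator never has to guess, which is exactly the step that trips up an ordinary machine translator.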

If the linguist did his job well, the translation is grammatically correct Nsenga, targeted in simple vocabulary at about the 6th grade reading level. Since the computer always does exactly what you tell it to, the Nsenga that is produced has exactly the same meaning as (is “semantically equivalent” to) the semantic representation of the Biblical text that it started with. And, as long as the semantic representations contained in TBTA are accurate, the Nsenga text will have the same meaning as the original. Voila. Machine-assisted translation. Natural Language Generation.

Testing has shown that a well-done TBTA translation can be used as the base text for a mother-tongue translator, who then needs to do only “light editing” to make the text smooth and natural. And, since the semantic representation has already been checked by a consultant, it theoretically obviates the need for thorough exegetical checking of the draft. In ideal circumstances, a translation project assisted by TBTA can move approximately 5 times faster than a traditional project, with much less manpower and at much lower cost.

Extra-nerdy linguistic sidebar:

TBTA is built on the principles of Natural Semantic Metalanguage Theory. This theory postulates that there is a small set of innate concepts that are present in every language, and that every word in every language can be defined using those innate concepts. These innate concepts are called “semantic primitives.”

However, using the very small number of semantic primitives (there are only about 56) would make communication unwieldy and inelegant, and it would be impossible to translate without distorting the message. So, in addition to the semantic primitives, TBTA also uses “semantic molecules,” which are slightly more complex concepts which are still almost-universally expressed by individual lexemes. TBTA has chosen for its basic vocabulary the approximately 3,000 words in Longman’s Defining Vocabulary, which is a carefully-selected list of the words most commonly used when defining other words in the Longman Dictionary of Contemporary English.

The program also makes use of so-called “complex concepts.” These complex concepts are target-language-specific semantic “bundles,” which are like a condensed shorthand way of expressing a series of semantic primitives. It’s sort of like when a person has spent 5 minutes trying to explain something to you and you finally say, “Oh! We call that _____!” For example, in English, “betray” is a complex concept that bundles together something like, “the action of a friend or ally causing a person’s enemies to be able to capture or harm that person.” The computer doesn’t have to use the unwieldy series of simple molecules, but can simply substitute “betray” each time. The complex concepts are manually written into the program in each target language by the linguist as a “rule,” so that, for example, every time the concept “a person who takes care of sheep” is encountered in the semantic representation, “shepherd” is realized on the surface output.
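A complex-concept rule like the "shepherd" one can be pictured as a simple find-and-replace over a sequence of concepts. The sketch below is only my illustration: treating the semantic representation as a flat list of words, and the names COMPLEX_CONCEPT_RULES and apply_complex_concepts, are assumptions, not how TBTA actually stores or applies its rules.

```python
# Hypothetical rule table: a sequence of simple concepts on the left,
# the single target-language word that bundles them on the right.
COMPLEX_CONCEPT_RULES = [
    (("person", "who", "takes", "care", "of", "sheep"), "shepherd"),
]


def apply_complex_concepts(concepts, rules=COMPLEX_CONCEPT_RULES):
    """Replace every matching run of simple concepts with its bundled word."""
    output = []
    i = 0
    while i < len(concepts):
        for pattern, word in rules:
            if tuple(concepts[i:i + len(pattern)]) == pattern:
                output.append(word)
                i += len(pattern)  # skip past the whole matched run
                break
        else:
            # No rule matched at this position, so keep the concept as-is.
            output.append(concepts[i])
            i += 1
    return output


simple = ["the", "person", "who", "takes", "care", "of", "sheep", "sleeps"]
print(apply_complex_concepts(simple))  # ['the', 'shepherd', 'sleeps']
```

The linguist writes the rule once, and the substitution then happens everywhere the pattern shows up.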

Currently, the big drawback to TBTA is that crafting the semantic representations of the Bible is a time-consuming and difficult task, and in some particular cases (such as the Psalms) might prove to be impossible. Only a handful of books currently have checked semantic representations. Nevertheless, the program shows promise, and testing with a TBTA-produced translation in Korean has shown that it generates text that tests favorably with other traditionally-translated Bible versions (that is, it’s as good as stuff on the shelf right now).

At any rate, taking this class is getting me one step closer to finishing my MA, which is an important professional credential that can help me in work permits, etc. Exploring TBTA is also interesting from the point-of-view of the Nsenga advisor, because once the linguistic “rules” of Nsenga have been encoded into TBTA, this program is another resource that we can use in our translation, especially as we move forward into the Old Testament after furlough. Finally, I will be the first African-based linguist to work with TBTA, and it will be interesting (hopefully interesting enough to provide a topic for my major MA thesis project!) to see how the program deals with the unique exigencies of a Bantu language.

So, wish me luck, and forgive me if I’m quieter than usual for the next three months as I add this new responsibility to my schedule. “May all of your utterances be laced with humorous semantic ambiguities.”

6 Comments

Steve Lawrenz

September 18, 2013 9:33 am

Wednesday, September 18, 2013.

Chris,

I read this with interest.

I think that some time back I did get a taste of this concerning a Ukrainian translation that was made of the Bible.

As your write-up says in various ways, “With a computer it finally comes down to ‘garbage in, garbage out.’” In that way I am intrigued at how this program can find any appreciable success. I hope that when you see me again, we can talk about it.

Steve Lawrenz, Blantyre, Malawi.

Ellen

September 18, 2013 11:59 am

I smile at some of your linguistic problems (although frankly I couldn’t read through this last one). This makes mine seem rather simple and maybe not as important. Volunteer teaching ESL students at the Literacy Center also brings linguistic problems. I am learning things about my language that I never thought about before teaching ESL. One little word or a single sound of a letter can make a difference in meaning.

For example: A former Chinese student needed lessons on hearing the difference between short e sound and the long I sound and short I sound… he could not hear it when he came to the center. He said, “I like your smell (meaning smile)”… now that could be offensive. Seems like a small problem, but it is huge and takes many lessons of minimal pairings to even hear the difference between these two sounds, let alone say it, or write it.

Last week my new Chinese student asked, “Teacher, teacher, is it at school or in school… when do I use at, when do I use in?” It’s subtle, I thought. “Hmmm,” I said. “I would work on that for next week.” Never thought of it before, but it could be in the store or at the store; only in the TV room not at the TV room, usually only in Milwaukee, at home not in home… etc. This week I emailed him and told him to come to class knowing what the words complicated and difficult mean. I will begin next week’s lesson with those words.

I personally know what sounds right when you use at and in (as prepositions of location), but don’t ever think I had to learn any rules for them. So I spent a good amount of time on the computer trying to figure out if there were rules, and found well sort of… at means a point and in means enclosed… usually, but not always. Try to explain at a point concept to someone with very, very limited English.

So I made a list of nouns that sounded right with at and a list of nouns sounding right with in and another list where both are used. I will use a picture/diagram to try to show at a point…the enclosed concept will be easier. But the bulk of the lesson will be repeat reading at and in phrases in different contexts. Hopefully his brain will adapt to the sound of the syntax and semantics.
Challenging… interesting… and definitely worth it. Ellen

- The Plugers
  
  September 19, 2013 12:31 am
  
  That sounds like a great way to deal with those issues, Ellen. It’s awesome that you’re helping with ESL. I did some of that at various times, but I could almost always “cheat” with my Spanish. It’s a different thing with a language you don’t know at all. And yes, it’s the “little things” in English that are such a trick to explain — why we say things the way we do, subtle pronunciation things… I had someone who wanted to know why we pronounce “the” as /thee/ sometimes, and /thuh/ at others. Over here, the big English pronunciation thing is between /l/ and /r/ — so I wonder why people suggest putting “glass” over my garden to protect the seeds, or talking about “lope,” or that a thing is “lead” (instead of “red”). Way to meet your challenges head-on!
  
Innerspaceretreatcenter

September 18, 2013 1:17 pm

Wishing you all the best in your studies

God bless

Richard

sudast

October 1, 2013 1:03 pm

Hi Chris and Janine! I got here through the TBTA updates. Great explanation of TBTA. I’ve seen your names twice today, as I saw the update in the GIAL alumni newsletter. I celebrate with you about the book of Mark! For myself, our family has just arrived in Asia for our first assignment. Looking forward to keeping in touch.
