Sweet Potato

Asks and Answers Questions On Wikipedia Articles

Spring 2014


What Is It?

Sweet Potato is our three-person group's submission for the semester-long project in CMU's Natural Language Processing course. Our program reads in Wikipedia articles, generates human-readable questions about them, and produces answers to those questions. It then competes against other groups' programs in a battle for supremacy.

The name "Sweet Potato" is just something random we came up with for our repository, nothing more.



Questions generated on "The Departed" article

How Is It Made?

We used multiple languages and libraries in our development. Our question generator is written in Java on top of the Stanford CoreNLP parser, which it uses to break down and rearrange sentences in order to overgenerate questions. Its output is passed to Python, where the scikit-learn package ranks the questions with an SVM and picks the best ones. Our question answerer also uses Stanford CoreNLP, first to resolve pronouns, and then feeds the output into Python, which tokenizes with NLTK and runs an algorithm based on weighted TF-IDF.
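To give a feel for the TF-IDF step in the answerer, here is a minimal sketch and not our actual code. It assumes the answerer picks the article sentence that best overlaps with the question; the function names `tokenize` and `best_sentence`, the regex tokenizer (standing in for NLTK), and the smoothed IDF formula are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a stand-in for NLTK)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def best_sentence(question, sentences):
    """Return the sentence with the highest TF-IDF-weighted overlap with the question."""
    docs = [tokenize(s) for s in sentences]
    n = len(docs)
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumed variant): common terms get a low but nonzero weight.
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}

    q_terms = set(tokenize(question))
    best, best_score = None, float("-inf")
    for sentence, doc in zip(sentences, docs):
        if not doc:
            continue  # skip sentences with no tokens
        tf = Counter(doc)
        # Sum TF * IDF over question terms, with TF normalised by sentence length.
        score = sum(tf[t] / len(doc) * idf[t] for t in q_terms if t in tf)
        if score > best_score:
            best, best_score = sentence, score
    return best


# Illustrative input only; a real run would use sentences from the article.
sentences = [
    "The Departed is a 2006 crime film directed by Martin Scorsese.",
    "The film stars Leonardo DiCaprio and Matt Damon.",
    "It won four Academy Awards.",
]
answer = best_sentence("Who directed The Departed?", sentences)
```

Rare words such as "directed" outweigh common ones such as "the", which is why the first sentence wins here. The weighting we actually used may have differed from this plain version.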



Answers generated on "The Departed" article

What Did I Do?

I focused on the Java side of our code, working closely with the Stanford CoreNLP library. The first piece is the question overgenerator, in which I included multiple ways to trick and mislead other teams' answerers, such as switching pronouns. I also wrote a pronoun resolver, since answers with pronouns in them are generally nonsensical. Our question generator produced the 3rd-hardest question set out of the 25 groups: over half of the answers to our questions were incorrect.
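As a rough illustration of the pronoun-switching idea, here is a small Python sketch of one possible reading of it: swapping gendered pronouns in a generated question. The real overgenerator was written in Java on top of Stanford CoreNLP and may have worked quite differently; the `SWAPS` table and the `swap_pronouns` function are hypothetical.

```python
import re

# Hypothetical swap table, kept to unambiguous pairs for simplicity.
SWAPS = {"he": "she", "she": "he", "himself": "herself", "herself": "himself"}

# Match any listed pronoun as a whole word, ignoring case.
PATTERN = re.compile(r"\b(" + "|".join(SWAPS) + r")\b", flags=re.IGNORECASE)


def swap_pronouns(sentence):
    """Replace each listed pronoun with its counterpart, keeping capitalization."""
    def replace(match):
        word = match.group(0)
        swapped = SWAPS[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped

    return PATTERN.sub(replace, sentence)


question = swap_pronouns("Did he score the goal himself?")
```

The word-boundary anchors keep the pattern from touching words that merely contain a pronoun, such as "the".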



Questions generated on "Clint Dempsey" article

What Did I Learn?

Besides applications of natural language processing algorithms, I learned techniques of agile development. I also realized that English is a language with many edge cases, which caused me to experiment heavily with my code, sometimes throwing away hundreds of lines of progress, in order to achieve desirable results. I am not afraid of hacking, failing, or restarting to get things done.



Answers generated on "Clint Dempsey" article

GitHub Repository