Tuesday, May 1, 2012

Natural Language Processing: Word Tokenization Best Practice

Some tokenizers split on all non-alphabetic characters, including the apostrophe. This means that words like "Ophelia's" and "dimm'd" will be tokenized to "Ophelia s" and "dimm d".
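
Here is a minimal Python sketch of the problem. The function name split_on_non_alpha is made up for illustration and is not taken from any particular tokenizer library; it only assumes ASCII letters count as alphabetic.

```python
import re

def split_on_non_alpha(text):
    # Hypothetical helper: split on every run of non-alphabetic (ASCII)
    # characters and drop any empty strings left at the edges.
    return [tok for tok in re.split(r"[^A-Za-z]+", text) if tok]

print(split_on_non_alpha("Ophelia's"))  # ['Ophelia', 's']
print(split_on_non_alpha("dimm'd"))     # ['dimm', 'd']
```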

This is a common problem in NLP tasks: the simplest method works for most cases but fails on a variety of edge cases. One way to fix this is to split only on whitespace characters, or to remove all punctuation first, so that a word like "Ophelia's" is not cut in two.
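
The two fixes can be sketched roughly like this in Python. Both function names are hypothetical, and the punctuation set is assumed to be ASCII only (string.punctuation); this is an illustration, not the approach from any specific course or library.

```python
import string

def split_on_whitespace(text):
    # str.split() with no arguments splits on any run of whitespace.
    return text.split()

def strip_punctuation_then_split(text):
    # Delete all ASCII punctuation first, then split on whitespace.
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table).split()

line = "Ophelia's eyes dimm'd."
print(split_on_whitespace(line))           # ["Ophelia's", 'eyes', "dimm'd."]
print(strip_punctuation_then_split(line))  # ['Ophelias', 'eyes', 'dimmd']
```

Neither fix is perfect: whitespace splitting leaves the trailing period attached to "dimm'd.", while removing punctuation keeps each word in one piece but drops the apostrophe.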
 
For an example, see Dan Jurafsky's NLP course: https://class.coursera.org/nlp/lecture/preview
