Monday, May 7, 2012

What does “A to B is supported” mean in a Speller data source?

If we say that this data supports the transition from A to B, we mean the following:
The probability of B is higher than the probability of A according to this data.
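
A minimal sketch of that check, assuming the data source can be reduced to a table of word counts. The function name `supports_transition`, the example words, and the counts are all hypothetical and only illustrate the comparison; a real Speller data source may compute probabilities differently.

```python
def supports_transition(prob, a, b):
    """Return True if the data gives B a higher probability than A.

    `prob` maps each string to its probability under one data source.
    Strings missing from the data are treated as probability 0.0
    (an assumption made for this sketch).
    """
    return prob.get(b, 0.0) > prob.get(a, 0.0)


# Hypothetical counts standing in for one data source.
counts = {"recieve": 5, "receive": 995}
total = sum(counts.values())
prob = {word: count / total for word, count in counts.items()}

# The data supports "recieve" -> "receive", but not the reverse.
forward = supports_transition(prob, "recieve", "receive")
backward = supports_transition(prob, "receive", "recieve")
```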

Tuesday, May 1, 2012

Natural Language Processing: Word Tokenization Best Practice

Some tokenizers split on all non-alphabetical characters, which includes the apostrophe. This means that words like "Ophelia's" and "dimm'd" will be tokenized as "Ophelia s" and "dimm d".

This is a common problem in NLP tasks: the simplest method works for most cases but fails on a variety of edge cases. One way to fix this is to split only on whitespace characters, or to remove all punctuation first.
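
The three strategies can be compared side by side with a small Python sketch. The function names are made up for illustration, and the regular expressions are one plausible reading of "non-alphabetical" and "punctuation", not the definition used by any particular tokenizer.

```python
import re


def split_non_alpha(text):
    # Naive approach: split on every run of non-alphabetic characters,
    # so the apostrophe breaks a word in two.
    return [tok for tok in re.split(r"[^A-Za-z]+", text) if tok]


def split_whitespace(text):
    # Split only on whitespace; punctuation stays attached to the word.
    return text.split()


def strip_punct_then_split(text):
    # Remove everything that is neither a word character nor whitespace,
    # then split on whitespace.
    return re.sub(r"[^\w\s]", "", text).split()


sample = "Ophelia's dimm'd"
naive = split_non_alpha(sample)            # ['Ophelia', 's', 'dimm', 'd']
by_space = split_whitespace(sample)        # ["Ophelia's", "dimm'd"]
no_punct = strip_punct_then_split(sample)  # ['Ophelias', 'dimmd']
```

Both alternatives keep each word as a single token, but they differ in whether the apostrophe survives, so the choice depends on what the downstream task needs.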
 
For an example, see Dan Jurafsky's NLP course: https://class.coursera.org/nlp/lecture/preview