CALL

  1. Enter several sample sentences (you can copy-paste them from the web or write your own) into the textbox where it says “tokenize text”. Your sentences should include at least one contraction and at least one compound word (if you don’t know what a compound word is, see here).
  2. Observe how the different tokenizers handle your text. Look carefully at the whitespace tokenizer and answer the following question: Are spaces sufficient to tokenize English language text? Why or why not? Cite examples from your test to support your conclusion.
    • The different tokenizers split the text on spaces between words, on punctuation, and on several other patterns. Some of the tokenizers struggled with contractions because of the apostrophe, while the whitespace tokenizer left most terms intact. Spaces are sufficient for tokenizing much English text, but the method has clear failure cases. For example, a contraction is kept in its contracted form as a single token rather than being split into the two words it stands for. Likewise, hyphenated words are tokenized incorrectly: where they should yield separate tokens, splitting on spaces keeps the joined words as one token, and trailing punctuation stays attached to the preceding word. Therefore, spaces alone are not sufficient to tokenize all English language text.
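The failure cases described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the sample sentence and the regular-expression pattern are my own, not taken from the tokenizer demo in the exercise.

```python
import re

text = "I can't believe the well-known author didn't show up."

# Whitespace tokenizer: split on runs of whitespace only.
whitespace_tokens = text.split()
# Contractions ("can't", "didn't") and the hyphenated compound
# ("well-known") survive as single tokens, and the final period
# stays glued to the last word ("up.").
print(whitespace_tokens)

# A slightly smarter pattern: keep contractions together but
# separate other punctuation (including hyphens) into its own tokens.
pattern_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)
print(pattern_tokens)
```

Comparing the two outputs makes the answer concrete: the whitespace split never produces a standalone `.` or `-` token, while the pattern-based split does, which is why practical tokenizers go beyond splitting on spaces.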