New features:
+ Train a new tokenization model on a much larger dataset (F1 score = 0.985).
+ Add the training data and training code.
+ Better integration with spacy.io: redundant spaces between tokens are removed after tokenization, e.g. `Việt Nam , 12 / 22 / 2020` => `Việt Nam, 12/22/2020`.
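
The space cleanup in the spacy.io item can be illustrated with a minimal sketch. This is not the library's actual implementation: the function name `remove_redundant_spaces` and the two regular expressions are assumptions chosen only to reproduce the example above, and the real post-processing may handle more cases.

```python
import re


def remove_redundant_spaces(text: str) -> str:
    """Hypothetical helper: rejoin punctuation that tokenization split off."""
    # Drop whitespace that precedes closing punctuation such as "," or ".".
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    # Remove whitespace around "/" when it sits between digits (e.g. dates).
    text = re.sub(r"(?<=\d)\s*/\s*(?=\d)", "/", text)
    return text


print(remove_redundant_spaces("Việt Nam , 12 / 22 / 2020"))
```

The slash rule is limited to digits on both sides so that a spaced slash between words is left untouched.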