newbie question on tokenizer and Unicode text