How does BERT handle out-of-vocabulary words?


  BERT handles out-of-vocabulary words with a technique called WordPiece tokenization, a subword-based method that splits words into smaller units. A fixed vocabulary of whole words, subword pieces, and individual characters is built from the training corpus, and any word that is not in that vocabulary is broken down into subword units that are.
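  As a quick illustration of the fixed vocabulary, the sketch below uses the Hugging Face transformers library (an assumed implementation choice; the original answer names no specific toolkit) to load a pretrained BERT vocabulary and check whether particular words are present:

```python
# A minimal sketch, assuming the Hugging Face transformers library is available.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

vocab = tokenizer.get_vocab()        # mapping from token string to integer id
print(len(vocab))                    # fixed vocabulary size (~30k for this model)
print("happiness" in vocab)          # a common word is likely stored as a whole token
print("unhappiness" in vocab)        # a rarer word is likely absent and must be split
```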

  During training, BERT tokenizes the input text into WordPiece tokens. If a word is not found in the vocabulary, it is split into subword units; for example, the word "unhappiness" might be tokenized into "un" and "##happiness" (the exact split depends on the learned vocabulary). The double hash prefix (##) marks a piece that continues the preceding piece within the same word, while the first piece of a word carries no prefix. This way, BERT can handle rare or unseen words by representing them as combinations of known subword units.
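  Continuing the sketch above, the snippet below shows an out-of-vocabulary word being split into pieces that all exist in the vocabulary (the printed split is only an expectation; the real pieces depend on the learned vocabulary):

```python
# Splitting an out-of-vocabulary word into WordPiece tokens (Hugging Face transformers).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

pieces = tokenizer.tokenize("unhappiness")
print(pieces)    # e.g. ['un', '##happiness'] -- the exact split depends on the vocabulary

ids = tokenizer.convert_tokens_to_ids(pieces)
print(ids)       # every piece maps to an id in the fixed vocabulary, so nothing is lost
```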

  At inference time, an out-of-vocabulary word is handled the same way: it is broken into subword units and matched against the vocabulary. If a candidate piece is not found, shorter pieces are tried until known units are obtained. Because the vocabulary includes individual characters, almost any word can be decomposed; only words containing characters outside the vocabulary fall back to the special [UNK] token.
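  The matching itself is a greedy longest-match-first scan. The sketch below is a simplified version of that procedure (real implementations also normalize the text and cap the number of characters per word; those details are omitted here, and the toy vocabulary is hypothetical):

```python
# Simplified sketch of WordPiece's greedy longest-match-first matching.
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        cur_piece = None
        # Try the longest remaining substring first and shrink until a match is found.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece   # continuation pieces carry the ## prefix
            if piece in vocab:
                cur_piece = piece
                break
            end -= 1
        if cur_piece is None:
            # No part of the remaining characters is in the vocabulary,
            # so the whole word falls back to the unknown token.
            return [unk_token]
        pieces.append(cur_piece)
        start = end
    return pieces

# A toy vocabulary to show the matching behaviour.
toy_vocab = {"un", "##happiness", "##hap", "##piness"}
print(wordpiece_tokenize("unhappiness", toy_vocab))   # ['un', '##happiness']
```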

  By using WordPiece tokenization, BERT handles out-of-vocabulary words effectively while keeping a compact, fixed-size vocabulary. Rare or unseen words still receive meaningful representations as combinations of known pieces, so BERT can capture contextual information for both known and unknown words, making it more robust in understanding the meaning of a sentence or text.
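  For downstream use, the subword pieces of a rare word can be pooled back into a single word-level vector. The sketch below, which assumes the Hugging Face transformers and PyTorch libraries, averages the contextual embeddings of the pieces of one such word:

```python
# A minimal sketch: pool the contextual embeddings of a word's subword pieces.
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

sentence = "Her unhappiness was obvious."
encoding = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = model(**encoding).last_hidden_state[0]   # (seq_len, hidden_size)

# word_ids() maps each subword piece back to the word it came from; in this
# sentence, word index 1 is "unhappiness", whatever pieces it was split into.
word_ids = encoding.word_ids()
piece_positions = [i for i, w in enumerate(word_ids) if w == 1]
word_vector = hidden[piece_positions].mean(dim=0)     # one contextual vector
print(word_vector.shape)                              # torch.Size([768])
```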
