llama-3 has ditched its old tokenizer and instead uses the same tokenizer as gpt-4 (tiktoken, created by openai); it even shares the same first 100K tokens of vocabulary.
In this video Chris walks through why Meta switched tokenizer and the implications for model size, the embeddings layer and multi-lingual tokenization.
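As a rough back-of-envelope sketch of the model-size implication (not Chris's exact numbers): both llama-2-7b and llama-3-8b use a hidden size of 4096, but the vocabulary jumps from 32,000 to 128,256 tokens, so the input embedding matrix and the untied output projection together grow by roughly 0.8 billion parameters.

    # back-of-envelope: how the bigger vocabulary inflates the parameter count
    hidden_dim = 4096                       # hidden size of llama-2-7b and llama-3-8b
    llama2_vocab, llama3_vocab = 32_000, 128_256

    def embedding_params(vocab_size, dim, tied=False):
        # input embedding matrix + (if untied) the output/lm-head projection
        return vocab_size * dim * (1 if tied else 2)

    print(embedding_params(llama2_vocab, hidden_dim))   # ~262M parameters
    print(embedding_params(llama3_vocab, hidden_dim))   # ~1.05B parameters, ~0.8B more

That difference in the embedding and output layers accounts for much of the jump from "7B" to "8B".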
He also runs his tokenizer benchmark and shows how it's more efficient in languages such as Japanese.
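For the full benchmark see the repos below; as a minimal illustration of the idea, you can compare token counts per language with tiktoken's cl100k_base encoding (the gpt-4 vocabulary that llama-3's tokenizer builds on, not the extended 128K llama-3 vocabulary itself):

    import tiktoken  # pip install tiktoken

    # gpt-4's ~100K-token vocabulary; llama-3 extends this to ~128K tokens
    enc = tiktoken.get_encoding("cl100k_base")

    samples = {
        "english": "The quick brown fox jumps over the lazy dog.",
        "japanese": "素早い茶色の狐がのろまな犬を飛び越える。",
    }

    for lang, text in samples.items():
        tokens = enc.encode(text)
        print(f"{lang}: {len(tokens)} tokens for {len(text)} characters")

Fewer tokens per character means cheaper and faster inference for that language.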
repos
------
https://github.com/chrishayuk/embeddings
https://github.com/chrishayuk/tokeniz...