
Implementing Tokenization Using bert-base-uncased Tokenizer

To experiment with tokenization, we'll use Hugging Face's transformers library. This example tokenizes text with the bert-base-uncased tokenizer; you can swap in another tokenizer, such as GPT-2's, to compare the results.

Step 1: Install Dependencies

Open a Google Colab notebook (or any Python environment) and run the following command to install the transformers library with pip.

!pip install transformers

Step 2: Load a Pre-trained Tokenizer

from transformers import AutoTokenizer
# Load a tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

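As a quick sanity check (a suggestion, not part of the original steps), you can inspect the loaded tokenizer's vocabulary size and special tokens to confirm it loaded correctly:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# bert-base-uncased ships a 30,522-entry WordPiece vocabulary
print("Vocabulary size:", tokenizer.vocab_size)

# BERT's special tokens, added automatically when encoding text
print("Special tokens:", tokenizer.cls_token, tokenizer.sep_token)
```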

Step 3: Tokenize a Text

# Define input text
prompt = "Explain how artificial intelligence is transforming healthcare."
# Tokenize text
input_tokens = tokenizer(prompt, return_tensors="pt")
# Print tokenized output
print("Token IDs:", input_tokens.input_ids[0].tolist())
print("Decoded Tokens:", [tokenizer.decode([tid]) for tid in input_tokens.input_ids[0].tolist()])

Output: a list of token IDs followed by the decoded token for each ID, with the [CLS] and [SEP] special tokens at the start and end of the sequence.
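If you want the subword strings directly rather than decoding IDs one by one, `tokenizer.tokenize` returns them (a small variation on the step above; note that `tokenize` does not add the [CLS]/[SEP] special tokens):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
prompt = "Explain how artificial intelligence is transforming healthcare."

# tokenize() returns the subword strings, without special tokens
tokens = tokenizer.tokenize(prompt)
print("Tokens:", tokens)

# convert_tokens_to_ids() maps each subword back to its vocabulary ID
print("Token IDs:", tokenizer.convert_tokens_to_ids(tokens))
```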

Here, [CLS] (classification) and [SEP] (separator) are special tokens that the BERT tokenizer adds to the start and end of every sequence. Other tokenizers use different vocabularies and splitting rules, so the same text can produce quite different token sequences.
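To see that difference concretely, here is a sketch comparing the BERT tokenizer with GPT-2's byte-level BPE tokenizer on the same prompt (the `gpt2` checkpoint name is the standard Hugging Face identifier):

```python
from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Explain how artificial intelligence is transforming healthcare."

# BERT lowercases the text and uses WordPiece (## marks word continuations)
print("BERT :", bert_tokenizer.tokenize(text))

# GPT-2 keeps the original case and uses byte-level BPE
# (the Ġ character marks a token that begins with a space)
print("GPT-2:", gpt2_tokenizer.tokenize(text))
```

The two outputs differ in casing, token boundaries, and marker characters, which is why token counts for the same text vary across models.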