Hey @JK_Rider, yes the notebook would be super helpful.
From this reference, point #2 supports my initial assumption that queries and documents are zero-padded to a fixed length:
“2. BERT manages this additional depth by pre-processing documents and queries into uniform lengths with the Wordpiece tokenizer, ideal for batch processing on GPUs.”
So for example if a document has 30 tokens, you zero-pad it to 512 tokens. You then apply an attention mask so that attention (and hence the loss and gradients) only involves those original 30 tokens, but you still need the full 512-length input because the sequences are batched into fixed-shape tensors for the GPU, and 512 is BERT's maximum sequence length. This is a key distinction between encoder-only models and decoder-only or encoder-decoder / seq2seq models.
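To make that concrete, here's a minimal sketch with the Hugging Face tokenizer (assuming `bert-base-uncased`, nothing specific to the paper): it pads a short document to 512 and returns the attention mask marking which positions are real tokens.

```python
# Minimal sketch: zero-padding a short document to BERT's 512-token length
# and inspecting the attention mask (assumes bert-base-uncased).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

doc = "a short document with roughly thirty tokens of content"

enc = tokenizer(
    doc,
    padding="max_length",   # pad up to max_length with the [PAD] token (id 0)
    truncation=True,        # cut anything longer than 512 tokens
    max_length=512,
    return_tensors="pt",
)

print(enc["input_ids"].shape)         # torch.Size([1, 512]) -- fixed shape for batching
print(enc["attention_mask"][0][:40])  # 1s over the real tokens, 0s over the padding
```

The attention mask is what keeps the padding positions from being attended to, so the zeros contribute nothing to the representations of the real tokens.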