Conversation
Pull request overview
This PR adds comprehensive support for ModernBERT, a recent encoder model that modernizes BERT with architectural improvements including RoPE position embeddings, alternating local/global attention, gated linear units (GeGLU), and pre-normalization. The implementation follows established patterns in the codebase and includes proper model-to-HuggingFace parameter mappings.
- Full implementation of the ModernBERT model with four architectures: :base, :for_masked_language_modeling, :for_sequence_classification, and :for_token_classification
- Special attention architecture with alternating local (window-based) and global attention layers, each with distinct RoPE theta values (a sketch of the local attention mask follows this list)
- Test coverage for the base and MLM architectures with validation against reference outputs
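The local/global alternation works roughly as follows: global layers attend over the whole sequence, while local layers mask attention to a fixed window around each position. Below is a minimal Nx sketch of such a sliding-window mask, following the half-window convention of the reference implementation; it is an illustration, not code from this PR.

```elixir
defmodule LocalAttentionSketch do
  # A position attends to another only when they are within half the
  # window size of each other; global layers skip this mask entirely.
  def sliding_window_mask(sequence_length, window_size) do
    positions = Nx.iota({sequence_length, 1})
    distance = Nx.abs(Nx.subtract(positions, Nx.transpose(positions)))
    Nx.less_equal(distance, div(window_size, 2))
  end
end

# With a window of 4, every token sees at most 2 neighbors on each side.
LocalAttentionSketch.sliding_window_mask(6, 4)
```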
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| lib/bumblebee/text/modernbert.ex | Core implementation including encoder with alternating attention patterns, gated FFN, RMS normalization, mean pooling for sequence classification, and tied embeddings for MLM head |
| lib/bumblebee/text/pre_trained_tokenizer.ex | Adds ModernBERT special token configuration (UNK, SEP, PAD, CLS, MASK) |
| lib/bumblebee.ex | Registers ModernBERT model architectures and tokenizer type mapping |
| test/bumblebee/text/modernbert_test.exs | Integration tests for :base and :for_masked_language_modeling architectures with output validation |
Force-pushed from 684e256 to af54694
Force-pushed from af54694 to 6904c32
Matched the output with transformers, looks good now. Added ModernBertDecoder as well, which is the decoder variant for causal language modeling. It's a separate model in transformers (different file, different config), so I created a separate module for it. Also tested loading
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
@moduletag model_test_tags()
test ":for_causal_language_modeling" do
The test file is missing a test for the :base architecture. ModernBertDecoder declares support for the :base architecture in its architectures/0 function, but there is no corresponding test case. Other decoder models in the codebase (e.g., GPT2, Llama) include tests for all their supported architectures, including :base.
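A sketch of what such a test could look like, following the shape of the other decoder tests in the codebase; the checkpoint id, inputs, expected hidden size, and the Bumblebee.Text.ModernBertDecoder module name are placeholders/assumptions, not values taken from this PR.

```elixir
# Hypothetical test sketch for the existing test module; repository id,
# inputs, and expected hidden size are placeholders.
test ":base" do
  assert {:ok, %{model: model, params: params, spec: spec}} =
           Bumblebee.load_model({:hf, "placeholder/tiny-random-ModernBertDecoder"})

  assert %Bumblebee.Text.ModernBertDecoder{architecture: :base} = spec

  inputs = %{
    "input_ids" => Nx.tensor([[10, 20, 30, 40, 50]]),
    "attention_mask" => Nx.tensor([[1, 1, 1, 1, 1]])
  }

  outputs = Axon.predict(model, params, inputs)

  # {batch_size, sequence_length, hidden_size}
  assert Nx.shape(outputs.hidden_state) == {1, 5, 32}
end
```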
Add support for ModernBERT, a modern encoder model with architectural improvements over BERT:
- Rotary position embeddings (RoPE) instead of absolute position embeddings
- Alternating local and global attention layers for efficiency
- Gated linear units (GeGLU) in feed-forward blocks
- Pre-normalization with LayerNorm (no bias)
- First layer reuses the embedding norm for attention

Supported architectures:
- :base
- :for_masked_language_modeling
- :for_sequence_classification
- :for_token_classification

Reference: https://arxiv.org/abs/2412.13663
Force-pushed from 6904c32 to d47539c
Replace manual layer norm implementation with Axon.Layers.layer_norm, passing 0 as beta to implement layer norm without bias.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
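For reference, a minimal sketch of that idea using the functional API; the shapes and epsilon are illustrative, not the exact values used in the model.

```elixir
input = Nx.iota({4, 8}, type: :f32)
gamma = Nx.broadcast(1.0, {8})

# A beta of 0 means no bias term is added back after normalization,
# so only the learned scale (gamma) is applied.
Axon.Layers.layer_norm(input, gamma, 0, epsilon: 1.0e-5)
```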
Rename local_rope_theta/global_rope_theta to rotary_embedding_base_local/rotary_embedding_base to match the naming convention in other models (Gemma 3).
Replace global_attention_every_n_layers with a layer_types list to match the transformers v5 format. For backward compatibility, layer_types is generated from global_attn_every_n_layers when it is not present in the checkpoint config.
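A minimal sketch of that fallback (the helper name and layer type atoms are illustrative, not necessarily the ones used in the module): every global_attn_every_n_layers-th layer, starting from layer 0, gets global attention, and the remaining layers use the sliding window.

```elixir
# Hypothetical helper inside the model module: derive layer_types for
# older checkpoint configs that only specify global_attn_every_n_layers.
defp default_layer_types(num_layers, global_attn_every_n_layers) do
  Enum.map(0..(num_layers - 1), fn index ->
    if rem(index, global_attn_every_n_layers) == 0 do
      :full_attention
    else
      :sliding_attention
    end
  end)
end
```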
jonatanklosko left a comment
I refactored it to use Layers.Transformer.blocks instead of doing the loop ourselves.
FTR I considered adding the index as a parameter to the :block_type function, so that we could have the conditional there. It would be a bit more general solution than that; however, it would require passing the index to Layers.Transformer.block, which implies there is a single list of blocks. That is almost always the case, but Albert actually has N groups of M blocks, so the index would be ambiguous there. So for now I went with another approach, which is to special-case the layer norm based on the name.
This adds support for ModernBERT, a recent encoder model that improves on BERT with a few architectural changes:
ModernBert (Encoder)
Supported architectures:
- :base
- :for_masked_language_modeling
- :for_sequence_classification
- :for_token_classification

The MLM head uses tied embeddings (shares weights with the input token embeddings).
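A hedged usage sketch for the MLM head; the answerdotai/ModernBERT-base checkpoint id is assumed here rather than taken from this PR, but any ModernBERT checkpoint on the Hub should work the same way.

```elixir
{:ok, model_info} =
  Bumblebee.load_model({:hf, "answerdotai/ModernBERT-base"},
    architecture: :for_masked_language_modeling
  )

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "answerdotai/ModernBERT-base"})

serving = Bumblebee.Text.fill_mask(model_info, tokenizer)
Nx.Serving.run(serving, "The capital of France is [MASK].")
```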
ModernBertDecoder (Decoder)
A decoder-only variant trained for causal language modeling (text generation).
Supported architectures:
- :base
- :for_causal_language_modeling

Reference: https://arxiv.org/abs/2412.13663