{ "metadata": {}, "nbformat": 4, "nbformat_minor": 5, "cells": [ { "id": "metadata", "cell_type": "markdown", "source": "
Generative Artificial Intelligence (AI) represents a cutting-edge domain within machine learning, focused on creating new, synthetic yet realistic data. This includes generating text, images, music, and even biological sequences. At the heart of many generative AI applications are Large Language Models (LLMs), which have revolutionized natural language processing and beyond.
\nLLMs are sophisticated neural networks trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on Transformers, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery.
\n\n\n\nTransformers are a type of neural network model designed to handle sequential data, such as text, by using self-attention mechanisms to weigh the importance of input elements relative to each other, enabling the model to understand and generate coherent and contextually relevant outputs.
\n
In this tutorial, we will explore the intersection of generative AI and genomics by pretraining an LLM from scratch on DNA sequences. This process will equip the model with a foundational understanding of the “grammar” of DNA, enabling it to generate and analyze genetic data with remarkable accuracy.
\nMistral AI, a French artificial intelligence (AI) startup, recently released large language models (LLMs) whose performance surpasses Llama 2. In particular, Mixtral-8x7B implements a Sparse Mixture of Experts (SMoE) architecture together with other efficiency-oriented techniques.
\nThese techniques collectively enhance the performance and efficiency of large language models, enabling them to process and generate text more effectively.
\nIn this tutorial, we will use a simplified Mistral model architecture with fewer layers and hidden units to reduce computational requirements. The model will be trained to predict the next base in the sequence. For instance, for a sequence like ATTTGTTGGT
, the model will be trained to predict the suffix TTGGT
given the prefix ATTTG
. This process is called causal language modeling.
To pretrain the model, we will use a file containing 100,000 non-overlapping DNA sequences of 200 bases, corresponding to around 1% of the human genome (hg38 assembly). This involves training the model to predict the end of a DNA sequence.
\nBy the end of this tutorial, we will obtain a Mistral-DNA model with an internal representation of DNA sequence grammar. This pretrained model can then be used for various applications, such as fine-tuning for classification tasks or predicting mutational effects.
\n\n\nAgenda\nIn this tutorial, we will cover:
\n\n
\n- Prepare resources \n- Load and configure the model \n- Prepare the tokenizer \n- Prepare the data \n- Train the model \n- Compute DNA sequence embeddings \n
\n
To pretrain the model, let’s open a Notebook or a Python script.
\nThe first step is to install the required dependencies:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-1", "source": [ "!pip install accelerate\n", "!pip install datasets==3.0.1\n", "!pip install transformers\n", "!pip install torch" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-2", "source": "\n\nQuestion\nWhat are the required dependencies doing?
\n\n👁 View solution
\n\n\n
\n- \n
\n\n
accelerate
: A library by Hugging Face – a platform that provides tools and resources for building, training, and deploying machine learning models – designed to simplify training and deployment across different hardware environments. It provides utilities to optimize performance on GPUs, TPUs, and other accelerators, making it easier to scale models efficiently.- \n
\n\n
datasets
: A library by Hugging Face for managing and processing datasets. It provides tools to load, manipulate, and share datasets in a standardized format, making it easier to work with machine learning data.- \n
\n\n
numpy
: A fundamental package for scientific computing in Python.- \n
\n\n
torch
: Also known as PyTorch, it is an open-source machine learning library developed by Facebook’s AI Research lab. It provides a flexible platform for building and training neural networks, with a focus on tensor computations and automatic differentiation.- \n
\n\n
transformers
: A library by Hugging Face that provides implementations of state-of-the-art transformer models for natural language processing (NLP). It includes pre-trained models and tools for fine-tuning, making it easier to apply transformers to various NLP tasks.\nThese libraries are widely used in the machine learning and data science communities for their efficiency, flexibility, and extensive functionality.
\n
Let’s now import them.
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-3", "source": [ "import os\n", "\n", "import accelerate\n", "import flash_attn\n", "import torch\n", "import transformers\n", "from datasets import load_dataset\n", "from transformers import (\n", " AutoConfig,\n", " AutoModelForCausalLM,\n", " AutoTokenizer,\n", " DataCollatorForLanguageModeling,\n", " EarlyStoppingCallback,\n", " Trainer,\n", " TrainingArguments,\n", ")" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-4", "source": "\n\n\n\n
\n- \n
datasets
:\n\n
\n- \n
load_dataset
: function to load datasets from the Hugging Face Hub or local files.- \n
transformers
:\n\n
\n- \n
AutoConfig
: Automatically loads the configuration for a pre-trained model. It defines the architecture and hyperparameters of the model.- \n
AutoModelForCausalLM
: Loads a pre-trained causal language model for tasks like text generation, where the model predicts the next token in a sequence.- \n
AutoTokenizer
: Loads the tokenizer associated with a pre-trained model. It converts text into tokens that the model can process.- \n
DataCollatorForLanguageModeling
: A data collator specifically designed for language modeling tasks. It prepares batches of data for training by handling padding and masking.- \n
EarlyStoppingCallback
: A callback used during training to stop the process early if the model’s performance on the validation set stops improving, saving time and resources.- \n
Trainer
: A high-level API for training and evaluating transformer models. It simplifies the training loop and handles tasks like gradient accumulation and evaluation.- \n
TrainingArguments
: A class to define the training configuration, including hyperparameters like learning rate, batch size, and number of epochs. It is used to configure the Trainer
.\nThese components work together to streamline the process of training and fine-tuning transformer models for various NLP tasks.
\n
\n\nComment: Versions\nThis tutorial has been tested with the following versions:
\n\n
\n- \n
accelerate
> 0.32.1- \n
flash_attn
> 2.6.0.post1 and 2.7.0.post2- \n
transformers
> 4.47.1\nYou can check the versions with:
\n\naccelerate.__version__\nflash_attn.__version__\ntransformers.__version__\n
To pretrain the model, we need specific computational resources, in particular a GPU.
\nLet’s check the resources:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-5", "source": [ "!nvidia-smi" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-6", "source": "The command nvidia-smi
(NVIDIA System Management Interface) is used to monitor and manage NVIDIA GPU devices. It provides information about the GPU’s utilization, memory usage, temperature, and running processes. This tool is essential for developers and researchers to track the performance and health of GPUs, especially when running computationally intensive tasks like machine learning training.
\n\nQuestion\nHow do you interpret the following output?
\n\nTue Mar 25 13:49:35 2025\n+-----------------------------------------------------------------------------> ------------+\n| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA > Version: 12.4 |\n|-----------------------------------------+------------------------> +----------------------+\n| GPU Name Persistence-M | Bus-Id Disp.A | Volatile > Uncorr. ECC |\n| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | > GPU-Util Compute M. |\n| | | > MIG M. |\n|=========================================+========================> +======================|\n| 0 Tesla T4 Off | 00000000:00:04.0 > Off | 0 |\n| N/A 40C P8 9W / 70W | 2MiB / 15360MiB | > 0% Default |\n| | | > N/A |\n+-----------------------------------------+------------------------> +----------------------+\n >\n+-----------------------------------------------------------------------------> ------------+\n| > Processes: > |\n| GPU GI CI PID Type Process > name GPU Memory |\n| ID > ID Usage |\n|> ==============================================================================> ===========|\n| No running processes > found |\n+-----------------------------------------------------------------------------> ------------+\n
\n👁 View solution
\n\n\n
\n- \n
Driver Version
: The version of the NVIDIA driver installed on the system (550.54.15
).- \n
CUDA Version
: The version of CUDA installed, which is a parallel computing platform and API model created by NVIDIA (12.4
).- \n
GPU Name
: The model of the GPU, in this case, a Tesla T4
.- \n
Persistence-M
: Indicates whether Persistence Mode is enabled (Off
in this case), which can improve performance for certain applications.- \n
Bus-Id
: The PCI bus ID of the GPU (00000000:00:04.0
).- \n
Fan
: The speed of the GPU fan (N/A
means not available or not reporting).- \n
Temp
: The current temperature of the GPU (40°C
).- \n
Perf
: The performance state of the GPU (P8 indicates a low-power state).- \n
Pwr:Usage/Cap
: The current power usage (9W) and the power cap (70W).- \n
Memory-Usage
: The amount of GPU memory currently in use (2MiB) out of the total available (15360MiB).- \n
GPU-Util
: The percentage of GPU utilization (0% indicates the GPU is idle).- \n
Compute M.
: The compute mode of the GPU (Default).- \n
Processes
: Lists any processes currently using the GPU. In this case, there are no running processes.
Let’s configure PyTorch and the CUDA environment – the software and hardware ecosystem provided by NVIDIA to enable parallel computing on GPUs – to optimize GPU memory usage and performance:
\nFirst, enable CuDNN benchmarking in PyTorch:
\n torch.backends.cudnn.benchmark=True\n
\n\n\nQuestion\n\n
\n- What is CuDNN?
\n- Why enabling benchmarking?
\n👁 View solution
\n\n\n
\n- CuDNN is a GPU-accelerated library for deep neural networks.
\n- Enabling benchmarking allows CuDNN to select the fastest algorithms for the specific GPU and input size. This can improve the performance of the model, especially for fixed-size inputs.
\n
Then, set an environment variable that configures how PyTorch manages CUDA memory allocations:
\n os.environ[\"PYTORCH_CUDA_ALLOC_CONF\"] = \"max_split_size_mb:32\"\n
\n\n\nQuestion\nWhat is this command doing?
\n👁 View solution
\n\nIt sets the maximum split size for memory allocations to 32 megabytes. This can help reduce memory fragmentation and improve memory utilization, which is particularly useful when working with large models or limited GPU memory.
\n
Let’s now load the model, Mistral-DNA
. The Mixtral model (Mixtral-8x7B-v0.1) – a pretrained generative Sparse Mixture of Experts model that outperforms Llama 2 70B – was modified to significantly reduce the number of parameters, mostly by removing layers, so that it can be trained on a GPU such as an RTX 3090.
We will get the model from GitHub:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-7", "source": [ "!git clone https://github.com/raphaelmourad/Mistral-DNA.git" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-8", "source": "Let’s check if we have the model now:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-9", "source": [ "!ls" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-10", "source": "We should get two folders: Mistral-DNA
and sample_data
. Let’s change the current working directory to Mistral-DNA/
:
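A minimal sketch of this step, assuming the clone above succeeded and the notebook currently runs from the clone's parent directory:
 os.chdir(\"Mistral-DNA/\")  # move into the cloned repository so relative paths such as data/models/Mixtral-8x7B-v0.1 resolve\n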
Let’s look at the original architecture of Mixtral-8x7B-v0.1,
which is stored in the data/models/Mixtral-8x7B-v0.1
folder (GitHub).
\n\nQuestion\n\n
\n- Which file is essential for configuring the language model?
\n- What are the key parameters of the simplified architecture used here?
\n\n👁 View solution
\n\n\n
\n- The
\nconfig.json
file is essential for configuring the language model as a Mistral model. It specifies the architecture for causal language modeling (MixtralForCausalLM
) and details the size of the neural network components. The original Mistral model has a larger hidden size, but it is reduced here to make pre-training feasible.- The key parameters are:\n
\n\n
\n- Intermediate Size (
\nintermediate_size
): Size of the intermediate (or hidden) layers within the model. It determines the number of neurons in these layers, influencing the model’s capacity to capture complex patterns in the data. A larger intermediate size can capture more nuanced details but also requires more computational resources. Set to 256, which is relatively small compared to the original model.- Number of Attention Heads (
\nnum_attention_heads
): Number of attention heads in the multi-head attention mechanism. Each head allows the model to focus on different parts of the input sequence simultaneously, capturing diverse aspects of the data. More attention heads can provide a richer representation but also increase computational complexity. Reduced to 8 for efficiency.- Number of Experts per token (
\nnum_experts_per_tok
): Specific to models that use a Mixture of Experts (MoE) architecture. It indicates the number of expert networks that are activated for each token in the input sequence. Experts are specialized sub-networks that handle different parts of the data, improving efficiency and performance, especially for large models. Set to 1 expert per token.- Number of Local Experts (
\nnum_local_experts
): Number of local experts available in the model. Local experts are a subset of the total experts and are used to process specific parts of the input data. This localization can help in managing computational resources more effectively, especially when dealing with large-scale data. Set to 64.- Vocabulary Size (
\nvocab_size
): Specifically designed for DNA sequences, with a size of \\(4,096 = 4^6\\), as DNA consists of four possible letters (A, T, C, and G) and the words are 6-mers (sequences of six nucleotides). By modeling DNA using 6-mers, we capture meaningful patterns within the genetic sequence, enabling the model to understand and generate DNA data effectively.
Let’s load the configuration of the pre-trained model:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-13", "source": [ "config = AutoConfig.from_pretrained(\"data/models/Mixtral-8x7B-v0.1\")" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-14", "source": "By loading the configuration, we can inspect or modify the model’s architecture without loading the actual model weights. Let’s now initialize a causal language model from the loaded configuration object, with a specific attention implementation:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-15", "source": [ "model = AutoModelForCausalLM.from_config(config, attn_implementation=\"eager\")" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-16", "source": "\n\nQuestion\nWhat does
\nattn_implementation=\"eager\"
?\n👁 View solution
\n\n\n
attn_implementation=\"eager\"
specifies the attention implementation to use. Setting it to “eager” means that the attention mechanism will be executed eagerly, which can be useful for debugging or when working with dynamic computation graphs. Eager execution runs operations immediately as they are called in Python, rather than adding them to a graph for later execution.
What does the model look like?
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-17", "source": [ "model" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-18", "source": "MixtralForCausalLM(\n (model): MixtralModel(\n (embed_tokens): Embedding(4096, 256)\n (layers): ModuleList(\n (0-7): 8 x MixtralDecoderLayer(\n (self_attn): MixtralAttention(\n (q_proj): Linear(in_features=256, out_features=256, bias=False)\n (k_proj): Linear(in_features=256, out_features=256, bias=False)\n (v_proj): Linear(in_features=256, out_features=256, bias=False)\n (o_proj): Linear(in_features=256, out_features=256, bias=False)\n (rotary_emb): MixtralRotaryEmbedding()\n )\n (block_sparse_moe): MixtralSparseMoeBlock(\n (gate): Linear(in_features=256, out_features=64, bias=False)\n (experts): ModuleList(\n (0-63): 64 x MixtralBlockSparseTop2MLP(\n (w1): Linear(in_features=256, out_features=256, bias=False)\n (w2): Linear(in_features=256, out_features=256, bias=False)\n (w3): Linear(in_features=256, out_features=256, bias=False)\n (act_fn): SiLU()\n )\n )\n )\n (input_layernorm): MixtralRMSNorm((256,), eps=1e-05)\n (post_attention_layernorm): MixtralRMSNorm((256,), eps=1e-05)\n )\n )\n (norm): MixtralRMSNorm((256,), eps=1e-05)\n )\n (lm_head): Linear(in_features=256, out_features=4096, bias=False)\n)\n
As expected, the model is a MixtralForCausalLM
model with several key components:
Embedding Layer (embed_tokens
): Converts input DNA sequences into dense vectors of fixed size. It maps each of the 4,096 (\\(4^{6}\\)) possible DNA tokens (representing 6-mers) to a 256-dimensional vector space. This embedding layer is crucial for transforming discrete DNA sequences into a format suitable for neural network processing.
Decoder Layers (layers
): Consists of eight MixtralDecoderLayer
modules, each containing several sub-components:\nSelf-Attention Mechanism (self_attn
)
\n\nQuestion\n\n
\n- What are the components?
\n- What is its purpose?
\n\n👁 View solution
\n\n\n
\n- The components are linear projections (
\nq_proj
,k_proj
,v_proj
,o_proj
) for queries, keys, values, and outputs, along with a rotary embedding (rotary_emb
) to incorporate positional information.- This allows the model to weigh the importance of different tokens in the sequence relative to each other, capturing dependencies and context.
\n
Sparse Mixture of Experts (block_sparse_moe
):
\n\n\nQuestion\n\n
\n- What are the components?
\n- What is its purpose?
\n👁 View solution
\n\n\n
\n- The components are a gating mechanism (
\ngate
) and a list of 64 expert networks (experts
), each with multiple linear layers (w1
,w2
,w3
) and an activation function (act_fn
).- This efficiently processes input data by activating only a subset of expert networks, reducing computational load while maintaining model capacity.
\n
Layer Normalization (input_layernorm
, post_attention_layernorm
): Stabilizes and accelerates the training process by normalizing the inputs and outputs of the attention mechanism.
Final Layer Normalization (norm
): Applies normalization to the output of the final decoder layer, ensuring stable and consistent outputs.
Language Model Head (lm_head
): Projects the 256-dimensional output of the final decoder layer back into the 4,096-dimensional vocabulary space of DNA tokens. This linear layer (Linear
) maps the hidden states to the original token space, enabling the model to predict the next DNA token accurately.This architecture ensures that the model can capture complex patterns in DNA sequences while maintaining computational efficiency, making it suitable for tasks like DNA sequence generation and analysis. The model’s design culminates in the output of 4,096 tokens, aligning with the input dimension. This consistency is crucial for accurately predicting the next token in a given DNA sequence, ensuring that the model’s predictions are coherent and reliable.
\n\n\n\nQuestion\nHow many parameters does this model have?
\n👁 View solution
\n\n\npytorch_total_params = sum(p.numel() for p in model.parameters())\nprint(f\"Model size: {pytorch_total_params/1000**2:.1f}M parameters\")\n
There are 105 million parameters, so this is already a fairly large model.
\n
A tokenizer is a crucial component in natural language processing (NLP) that transforms raw text into a format that can be processed by machine learning models. In this section, we will load and configure the Byte-Pair Encoding (BPE) letter tokenizer. The BPE tokenizer efficiently handles rare and unknown words by breaking them down into frequent subword units, ensuring that the model can generalize better to unseen data. This process involves initializing the tokenizer with a predefined vocabulary and settings, enabling it to convert text into a format suitable for neural network processing. By doing so, we prepare the tokenizer to effectively manage DNA sequences, facilitating accurate and reliable model predictions.
\nLet’s load a pre-trained tokenizer from the Hugging Face Model Hub. The tokenizer is associated with the model DNABERT-2-117M
, which is designed for processing DNA sequences.
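A minimal sketch of this loading step (the full identifier zhihan1996/DNABERT-2-117M is the one reported by the tokenizer configuration further below):
 tokenizer = AutoTokenizer.from_pretrained(\"zhihan1996/DNABERT-2-117M\", trust_remote_code=True)\n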
\n\nQuestion\nWhat does the above command do?
\n\n👁 View solution
\n\n\n
\n- \n
AutoTokenizer.from_pretrained
automatically identifies and loads the appropriate tokenizer for the specified model.- \n
trust_remote_code=True
allows the loading of custom tokenizers that may include remote code execution. It is necessary when the tokenizer requires additional custom code to function correctly.
Let’s look at the created tokenizer now:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-21", "source": [ "print(tokenizer)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-22", "source": "PreTrainedTokenizerFast(name_or_path='zhihan1996/DNABERT-2-117M',vocab_size=4096, model_max_length=1000000000000000019884624838656,is_fast=True, padding_side='right', truncation_side='right',special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': [PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'},clean_up_tokenization_spaces=False, added_tokens_decoder={\n\t0: AddedToken(\"[UNK]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n\t1: AddedToken(\"[CLS]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n\t2: AddedToken(\"[SEP]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n\t3: AddedToken(\"[PAD]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n\t4: AddedToken(\"[MASK]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n}\n)\n
The PreTrainedTokenizerFast
is a fast and efficient tokenizer used to process text data for the DNABERT-2-117M
model. Here’s a breakdown of its configuration:
name_or_path='zhihan1996/DNABERT-2-117M'
: Specifies the name or path of the pre-trained tokenizer, indicating that it is associated with the DNABERT-2-117M
model, which is designed for processing DNA sequences.
vocab_size=4096
: Defines the size of the tokenizer’s vocabulary.
\n\nQuestion\nWhy is the size of the tokenizer’s vocabulary set to 4,096?
\n\n👁 View solution
\n\nIt corresponds to the number of unique tokens (6-mers) that the model can recognize in DNA sequences.
\n
special_tokens
: Defines a set of special tokens used by the tokenizer:
unk_token: '[UNK]'
- Represents unknown or out-of-vocabulary tokens.sep_token: '[SEP]'
- Used to separate segments within a sequence.pad_token: '[PAD]'
- Used for padding sequences to a uniform length.cls_token: '[CLS]'
- Typically used as the first token in a sequence to represent the classification token.mask_token: '[MASK]'
- Used in masked language modeling to hide tokens that the model must predict.\n\n\nQuestion\nWhat do the other configuration parameters mean?
\n\n
\n- \n
model_max_length=1000000000000000019884624838656
- \n
is_fast=True
- \n
padding_side='right'
- \n
truncation_side='right'
- \n
clean_up_tokenization_spaces=False
- \n
added_tokens_decoder
👁 View solution
\n\n\n
\n- \n
\n\n
model_max_length=1000000000000000019884624838656
: Represents the maximum length of sequences that the model can handle.\nThis extremely large value suggests that the model is designed to process very long sequences, although in practice, the actual limit will be constrained by available computational resources.
\n- \n
is_fast=True
: Indicates that this tokenizer is optimized for speed, leveraging Rust-based implementations to accelerate tokenization processes.- \n
padding_side='right'
: Configures the tokenizer to pad sequences on the right side, ensuring that all sequences in a batch have the same length by adding padding tokens to the end of shorter sequences.- \n
truncation_side='right'
: Specifies that sequences will be truncated from the right side if they exceed the maximum length, preserving the beginning of the sequence.- \n
clean_up_tokenization_spaces=False
: Indicates that the tokenizer will not remove spaces after tokenization, preserving the original spacing in the text.- \n
added_tokens_decoder
: Maps token IDs to their corresponding AddedToken
objects, which include metadata such as whether the token is a special token and how it should be processed (e.g., stripping whitespace).
This configuration ensures that the tokenizer is tailored to efficiently process DNA sequences, handling both the tokenization and padding/truncation of sequences in a manner that aligns with the model’s requirements.
\nBy default, tokenizers may pad sequences on the right side (padding_side='right'
). Let’s set the padding direction for the tokenizer.
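A minimal sketch of this setting, assuming we want left padding as described next:
 tokenizer.padding_side = \"left\"  # pad shorter sequences on the left\n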
When tokenizing a batch of sequences, shorter sequences will be padded with special tokens on the left to match the length of the longest sequence in the batch. This can be useful for ensuring consistent input sizes, especially in models that expect fixed-size inputs.
\nLet’s look at how some DNA sequences are encoded by the tokenizer. We start with a simple sequence “ATT”:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-25", "source": [ "encoding = tokenizer(\"ATT\", padding=\"longest\", return_tensors=\"pt\")\n", "print(encoding)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-26", "source": "The code tokenizes the DNA sequence “ATT”, pads it to the longest sequence in the batch (padding=\"longest\"
), and returns the result as PyTorch tensors (return_tensors=\"pt\"
).
{'input_ids': tensor([[ 1, 2061, 2]]), 'token_type_ids': tensor([[0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1]])}\n
Here’s a breakdown of each output component:
\ninput_ids
: A tensor containing the token IDs for the sequence. Each number corresponds to a specific token in the tokenizer’s vocabulary. In this case, [1, 2061, 2]
represents the tokens for the sequence:\n1
: the beginning of the sentence ([CLS]
)2061
: the sentence itself (ATT
)2
: the end of the sentence, a separator between sentences ([SEP]
).token_type_ids
: A tensor indicating the type of each token, often used in models that process multiple segments (e.g., question-answering). Here, all tokens are of type 0
, suggesting a single segment.
attention_mask
: A tensor that specifies which tokens should be attended to by the model (1
for real tokens, 0
for padding). In this case, all tokens are valid, so the mask is [1, 1, 1]
.This encoded format is ready for input into a transformer model, ensuring that the sequence is correctly processed and understood by the model.
\n\n\nQuestion\nWhat is the encoding for “ATTGTGGGTCCCCGTAGATGATAGGGGCCCCCC”? Specify that the tokenized sequence should have a maximum length of 5 tokens and ensure that the sequence is padded to the specified
\nmax_length
of 5 tokens.\n👁 View solution
\n\n\n
\n- To specify that the tokenized sequence should have a maximum length of 5 tokens, you need to put
\nmax_length=5
– if the sequence is longer, it will be truncated –- To ensure that the sequence is padded to the specified
\nmax_length
of 5 tokens, you need to add padding='max_length'
– if the sequence is shorter, padding tokens will be added\nencoding = tokenizer(\"ATTGTGGGTCCCCGTAGATGATAGGGGCCCCCC\", max_length=5, padding='max_length', truncation=True, return_tensors=\"pt\")\nprint(encoding)\n
\n{'input_ids': tensor([[ 1, 2061, 281, 485, 2]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}\n
In this case,
\n[1, 2061, 281, 485, 2]
represents the tokens for the sequence, likely including special tokens like [CLS] and [SEP]. As before, all tokens are of type 0
, suggesting a single segment, and are valid, so the mask is [1, 1, 1, 1, 1]
.
We will now prepare the data.
\nFirst we load the data. We will not use here the whole human genome because it comprises too many sequences. Instead, we use a small subset of the data, which is less than 1% of the sequences from the human genome.
\n\n\nComment: Pre-trained model on the whole human genome\nA compact DNA model with approximately 1 million parameters that has been trained on the entire human genome can be found on Hugging Face
\n
We use the load_dataset
function from the datasets
library. This function is commonly used for loading data for Hugging Face Transformers.
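A minimal sketch, assuming the 200-base sequences are stored one per line in a plain-text file from the cloned repository (the file path here is a hypothetical placeholder):
 dataset_text = load_dataset(\"text\", data_files=\"path/to/hg38_200b_sequences.txt\")  # hypothetical path; yields a DatasetDict with a 'train' split and a 'text' feature\n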
\n\nQuestion\n\n
\n- How is
\ndataset_text
structured?- What are the first 5 sequences in the train dataset?
\n- How long are the sequences?
\n\n👁 View solution
\n\n\n
\n- \n
dataset_text
is a DatasetDict
with a train
Dataset
containing 1 feature ('text'
) of 99,999 rows (obtained with dataset_text
)- \n
\nTo get the first 5 sequences of the train dataset:
\n\ndataset_text['train']['text'][0:5]\n
\n['TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAA',\n'CCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCCTAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCAACCCCAACCCCAACCCCAACCCCAACCCCAACCCTAACCCCTAACCCTAACCCTAACCCTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCC',\n'TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTCGCGGTACCCTCAGCCGGCCCGCCCGCCCGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAGAGTACCACCGAAATCTGTGCAGAGGACAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCT',\n'GAGGAGAACGCAACTCCGCCGTTGCAAAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGA',\n'CACATGCTAGCGCGTCGGGGTGGAGGCGTGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGACACATGCTACCGCGTCCAGGGGTGGAGGCGTGGCGCAGGCGCAGAGAGGCGCACCGCGCCGGCGCAGGCGCAGAGACACATGCTAGCGCGTCCAGGGGTGGAGGCGTGGCGCAGGCGCAGAGACGC']\n
- \n
\nThe sequences are 200 base pairs long:
\n\nlen(dataset_text['train']['text'][0])\n
\n200\n
Let’s tokenize the data. First, we create a function that tokenizes a text using the BPE letter tokenizer:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-29", "source": [ "def tokenize_function(examples):\n", " return tokenizer(examples['text'], padding=\"longest\", truncation=True, return_tensors=\"pt\")" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-30", "source": "\n\nQuestion\nWhat do the following parameters?
\n\n
\n- \n
padding=\"longest\"
- \n
truncation=True
- \n
return_tensors=\"pt\"
\n👁 View solution
\n\n\n
\n- \n
padding=\"longest\"
ensures that all sequences in the batch are padded to the length of the longest sequence, adding padding tokens as needed.- \n
truncation=True
specifies that sequences exceeding the model’s maximum length will be truncated to fit.- \n
return_tensors=\"pt\"
indicates that the output should be in the form of PyTorch tensors, suitable for use with PyTorch-based models.
We can now apply this function to the load dataset:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-31", "source": [ "dataset = dataset_text.map(tokenize_function, batched=True)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-32", "source": "It is quite fast for the almsot 100,000 sequence of length 200 bp.
\n\n\nQuestion\n\n
\n- How is
\ndataset
structured?- What is in the first tokenized sequence of
\ntrain
Dataset
?\n👁 View solution
\n\n\n
\n- \n
dataset
is\n\nDatasetDict({\n train: Dataset({\n features: ['text', 'input_ids', 'token_type_ids', 'attention_mask'],\n num_rows: 99999\n })\n})\n
\n
dataset
is a DatasetDict
with 1 train
Dataset
made of 99,999 rows and 4 features:\n
\n- \n
text
: The original text data before tokenization.- \n
input_ids
: The tokenized input data, represented as numerical IDs.- \n
token_type_ids
: Indicates the type of each token, useful for models that handle multiple segments.- \n
attention_mask
: Specifies which tokens should be attended to by the model (1
for real tokens,0
for padding).- The first tokenized sequence of
\ntrain
Dataset
(dataset[\"train\"][1]
) is a dictionary with:\n\n
\n- \n
text
: 200 base pair sequence- \n
input_ids
: list of 49 numerical values, the token IDs.- \n
token_type_ids
: list of 49 zeros (0)
- \n
attention_mask
: a list of 7 zeros (0, for padding) and 42 ones (1, for real tokens)
We will now split data between training and validation sets randomly. This is a crucial step in machine learning to ensure the model can generalize to unseen data.
\nFor that, 80% of the entire data will be used for the training set and the remaining 20% will go into the validation set. We first compute the size of training and validation sets:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-33", "source": [ "train_size = int(0.8 * len(dataset[\"train\"]))\n", "val_size = len(dataset[\"train\"]) - training_size" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-34", "source": "\n\nQuestion\nHow big are training and validation sets?
\n\n👁 View solution
\n\nTraining set has 79,999 sequences and the validation set 20,000.
\n
To perform the actual splitting of the training dataset into two subsets, we use the torch.utils.data.random_split
function from the PyTorch library that randomly splits a dataset into subsets.
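A minimal sketch of the split, reusing the sizes computed above (the names train_dataset and val_dataset are assumptions reused in the trainer sketch later on):
 train_dataset, val_dataset = torch.utils.data.random_split(dataset[\"train\"], [train_size, val_size])\n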
The DataCollatorForLanguageModeling
is a utility class, designed to prepare and format batches of data for language modeling tasks. It handles the dynamic padding and masking of input sequences, ensuring that each batch fed into the model is correctly formatted and optimized for training.
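A minimal sketch of creating it, with the two parameters discussed in the question below:
 data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)\n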
\n\nQuestion\nWhat are the different parameters?
\n\n👁 View solution
\n\n\n
\n- \n
tokenizer=tokenizer
specifies the tokenizer to be used for processing the input data. The tokenizer converts raw text into numerical tokens that the model can understand.- \n
mlm=False
: Indicates that the data collator is set up for causal language modeling (CLM) rather than masked language modeling (MLM).
This will dynamically pad the sequences in each batch and build the labels needed for causal language modeling.
\nThe DataCollatorForLanguageModeling
is typically used in conjunction with a Trainer
from the Hugging Face library. It simplifies the data preparation process, allowing you to focus on model training and evaluation without worrying about the intricacies of batch formatting.
We are now going to define the hyperparameters and configuration for training the language model using the Hugging Face transformers
.
Before, we specify the batch size for training and evaluation. A batch size of 32 means that 32 samples will be processed before the model updates its weights. This size is chosen to balance computational efficiency and memory usage.
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-39", "source": [ "batchsize=32\n", "training_args = TrainingArguments(\n", " output_dir=\"./results/models\",\n", " evaluation_strategy=\"epoch\",\n", " save_strategy=\"epoch\",\n", " num_train_epochs=50,\n", " per_device_train_batch_size=batchsize,\n", " per_device_eval_batch_size=batchsize,\n", " learning_rate=5e-4,\n", " weight_decay=0.01,\n", " logging_dir=\"./logs\",\n", " load_best_model_at_end=True,\n", " fp16=True,\n", " gradient_accumulation_steps=50,\n", " report_to=\"none\",\n", ")" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-40", "source": "output_dir=\"./results/models\"
: directory where the training outputs, including model checkpoints and results, will be saved.evaluation_strategy=\"epoch\"
indicates that the model’s performance will be evaluated at the end of each epoch, a complete pass through the entire training dataset. This allows for monitoring the model’s progress and adjusting the training process as needed.save_strategy=\"epoch\"
specifies that the model will be saved at the end of each epoch. This ensures that checkpoints are available for each complete pass through the dataset.num_train_epochs=50
sets the total number of training epochs to 50. This means the model will iterate over the entire dataset 50 times, allowing it to learn and optimize over multiple passes.per_device_train_batch_size=batchsize
and per_device_eval_batch_size=batchsize
set the batch size for training and evaluation on each device (e.g., GPU) to 32. This ensures consistency in batch processing across different stages of training and evaluation.learning_rate=5e-4
defines the learning rate for the optimizer, set to \\(5 \\times 10^{-4}\\). This rate controls the step size during gradient descent and is a common choice for pre-training models.weight_decay=0.01
applies L2 regularization to the model weights with a standard decay rate of 0.01. This helps prevent overfitting by penalizing large weights.logging_dir=\"./logs\"
specifies the directory where training logs will be stored, allowing for monitoring and analysis of the training process.load_best_model_at_end=True
ensures that the best model, based on the lowest evaluation loss, is loaded at the end of training. This helps in selecting the model with the best performance across all epochs. During gradient descent, the model will be optimized, and at some point, the loss will start to increase again. We want to pick the model with the lowest loss, not when it starts increasing. So, “load best model at the end” means selecting the model with the best loss across all epochs.fp16=True
enables mixed-precision training using 16-bit floating-point numbers. This reduces memory usage and can speed up training on compatible hardware.gradient_accumulation_steps=50
accumulates gradients over 50 steps before performing a backward pass. This effectively increases the batch size without requiring additional memory, helping to stabilize training.report_to=\"none\"
disables Weights & Biases (WandB), a popular platform used for experiment tracking, dataset versioning, and model management in machine learning
\n\nComment: Why Disable WandB?\nDisabling WandB is often done in specific scenarios:
\n\n
\n- Avoiding Unwanted Logging: If we do not intend to use WandB for tracking our experiments or if we want to avoid potential conflicts with other logging mechanisms, we would disable it.
\n- Reducing Overhead: WandB logging can introduce some overhead, particularly when dealing with large datasets or complex models. Disabling it can slightly improve performance if tracking is not essential.
\n- Testing/Debugging: During testing or debugging, we might prefer to have more control over logging or we might want to avoid cluttering our WandB workspace with intermediate results.
\n
\n\nQuestion\nWhat is stored in
\ntraining_args
: the parameters of the model, the parameters of the LLM, or the parameters of the trainer function?\n👁 View solution
\n\nThe parameters of the trainer function
\n
Here is the most important part: the pre-training process. For this, we will use a Trainer
function. This function takes as input the model that we built previously, which has a defined architecture but weights that are not yet trained.
The Trainer function also takes the following arguments (a sketch of the full call is shown after this list):
\nargs
: the training arguments we configured earlierdata_collator
: the data collator function feeding the tokenized data sequences to the model.train_dataset
: the training set, i.e. the data used for computing the gradientseval_dataset
: the validation set, i.e. the data used to assess the prediction accuracy at each epoch. It’s important to use a validation set that is independent of the training set to ensure unbiased evaluation.callbacks
: EarlyStoppingCallback
with a patience of three is used to monitor the training process.
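A minimal sketch of the full call, assuming the training/validation subsets and the data collator created in the sketches above:
\n\ntrainer = Trainer(\n    model=model,  # the freshly initialized Mixtral-style model\n    args=training_args,  # the TrainingArguments defined earlier\n    data_collator=data_collator,\n    train_dataset=train_dataset,\n    eval_dataset=val_dataset,\n    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # stop if the validation loss does not improve for 3 epochs\n)\n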
During training, we minimize the loss at each step. However, at some point, the loss may start to increase again. We want to capture the model parameters when the loss reaches its minimum. By using a patience of three, we aim to mitigate the effects of noise during training. Noise can cause fluctuations in the loss, making it seem like we’ve reached a local minimum when a better one might be found with further training.
\nWith a patience of three, even if we find a good minimum, we wait for three more epochs to ensure that the loss does not improve further. If the loss does not decrease for three consecutive epochs, we stop training. However, if a better model with a lower loss is found within those three epochs, training continues. This approach helps in finding a more robust local minimum by reducing the impact of noise in the training data.
\nLet’s launch the training with the trainer.train()
method
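In code, this is simply:
 trainer.train()\n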
Here, the trainer is set to run for 50 epochs. Once training starts, we get an estimate of the time per epoch, which gives an idea of the total training duration. Let’s run it for a bit to see how long it takes.
\nWith this small model and dataset, the estimated time to run 50 epochs is about 20 hours (this value changes depending on the infrastructure).
\n\n\nQuestion\nWill the model be trained to 50 epochs?
\n\n👁 View solution
\n\nSetting the number of epochs to 50 doesn’t mean the model will train for all 50 epochs. It’s likely to stop earlier
\n
The 50 epochs serve as a maximum limit. The model will stop training earlier if it reaches the minimum loss and then starts to increase again, thanks to the early stopping callback. This means the model might only require half the epochs, perhaps 25 epochs or 10 hours, to achieve optimal performance.
\n\n\nComment: Don't train until the end\nThe idea here is not to train the model until completion, as it would take too much time.
\n
Let’s stop the actual training and cheat a bit by loading a previously trained Mistral model:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-45", "source": [ "model = AutoModelForCausalLM.from_pretrained(\"RaphaelMourad/Mistral-DNA-v1-17M-hg38\")" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "LLMs are **sophisticated neural networks** trained on vast amounts of text data to understand, generate, and interact with human language. Their architecture, often based on **Transformers**, allows them to capture complex patterns and context within data, making them powerful tools for various applications, from chatbots to creative writing and scientific discovery. " ], "id": "" } } }, { "id": "cell-46", "source": "This is a mixed model that was pre-trained on the entire Human Genome. It contains approximately 17 million parameters and was trained using the Human Genome assembly GRCh38. Unlike models pre-trained on sequences of 200 bases, this model was pre-trained on sequences of 10,000 bases (10K). The advantage of this model is its ability to process larger DNA contexts or sequences. This capability allows it to capture more extensive patterns and dependencies within the genomic data.
\n\n\nQuestion\nBy looking at the output of:
\n\nmodel\n
\n
\n- How many transformer layers does this model have?
\n- Is it similar to previous model?
\n\n👁 View solution
\n\n\n
\n- 8 transformer layers
\n- Yes
\n
With this kind of model, we can convert a DNA sequence to a vector.
\nLet’s tokenize a DNA sequence, pass the token IDs through the model, and keep the per-token outputs (referred to below as the hidden states).
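A minimal sketch of these steps (the example sequence is illustrative and the variable names are assumptions):
\n\ndna_sequence = \"ATTGTGGGTCCCCGTAGATGATAGGGGCCCCCC\"  # illustrative sequence\ninput_ids = tokenizer(dna_sequence, return_tensors=\"pt\")[\"input_ids\"]\nhidden_states = model(input_ids)[0]  # per-token outputs for the single input sequence\n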
\nThe generated hidden states are the internal representations of the input sequence at different layers of the model. Here we look at the hidden neurons of the last layer. They capture contextual information about the sequence and provide a richer representation of the sequence compared to the raw nucleotide string, capturing contextual information that can be used for tasks such as sequence similarity analysis, functional prediction, variant impact analysis, and more.
\n\n\nQuestion\nWhat is the shape of
\nhidden_states
?\n👁 View solution
\n\n\n
[1, 17, 4096]
:\n
\n- \n
1
: number of sequences, here 1 DNA sequence- \n
17
: number of tokens (words); here the DNA sequence has been converted into 17 tokens, each covering more than one nucleotide- \n
4096
: size of the vocabulary, the number of possible tokens
We would now like to calculate the mean of the hidden states across a specific dimension, here for the first (and only) sequence in the batch (hidden_states[0]
):
dim=0
indicates that the mean is calculated across the sequence length dimension. This effectively averages the hidden states for each token position in the sequence, resulting in a single vector that represents the entire sequence.
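A minimal sketch, reusing the hidden_states tensor from above:
 embedding_mean = torch.mean(hidden_states[0], dim=0)  # average over the token positions, giving one vector of size 4,096\n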
\n\nQuestion\n\n
\n- What is the shape of
\nembedding_mean
?- Which type of data is in
\nembedding_mean
?\n👁 View solution
\n\n\n
\n- \n
4096
, the number of possible tokens.- \n
embedding_mean
is a vector of numerical values.
embedding_mean
is a numerical vector of size 4,096 that represents the average embedding of the DNA sequence. This fixed-size representation can be used for various downstream tasks, such as classification, clustering, or similarity comparisons.
\n\n\nHands On\nApply a max pooling instead of a mean pooling to summarize information along the DNA sequence.
\n👁 View solution
\n\n\nembedding_max = torch.max(hidden_states[0], dim=0)[0]\n
\n\nComment: Similar process to ChatGPT\nWhen you use a system like ChatGPT, the process involves converting your textual input, or “prompt,” into a numerical vector. This conversion is similar to the process we just did. Here’s how it works:
\n\n
\n- Input Prompt: You write a prompt, which is a textual query or statement.
\n- Tokenization: The prompt is tokenized, meaning it is broken down into smaller units, such as words or subwords, using a tokenizer.
\n- Vector Representation: These tokens are then converted into numerical vectors, or embeddings. These vectors capture the semantic meaning and context of the words in the prompt.
\n- Model Processing: The model processes these vectors to generate a response. The embeddings allow the model to understand the context and nuances of your input, enabling it to produce coherent and relevant responses.
\nThis process of converting text into numerical vectors is fundamental to how language models like ChatGPT operate, enabling them to interpret and generate human-like text based on the input they receive.
\n
This tutorial provides a comprehensive guide to preparing, training, and utilizing a pre-trained language model for DNA sequence analysis. It begins by setting up the necessary resources, including installing dependencies, importing Python libraries, and configuring computational resources. The tutorial then walks through loading and choosing an appropriate model architecture for DNA sequences, followed by setting up a tokenizer to convert DNA sequences into numerical tokens. Data preparation involves loading, tokenizing, splitting, and collating DNA sequences to ensure efficient model training. The training process is detailed with parameter definitions and pretraining steps, culminating in the calculation of DNA sequence embeddings.
\nWe can now leverage the pre-trained model in various bioinformatics applications, such as sequence similarity analysis and functional prediction, offering a robust foundation for integrative biological research.
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "cell_type": "markdown", "id": "final-ending-cell", "metadata": { "editable": false, "collapsed": false }, "source": [ "# Key Points\n\n", "- Efficient Model Training: By leveraging parameter-efficient fine-tuning techniques and distributed training strategies, it is possible to train large language models on DNA sequences using consumer-grade hardware, making advanced bioinformatics research more accessible.\n", "- Importance of Data Preparation: Properly tokenizing and organizing DNA sequence data is crucial for effective model training and evaluation, as it directly impacts the model's ability to learn and generalize from the data.\n", "- Practical Applications of Embeddings: The embeddings generated by a trained language model capture rich contextual information about DNA sequences, enabling a wide range of downstream applications, from sequence classification to functional prediction in genomics research.\n", "\n# Congratulations on successfully completing this tutorial!\n\n", "Please [fill out the feedback on the GTN website](https://training.galaxyproject.org/training-material/topics/statistics/tutorials/genomic-llm-pretraining/tutorial.html#feedback) and check there for further resources!\n" ] } ] }