Introduction

Recent years have seen deep learning (DL) models achieve remarkable proficiency in complex computational tasks, including protein structure prediction [1], strategic reasoning [2], and natural language generation [3], areas previously thought to be the exclusive domain of human intelligence. Traditional (symbolic) programming allows functions like $f(x, y) = \cos(ax) + \sin(by)$ to be implemented in code with clear typographical isomorphism, meaning the code's structure directly mirrors the mathematical notation. For example, in the language Haskell: f x y = cos (a * x) + sin (b * y). In contrast, DL models are inherently sub-symbolic, meaning that the models' atomic constituents (often 32-bit floating-point numbers centered around 0) do not map directly to mathematical vocabulary. For reference, a DL-based implementation of the aforementioned function is sketched below. Indeed, the increasing prevalence of DL can be understood as a transition from symbolic to sub-symbolic algorithms.
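To make the contrast concrete, the following is a minimal, hypothetical sketch (not the exact implementation originally referenced) of how such a function could be approximated sub-symbolically: a small multi-layer perceptron in Python with JAX, fitted to samples of $f$ by gradient descent. The architecture, sizes, and hyper-parameters are illustrative assumptions; the point is only that the resulting weights bear no typographical resemblance to the cosine and sine terms they approximate.

import jax
import jax.numpy as jnp

a, b = 2.0, 3.0                                    # assumed constants of the target
f = lambda x, y: jnp.cos(a * x) + jnp.sin(b * y)   # the symbolic ground truth

def init(key, sizes=(2, 64, 64, 1)):
    # Sub-symbolic parameters: a list of (weight matrix, bias vector) pairs.
    keys = jax.random.split(key, len(sizes) - 1)
    return [(jax.random.normal(k, (m, n)) / jnp.sqrt(m), jnp.zeros(n))
            for k, m, n in zip(keys, sizes[:-1], sizes[1:])]

def mlp(params, xy):
    h = xy
    for W, c in params[:-1]:
        h = jnp.tanh(h @ W + c)                    # non-linearity between linear maps
    W, c = params[-1]
    return (h @ W + c).squeeze(-1)

def loss(params, xy, target):
    return jnp.mean((mlp(params, xy) - target) ** 2)

@jax.jit
def step(params, xy, target, lr=1e-2):
    grads = jax.grad(loss)(params, xy, target)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

k1, k2 = jax.random.split(jax.random.PRNGKey(0))
params = init(k1)
xy = jax.random.uniform(k2, (1024, 2), minval=-jnp.pi, maxval=jnp.pi)
target = f(xy[:, 0], xy[:, 1])

for _ in range(2000):                              # plain full-batch gradient descent
    params = step(params, xy, target)
print(loss(params, xy, target))                    # the fit improves, yet no single weight "is" cos or sin

None of the learned numbers corresponds to $a$, $b$, cosine, or sine; the function is present only in the aggregate behavior of the weights.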
Precursors to modern DL methods learned how to weigh human-designed features [4], with later works learning to create features from data and subsequently weigh them [5], [6], in combination with tree search strategies in the case of games [7]. Very recent DL work has even eliminated tree search in the case of chess, mapping directly from observation space to action space [8]. Pure DL methods are thus becoming ubiquitous but remain largely inscrutable, with recent works still attempting to define what interpretability even means in the DL context [9]. Given the sub-symbolic nature of DL models, it is unsurprising that their interpretation remains difficult.

Mathematically, DL refers to a set of methods that combine linear maps (matrix multiplications) with non-linearities (activation functions). Formally, all the potential numerical values of a given model's weights $W$ can be thought of as a hypothesis space $\mathcal{H}$. Often, $\mathcal{H}$ is determined by human decisions (number of layers, kinds of layers, sizes of layers, etc.). $\mathcal{H}$ is then navigated using some optimization heuristic, such as gradient descent, in the hope of finding a $W$ that "performs well" (i.e., successfully minimizes some loss, typically computed by a function that is differentiable with respect to $W$) on whatever training data is present. This vast, sub-symbolic hypothesis space, while enabling impressive performance and the solving of relatively exotic¹ tasks, makes it challenging to understand how any one particular solution actually works (i.e., a black-box algorithm).

The ways in which a given model can minimize its loss can be placed on a continuum: on one side, we have overfitting, remembering the training data (i.e., functioning as an archive, akin to lossy or even lossless compression); on the other, we have generalization, learning the rules that govern the relationship between input and output (i.e., functioning as an algorithm).

Attempting to give a mechanistic explanation of a given DL model's behavior necessarily entails the existence of a mechanism. Mechanistic interpretability (MI) assumes this mechanism to be general, thus making generalization a necessary (though insufficient) condition. Generalization ensures that there is a mechanism/algorithm present to be uncovered (necessity); however, it is possible for that algorithm to be so obscurely implemented that reverse engineering it is, for all intents and purposes, impossible (insufficiency).

Various forms of regularization are used to incentivize the emergence of algorithmic (generalizing) and interpretable behavior, rather than archiving (over-fitted) behavior [10], [11], [12].

As of yet, no MI work has explored the effect of multi-task learning, the focus of this paper. Multi-task learning also has a regularizing effect [13]. Formally, the set of hypothesis spaces, one per task in a set of tasks (often called an environment), is denoted $\{\mathcal{H}_t\}$. When minimizing the losses across all tasks in parallel, generalizing $W$'s are incentivized, as these help lower the loss across tasks (in contrast to memorizing $W$'s, which lower the loss for a single task). A $W$ derived from a multi-task training process can thus be thought of as lying in the intersection of the high-performing regions of all the $\mathcal{H}_t$.

In this spirit, the present paper builds on the work of Nanda et al. (2023), which trains a transformer [15] model to perform modular addition, as seen in Eq. 1. The task is denoted 𝒯nanda throughout the paper.

$(x_0 + x_1) \bmod p, \quad x_0, x_1 < p, \quad p = 113 \qquad (1)$

The task of this paper instead focuses on predicting remainders modulo all primes $q$ less than $p$, where the input $x$ is interpreted as $x_0 p^0 + x_1 p^1$. It is formally shown in Eq. 2 and referred to as 𝒯miiii:

$(x_0 p^0 + x_1 p^1) \bmod q, \quad x_0, x_1 < p, \quad q < p, \quad p = 113 \qquad (2)$

𝒯miiii differentiates itself from 𝒯nanda in two significant ways: 1) it is non-commutative, and 2) it is, as mentioned, multi-task. These differences present unique challenges for mechanistic interpretation, as the model must both handle the order-dependent nature of the inputs and develop shared representations across multiple modular arithmetic tasks. Further, as 𝒯miiii is harder than 𝒯nanda, the model can be expected to generalize more slowly when trained on the former. Therefore, Lee et al. (2024)'s recent work on speeding up generalization, which posits that the gradients of the model parameters through time can be viewed as a sum of 1) a slow-varying, generalizing component (which is boosted) and 2) a quick-varying, overfitting component (which is suppressed), is (successfully) replicated here to make training tractable.

Figure 1: Visualizing natural numbers less than 12769 in polar coordinates $(n, n \bmod 2\pi)$. Left: union of numbers with remainder 0 mod 17 and 23 (see the two spirals). Middle: numbers with remainder 0 mod 11. Right: prime numbers. It is shown here to encourage the reader to think in periodic terms.

More generally, modular arithmetic on primes is a particularly useful task for MI as it ensures uniformity among the output classes, allows for comparison with other MI work [14], and, from a number-theoretic point of view, primes contain mysteries ranging from the trivially solved (are there infinitely many primes?) to the deceptively difficult (can every even number greater than 2 be described as the sum of two primes?). The latter, known as Goldbach's Conjecture, remains unsolved after centuries. The choice of using every prime less than the square root of the largest number in the dataset also serves the following purpose: to test whether a given natural number is prime, it suffices to test that it is not a multiple of any prime less than its square root. The set of tasks trained for here can thus be viewed, in conjunction, as a single prime-detection task (primes are the only samples whose target vector contains no zeros, since a prime is not a multiple of any of the factors $q$).
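This prime-detection framing can be checked directly. The short Python sketch below (illustrative; not code from the paper) builds the target remainders for every $n < p^2$ with $p = 113$ and confirms that, for $n$ larger than the greatest factor, $n$ is prime exactly when none of its remainders modulo the primes $q < p$ is zero.

from math import isqrt

p = 113

def is_prime(n):
    # plain trial division by every d <= sqrt(n)
    return n > 1 and all(n % d for d in range(2, isqrt(n) + 1))

qs = [q for q in range(2, p) if is_prime(q)]   # the 29 task moduli
assert len(qs) == 29

def remainders(n):
    # target vector for sample n: its remainder modulo every prime q < p
    return [n % q for q in qs]

# For n above the largest factor (109), "no zero in the target vector" is
# equivalent to n being prime, since any composite n < p**2 has a prime
# factor no larger than sqrt(n) < p.
for n in range(qs[-1] + 1, p ** 2):
    assert (0 not in remainders(n)) == is_prime(n)

(For the factors themselves the equivalence does not hold, since $q \bmod q = 0$; the argument concerns the numbers above 109.)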
There are about $n / \ln(n)$ primes less than $n$. To provide insight into the periodic structure of these remainders for natural numbers less than 12769 (and to motivate thinking in rotational terms), Figure 1 visualizes various modular patterns in polar coordinates $(n, n \bmod 2\pi)$. One could imagine tightening or loosening the spiral by multiplying $2\pi$ by a constant so as to align the multiples of a given number in a straight line (imagining this is encouraged).

Background and related work

Multiple papers describe the use of deep learning to detect prime numbers [17], [18], [19]. None are particularly promising as prime-detection algorithms: they provide no speedups, use more memory, and are less accurate than traditional methods. However, in exploring the foundations of deep learning, the task of prime detection is interesting, as it is a simple task that is difficult to learn, and it is synthetic, meaning that arbitrary amounts of data can be generated by a simple algorithm.

Mechanistic Interpretability (MI)

MI is a relatively new field focused on reverse-engineering the internal mechanisms of neural networks. Lipton (2018) explored different definitions of interpretability in this context. MI can be contrasted with other forms of interpretability, such as feature-importance analysis: while feature importance measures correlations between inputs and outputs (e.g., red pixels correlating with "rose" classifications), MI aims to understand how the model actually processes information (i.e., the mechanism).

Methods and tools used so far in MI include activation visualization across ordered samples, singular value decomposition of weight matrices, and ablation studies to identify critical circuits. Conmy et al. (2023) even successfully automate circuit² discovery. Many reverse-engineering methods from other fields, such as computational neuroscience or signal processing, almost certainly have their uses here as well.

In spite of deep learning's practical successes, uncertainty remains about its theoretical underpinnings, echoing the interpretability debate. Recent work attempts to place different DL architectures and concepts in a geometric [21], information-theoretic [22], or even category-theoretic [23] context. However, no unified theory has emerged. Much interesting deep learning research thus focuses on practical, simple, or algorithmic tasks with known solutions and architectures. For example, grokking [24], the (relatively) sudden generalization after overfitting, elaborated on later, is a recent and practical discovery.

Case study: modular addition

One such practical discovery was made by Nanda et al. (2023). A single-layer transformer model with ReLU activation function was trained to perform modular addition (𝒯nanda). Nanda et al. (2023)'s analysis of their trained model exemplifies MI methodology. They discovered that: 1) the embedding layer learns trigonometric lookup tables of sine and cosine values, as per Eq. 3; 2) the feed-forward network combines these through multiplication and trigonometric identities (Eq. 4); and 3) the final layer performs the equivalent of an argmax (Eq. 5).

$x_0 \to \sin(w x_0), \cos(w x_0) \qquad (3.1)$
$x_1 \to \sin(w x_1), \cos(w x_1) \qquad (3.2)$

$\sin(w(x_0 + x_1)) = \sin(w x_0)\cos(w x_1) + \cos(w x_0)\sin(w x_1) \qquad (4.1)$
$\cos(w(x_0 + x_1)) = \cos(w x_0)\cos(w x_1) - \sin(w x_0)\sin(w x_1) \qquad (4.2)$

$\mathrm{Logit}(c) \propto \cos(w(x_0 + x_1 - c)) \qquad (5.1)$
$= \cos(w(x_0 + x_1))\cos(wc) + \sin(w(x_0 + x_1))\sin(wc) \qquad (5.2)$
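To see why Eq. 5 recovers the correct residue, note that $\cos(w(x_0 + x_1 - c))$ attains its maximum exactly when $c \equiv x_0 + x_1 \pmod{p}$, for any frequency $w = 2\pi k / p$ with $k$ not a multiple of $p$. The numerical check below is an illustrative sketch (not the paper's code, and the frequencies $k$ are arbitrary choices): it constructs such logits directly and confirms that their argmax equals $(x_0 + x_1) \bmod p$.

import numpy as np

p = 113
ks = [3, 17, 41]                          # arbitrary example frequencies (assumed)
ws = [2 * np.pi * k / p for k in ks]

rng = np.random.default_rng(0)
for _ in range(1000):
    x0, x1 = rng.integers(0, p, size=2)
    c = np.arange(p)
    # Eq. 5.1: one cosine "logit" term per frequency, summed over frequencies
    logits = sum(np.cos(w * (x0 + x1 - c)) for w in ws)
    assert logits.argmax() == (x0 + x1) % p

In the trained model, per Nanda et al. (2023), a handful of learned frequencies play the role of ks here, and the final layer realizes the expansion of Eq. 5.2 rather than evaluating the cosine directly.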
Generalization and grokking

Power et al. (2022) show that generalization can happen "[...] well past the point of overfitting", dubbing the phenomenon "grokking". The phenomenon is now well established [14], [25], [26]. Nanda et al. (2023) show that a generalized circuit "arises from the gradual amplification of structured mechanisms encoded in the weights", rather than from a relatively sudden and stochastic encounter with an appropriate region of $\mathcal{H}$. The important word of the quote is thus "gradual".

By regarding the series of gradients over time as a stochastic signal, Lee et al. (2024) propose decomposing that signal. Conceptually, Lee et al. (2024) argue that, in the case of gradient descent, the ordered sequence of gradient updates can be viewed as consisting of two components: 1) a fast-varying, overfitting component, and 2) a slow-varying, generalizing component. The general algorithm explaining the relationship between input and output is the same for all samples, whereas the weight adjustments that memorize a given sample are unique to that sample. Though not proven, this intuition bears out in that generalization is sped up fifty-fold in some cases.

This echoes the idea that generalized circuits undergo gradual amplification [14]. To the extent that this phenomenon is widespread, it bodes well for generalizable DL: the generalizing signal that one would want to amplify might exist long before the model is fully trained and could potentially be boosted in a targeted way by the method described by Lee et al. (2024).

Perhaps the most widespread loss functions in deep learning are mean cross-entropy, Eq. 6.1 (for classification), and mean squared error, Eq. 6.2 (for regression).

$L_{\mathrm{MCE}} = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij} \ln\!\left(\frac{1}{\hat{y}_{ij}}\right) \qquad (6.1)$

$L_{\mathrm{MSE}} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (6.2)$

These have various computational and mathematical properties that make them convenient to use, though they have been shown to struggle with generalizing to out-of-distribution data [27], [22].

Multi-task learning in deep learning

As stated, multi-task learning has been shown to have a regularizing effect [13], [28], as a hypothesis $W$ that performs well across all of the hypothesis spaces $\mathcal{H}_t$ is more likely to be general. Viewed information-theoretically, this concept is reminiscent of Shannon (2001)'s asymptotic equipartition property [30], or, even more generally, of the law of large numbers: the more samples we have of a distribution, the closer our estimates align with its true underlying properties.

In the context of 𝒯miiii, multi-task learning is done by having the last layer output predictions for all tasks in parallel. Thus, whereas 𝒯nanda outputs a single one-hot $1 \times 113$ vector over the potential remainders, 𝒯miiii, as we shall see, outputs a $1 \times q$ vector for each prime $q < p$ (i.e., 29 output-task vectors when $p = 113$). The embedding layer and the transformer block are thus shared across all tasks, meaning that representations that perform well across tasks are incentivized.
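As a concrete illustration of this parallel readout, the sketch below (written under assumed shapes; it is not the paper's model code) unembeds a shared representation once per task, yielding one logit vector of length $q$ per prime:

import jax
import jax.numpy as jnp

p, d = 113, 256
qs = [q for q in range(2, p) if all(q % i for i in range(2, q))]   # the 29 primes < p

key = jax.random.PRNGKey(0)
keys = jax.random.split(key, len(qs) + 1)
z = jax.random.normal(keys[0], (d,))            # stand-in for the shared representation

# One unembedding matrix per task, mapping d -> q logits (illustrative initialization).
heads = {q: jax.random.normal(k, (d, q)) / jnp.sqrt(d) for q, k in zip(qs, keys[1:])}

logits = {q: z @ W for q, W in heads.items()}   # 29 parallel predictions
print({q: v.shape for q, v in list(logits.items())[:4]})   # {2: (2,), 3: (3,), 5: (5,), 7: (7,)}

Only these per-task readouts differ between tasks; the single embedding-plus-transformer-block stack that produces z is shared, which is what incentivizes representations serving all 29 tasks at once.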
Transformer architecture

Transformers combine self-attention (a communication mechanism) with feed-forward layers (a computation mechanism). The original transformer block [15] used extensive regularization: layer norm [10], dropout, weight decay, and residual connections are all integral components of the original architecture, though recent years have seen simplifications yielding similar performance [31], [32].

Input tokens are embedded into a $d$-dimensional space using learned token and positional embeddings:

$z = \mathrm{TokenEmbed}(x) + \mathrm{PosEmbed}(\mathrm{pos}) \qquad (7)$

Each transformer block comprises multi-head attention:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V \qquad (8)$

where $Q$, $K$, and $V$ are linear projections of the input. Attention heads are combined through addition rather than concatenation (a transformer-specific detail chosen to align with Nanda et al. (2023)). This is followed by a feed-forward network with ReLU activation:

$\mathrm{FFN}(z) = \mathrm{ReLU}(z W_{\mathrm{in}}) W_{\mathrm{out}} \qquad (9)$

mapping through $d \to 4d \to d$ dimensions, before finally:

$\hat{y} = z W_{\mathrm{unembed}} \qquad (10)$

Each component includes residual connections and dropout.

Methods

How exactly a given model implements an algorithm is a non-trivial question; even modular addition is implemented in a relatively obscure way [14], as per Eq. 3, Eq. 4, and Eq. 5.

This investigation probes the fundamental algorithmic structures internalized by a transformer model trained on a set of basic prime-number-related modular arithmetic tasks with slight variations in complexity. This approach provides insights into how and why specific algorithmic patterns emerge from seemingly straightforward learning processes.

As stated, the setup here differentiates itself from 𝒯nanda in two crucial ways: 1) it is non-commutative; and 2) it is multi-task.

Tasks

Stated plainly: the task 𝒯miiii predicts the remainder when dividing a two-digit base-$p$ number by each prime $q$ less than $p$. The set of prime factors we construct tasks for is thus $\{q : q \text{ prime},\ q < p\}$. For $p = 113$, this yields 29 parallel tasks, one for each prime less than $p$. Each task predicts a remainder in the range $[0, q - 1]$. This means smaller primes like 2 and 3 require binary and ternary classification, respectively, while the largest prime less than $p$, 109, requires predictions across 109 classes. The tasks thus naturally vary in difficulty: predicting mod 2 requires distinguishing odd from even numbers (which in binary amounts to looking at the last bit), while predicting mod 109 involves selecting between many relatively similar classes. From an information-theoretical perspective, the expected cross-entropy for an $n$-class problem is $\ln(n)$, which has implications for the construction of the loss function, further discussed in the training section below.

Additionally, a baseline task 𝒯basis was constructed by shuffling the $y$-labels of 𝒯miiii, and a task-ablation test 𝒯masked was constructed by masking away the four simplest tasks, $q \in \{2, 3, 5, 7\}$.

Data

Input Space ($X$): Each input $x \in X$ represents a number in base $p$ using two digits, $(x_0, x_1)$, where the represented number is $x_0 p^0 + x_1 p^1$. For example, with $p = 11$, the input space consists of all pairs $(x_0, x_1)$ where $x_0, x_1 < 11$, representing numbers up to $11^2 - 1 = 120$. This yields a dataset of 121 samples. Figure 2 visualizes this input space, with each cell representing the value $x_0 p^0 + x_1 p^1$.

Figure 2: Visualizing X (for a small dataset where $p = 11$). Each cell represents the tuple $(x_0, x_1)$. The top left shows 0 as (0, 0), and the bottom right shows 120 as (10, 10), both in base 11.

Output Space ($Y$): For each input $x$, a vector $y \in Y$ contains the remainder when dividing by each prime less than $p$. For $p = 11$, this means predicting the remainder when dividing by 2, 3, 5, and 7. Each element $y_i$ ranges from 0 to $q_i - 1$, where $q_i$ is the $i$-th prime. Figure 3 visualizes these remainders, with each subplot showing the remainder pattern for a specific prime divisor. For comparison, the rightmost plot shows the output space of [14]'s modular addition task.

Figure 3: Visualizing tasks in Y (for $p = 11$). $x_0$ and $x_1$ vary along the two axes, with the remainder modulo $q \in \{2, 3, 5, 7\}$ indicated by the square size. Note the innate periodicity of the modulo operator.
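The construction of $X$ and $Y$ for the small $p = 11$ example can be written out in a few lines. The sketch below is illustrative (not the paper's data pipeline), and the array layout is an assumption:

import jax.numpy as jnp
from math import isqrt

p = 11
qs = [q for q in range(2, p) if all(q % d for d in range(2, isqrt(q) + 1))]   # [2, 3, 5, 7]

# X: all two-digit base-p pairs (x0, x1); shape (p*p, 2)
X = jnp.array([(x0, x1) for x1 in range(p) for x0 in range(p)])
n = X[:, 0] * p ** 0 + X[:, 1] * p ** 1          # the represented numbers 0..120

# Y: one remainder column per task q; shape (p*p, len(qs))
Y = jnp.stack([n % q for q in qs], axis=1)

print(X.shape, Y.shape)     # (121, 2) (121, 4)
print(X[12], n[12], Y[12])  # x0=1, x1=1 -> n=12 -> remainders [0, 0, 2, 5]

For $p = 113$ the same construction yields the full 12769-sample dataset with 29 remainder columns.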
Model

The model follows the original transformer architecture [15] with several key design choices aligned with recent work on mechanistic interpretability [14], [16]: biases are disabled, and layer normalization is not used. The model consists of three main components: an embedding layer, a transformer block, and an output layer. All weights are initialized following He et al. (2015). The model processes vectors of the kind seen in Eq. 11, writing the eventual result to the last position.

$[\,x_0 \quad x_1 \quad \hat{y}\,] \qquad (11)$

Training

Hyper-parameter optimization was conducted using Optuna [34], searching over the space shown in Table 1.

dropout: 0, 1/2, 1/5, 1/10
λ: 0, 1/2, 2
wd: 0, 1/10, 1/2, 1
d: 128, 256
lr: 3e-4, 1e-4
heads: 4, 8

Table 1: Hyper-parameter search space for training.

The model is trained using AdamW [35] with $\beta_1 = 0.9$, $\beta_2 = 0.98$, following Nanda et al. (2023). To handle the varying number of classes across tasks (from 2 classes for mod 2 to 109 classes for mod 109), a modified (weighted) mean cross-entropy loss (Eq. 6.1) is used, correcting for the difference in expected loss between tasks. Note that $\mathbb{E}[L_{\mathrm{MCE}}] = \ln(q)$ for a task with $q$ classes, since a uniform prediction assigns probability $1/q$ to the correct class. Correcting for this, the loss function becomes as shown in Eq. 12.

$L_{\mathcal{T}_{\mathrm{miiii}}} = \sum_{q} \frac{L_{\mathrm{MCE}}^{(q)}}{\ln(q)} = \sum_{q} \frac{1}{n \ln(q)} \sum_{i=1}^{n} \sum_{j=0}^{q-1} y_{ij}^{(q)} \ln\!\left(\frac{1}{\hat{y}_{ij}^{(q)}}\right) \qquad (12)$

To accelerate generalization, gradient filtering as per Lee et al. (2024) is implemented and replicated:

$g_t = \nabla_\theta L + \lambda \left(\alpha e_{t-1} + (1 - \alpha)\, g_{t-1}\right) \qquad (13)$

where $e_t$ is the exponential moving average of the gradients with decay rate $\alpha = 0.98$, and $\lambda$ controls the influence of the slow-varying component.

Training uses full-batch gradient descent on the entire dataset of $p^2$ samples (12769 when $p = 113$). The model is evaluated on a held-out validation set after each epoch, tracking per-task accuracy and loss. As in the setup used for 𝒯nanda, training was done on thirty percent of the total dataset, with the remainder used for validation (1000 samples) and testing (the rest). Further, as 𝒯miiii involves learning 29 tasks (when $p = 113$) rather than one, and because the task is non-commutative, a larger hidden dimension of 256 was added to the hyper-parameter search space, as well as the option of 8 heads (𝒯nanda was solved with a hidden dimension of 128 and 4 heads). The number of transformer blocks was kept at 1, as this ensures consistency with 𝒯nanda (and, as we shall see in the results, full generalization was still possible).

Training was done on an NVIDIA GeForce RTX 4090 GPU, with Python 3.11 and extensive use of JAX 0.4.35 and its associated ecosystem. Neuron activations were calculated at every training step and logged for later analysis.
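The two training-specific ingredients, the $\ln(q)$-normalized multi-task loss (Eq. 12) and the gradient filter (Eq. 13), can be sketched as follows. This is an illustrative JAX sketch under assumed shapes and names (the dictionaries of logits and labels are placeholders), not the paper's training code; in particular, the bookkeeping for $e_t$ follows one natural reading of its definition as an exponential moving average of the gradients.

import jax
import jax.numpy as jnp

alpha, lam = 0.98, 0.5        # EMA decay and slow-component weight (lambda)
qs = (2, 3, 5, 7)             # task moduli (the full run uses all 29 primes < 113)

def miiii_loss(logits, y):
    # logits: dict q -> (n, q) scores; y: dict q -> (n,) integer remainders
    total = 0.0
    for q in qs:
        logp = jax.nn.log_softmax(logits[q], axis=-1)                       # (n, q)
        mce = -jnp.mean(jnp.take_along_axis(logp, y[q][:, None], axis=-1))  # Eq. 6.1
        total += mce / jnp.log(q)                                           # Eq. 12: divide by ln(q)
    return total

def filter_grads(raw_grad, ema, prev):
    # Eq. 13: boost the slow-varying component of the gradient signal
    g = jax.tree_util.tree_map(
        lambda dL, e, gp: dL + lam * (alpha * e + (1 - alpha) * gp),
        raw_grad, ema, prev)
    # keep e as an exponential moving average of the raw gradients (decay alpha)
    ema = jax.tree_util.tree_map(lambda e, dL: alpha * e + (1 - alpha) * dL, ema, raw_grad)
    return g, ema

# quick check of the loss scale: random logits give roughly 1 per task
key = jax.random.PRNGKey(0)
logits = {q: jax.random.normal(key, (8, q)) for q in qs}
y = {q: jnp.arange(8) % q for q in qs}
print(miiii_loss(logits, y))   # approximately len(qs) = 4 at random initialization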
Visualization

Much of the data worked with here is inherently high-dimensional. For training, for example, we have $n$ steps, two splits (train/valid), about $p / \ln(p)$ tasks, and two metrics (accuracy and loss). This, along with the inherent opaqueness of deep learning models, motivated the development of a custom visualization library, esch³, to visualize attention weights, intermediate representations, training metrics, and more. To familiarize the reader with visualizing the inner workings of a trained model, an essential plot type to keep in mind is shown in Figure 4. As there are only 12769 samples when $p = 113$, all samples can be fed to the model at once. Inspecting a specific activation thus yields a $1 \times 12769$ vector $v$, which can be reshaped into a $113 \times 113$ matrix, with the two axes, $x_0$ and $x_1$, each varying from 0 to 112. The top-left corner then shows the given value for the sample $(0 \cdot p^0 + 0 \cdot p^1)$, and so on.

Figure 4: Plotting a neuron: (left) the activation of a particular neuron as $x_0$ and $x_1$ vary from 0 to $p$; (right) the same, processed with a fast Fourier transform to reveal the active frequencies ($\omega$).

Note that in esch plots, when appropriate, only the top-left $37 \times 37$ slice is shown so as not to overwhelm the reader.

Mechanistic interpretability process

Recall that a combination of linear maps is itself a linear map. Therefore, as a mechanistic interpretability rule of thumb, one should look at the outputs of the non-linear transformations. In our case, these are the attention weights and the intermediate representations within the transformer block's feed-forward layer (which follow the ReLU activation). Additionally, the embedding layers are inspected using Fourier analysis and singular value decomposition.

As mentioned above, our interpretability approach combines activation visualization with frequency analysis to understand the learned algorithmic patterns. Following Nanda et al. (2023), we analyze both the attention patterns and the learned representations through several lenses.

Attention visualization

esch, the custom visualization library, is used to visualize attention weights and intermediate representations. The library allows for the visualization of attention patterns across different layers, as well as of the intermediate representations at each layer. These visualizations provide insight into the learned patterns and help identify potential areas of improvement.

The fast Fourier transform

As periodicity is established by Nanda et al. (2023) as a fundamental feature of the model trained on 𝒯nanda, the fast Fourier transform (FFT) algorithm is used to detect which frequencies are in play. Note that any square image can be described as a sum of 2D sine and cosine waves varying in frequency from 1 to half the size of the image (plus a constant); this is a fundamental tool in signal processing, and the theory is briefly outlined for reference. This analysis helps identify the dominant frequencies in the model's computational patterns. Recall that a vector can be described as a linear combination of periodic vectors, as per the discrete Fourier transform.

The default basis of the one-hot encoded representation of the input is thus the identity matrix. This can be projected into a Fourier basis by multiplying with the discrete Fourier transform (DFT) matrix.
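One step of this analysis can be sketched as follows (illustrative code using a synthetic activation rather than one from the trained model): lay an activation out on the $113 \times 113$ grid described above (here synthesized directly on the grid) and take its 2D FFT; periodic structure then shows up as a few large coefficients.

import jax.numpy as jnp

p = 113
x0 = jnp.arange(p)[:, None]          # rows: x0 = 0..112
x1 = jnp.arange(p)[None, :]          # cols: x1 = 0..112

# Synthetic "neuron": periodic in both inputs at frequency k (an assumption,
# mimicking the cosine structure reported for the real activations).
k = 7
act = jnp.cos(2 * jnp.pi * k * x0 / p) + jnp.cos(2 * jnp.pi * k * x1 / p)

spectrum = jnp.abs(jnp.fft.fft2(act))          # (113, 113) magnitudes
top = jnp.argsort(spectrum.ravel())[-4:]       # indices of the largest coefficients
print(jnp.stack(jnp.unravel_index(top, spectrum.shape), axis=1))
# the dominant coefficients sit at (0, k), (0, p-k), (k, 0), and (p-k, 0)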
Results and analysis

Hyper-parameter optimization

The best-performing hyper-parameters for training the model on 𝒯miiii are listed in Table 2. Notably, the model did not converge when $\lambda = 0$, confirming the utility of the gradient amplification method proposed by Lee et al. (2024) in the context of 𝒯miiii.

dropout: 1/10
λ: 1/2
wd: 1/3
d: 256
lr: 3×10⁻⁴
heads: 4

Table 2: Result of the hyper-parameter search over 𝒯miiii.

Model Performance

Figure 5 shows the training and validation accuracy on 𝒯miiii over time. The model achieves a perfect accuracy of 1 on the validation set across all 29 tasks. The cross-entropy loss in Figure 6 echoes this. In short, and to use the terminology of Power et al. (2022), the model "grokked" on all tasks. Interestingly, the tasks corresponding to modulo 2, 3, 5, and 7 generalized in succession, while the remaining 25 tasks generalized around epoch 40000 in no particular order. This might suggest that the model initially learned solutions for the simpler tasks and later developed a more general computational strategy that allowed it to generalize across the remaining, more complex tasks.

Figure 5: Accuracy training "curves": training (top) and validation (bottom) accuracy over time ($x$-axis in log scale). We see grokking occur on all tasks, first for $q \in \{2, 3, 5, 7\}$ in that order, and then for the remaining 25 in no particular order.

Figure 6: Cross-entropy (Eq. 6.1) loss on training (top) and validation (bottom) over time (note the log scale on the $x$-axis).

Embeddings

Positional embeddings play a crucial role in transformers by encoding the position of tokens in a sequence. Figure 7 compares the positional embeddings of models trained on 𝒯nanda and 𝒯miiii.

For 𝒯nanda, which is a commutative task, the positional embeddings of the two input positions are virtually identical, with a Pearson correlation of 0.95, reflecting that the position of input tokens does not significantly alter their contribution to the task. In contrast, for 𝒯miiii, the positional embeddings have a Pearson correlation of −0.64, indicating that the embeddings for the two positions differ. This difference is expected given the non-commutative nature of the task, where the order of $x_0$ and $x_1$ matters ($x_0 p^0 \neq x_0 p^1$). This confirms that the model encodes position information appropriately for solving the tasks.

Figure 7: Positional embeddings for $(x_0, x_1)$ for models trained on 𝒯nanda (top) and 𝒯miiii (bottom). Pearson's correlation is 0.95 and −0.64, respectively. This reflects the commutativity of 𝒯nanda and the lack thereof for 𝒯miiii. Hollow cells indicate negative numbers.

Recall that a matrix $\mathbf{M}$ of size $m \times n$ can be decomposed into its singular values, $\mathbf{M} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{T}$ (with the transpose being the complex conjugate when $\mathbf{M}$ is complex), where $\mathbf{U}$ is $m \times m$, $\boldsymbol{\Sigma}$ is an $m \times n$ rectangular diagonal matrix (whose diagonal is represented as a flat vector throughout this paper), and $\mathbf{V}^{T}$ is an $n \times n$ matrix. Intuitively, this can be thought of as rotating in the input space, then scaling, and then rotating in the output space.

Figure 8 displays the singular values of the token embeddings learned for 𝒯nanda and 𝒯miiii. The singular values for 𝒯miiii are more diffuse, indicating that a larger number of components is needed to capture the variance in the embeddings compared to 𝒯nanda. This suggests that the token embeddings for 𝒯miiii encode more complex information, reflecting the increased complexity of the multi-task learning scenario.

Figure 8: First 83 of 113 singular values (truncated for clarity) of U for 𝒯nanda (top) and 𝒯miiii (bottom). The ticks indicate the points where 50% and 90% of the variance is accounted for. We thus see that for 𝒯miiii, the embedding space is much more crammed.

Figure 9: 𝒯nanda's most significant singular vectors of U from the singular value decomposition (cutoff at 0.5, as per Figure 8). Note that these look periodic!

Figure 10: 𝒯miiii's most significant vectors of U. Note that, as in Figure 9, we still observe periodicity, but there are more frequencies in play, as further explored in Figure 12.

Figure 9 and Figure 10 present the most significant singular vectors of U for 𝒯nanda and 𝒯miiii, respectively. Visual inspection shows periodicity in the top vectors for both models, but the 𝒯miiii model requires more vectors to capture the same amount of variance, consistent with the diffuse singular values observed in Figure 8.
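The variance accounting behind Figure 8 can be reproduced in a few lines. The sketch below runs on a random stand-in matrix (the shapes and the 50%/90% thresholds are taken from the text; everything else is an assumption):

import jax
import jax.numpy as jnp

p, d = 113, 256
W_E = jax.random.normal(jax.random.PRNGKey(0), (p, d))   # stand-in token embedding matrix

U, S, Vt = jnp.linalg.svd(W_E, full_matrices=False)      # S holds the singular values
var = S ** 2 / jnp.sum(S ** 2)                           # variance explained per component
cum = jnp.cumsum(var)

# number of components accounting for 50% and 90% of the variance (cf. the ticks in Figure 8)
print(int(jnp.searchsorted(cum, 0.5)) + 1, int(jnp.searchsorted(cum, 0.9)) + 1)

A low count indicates a compact embedding, as reported for 𝒯nanda; a higher count corresponds to the more diffuse spectrum reported for 𝒯miiii.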
To further understand the structure of the token embeddings, we applied the fast Fourier transform (FFT). Only a few frequencies are active for 𝒯nanda, as seen in Figure 11, consistent with the model implementing a cosine-sine lookup table as described in Nanda et al. (2023).

For the 𝒯miiii model, we observe a broader spectrum of active frequencies (Figure 12). This is expected, as the model has to represent periodicity corresponding to 29 primes.

Comparing with 𝒯basis in Figure 13, the periodicity is understood to be a structure inherent to the data that is picked up by the model.
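The projection into a Fourier basis mentioned earlier can be sketched as follows: multiply a (here synthetic) embedding matrix by a real DFT basis over the token dimension and measure the energy of each frequency component. The planted frequencies, sizes, and normalization are assumptions for illustration; the trained embeddings are what Figures 11, 12, and 13 analyze.

import jax.numpy as jnp

p, d = 113, 256

def fourier_basis(p):
    # rows: the constant vector, then cos and sin at frequencies 1 .. (p - 1) / 2
    n = jnp.arange(p)
    rows = [jnp.ones(p) / jnp.sqrt(p)]
    for k in range(1, p // 2 + 1):
        rows.append(jnp.cos(2 * jnp.pi * k * n / p) * jnp.sqrt(2 / p))
        rows.append(jnp.sin(2 * jnp.pi * k * n / p) * jnp.sqrt(2 / p))
    return jnp.stack(rows)                      # (p, p), orthonormal for odd p

# Synthetic embedding with two planted frequencies (purely for illustration).
n = jnp.arange(p)[:, None]
W_E = jnp.concatenate([jnp.cos(2 * jnp.pi * 5 * n / p),
                       jnp.sin(2 * jnp.pi * 23 * n / p)], axis=1)
W_E = jnp.tile(W_E, (1, d // 2))                # shape (p, d)

F = fourier_basis(p)
energy = jnp.linalg.norm(F @ W_E, axis=1)       # one value per Fourier component
print(jnp.argsort(energy)[-2:])                 # the planted components dominate: cos at k=5 (row 9), sin at k=23 (row 46)

A sparse energy profile of this kind corresponds to the few active frequencies seen for 𝒯nanda; a trained 𝒯miiii embedding spreads its energy over many more components.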