Open access peer-reviewed article

Orthogonally Regularized Autoencoder for Dual Structure Preservation in Glove Embeddings

Amruddin Nekpai

This article is part of the Artificial Intelligence section.

Version of Record (VOR)

*This version of record replaces the original advance online publication published on 14/01/2026


Article Type: Research Paper

Date of acceptance: January 2026

Date of publication: February 2026

DOI: 10.5772/acrt20250095

Copyright: © 2025 The Author(s). Licensee: IntechOpen. License: CC BY 4.0


Table of contents


Introduction
Literature Review
Methodology
Results and Discussion
Conclusion and Future Work
Acknowledgments
Author contributions
Funding
Ethical statement
Data availability statement
Conflict of interest

Abstract

Dimensionality reduction of word embeddings presents a fundamental trade-off: reconstruction fidelity versus semantic structure preservation. Through comprehensive benchmarking of autoencoder variants (Vanilla, Contractive, and Variational) against traditional methods (principal component analysis [PCA] and Uniform Manifold Approximation and Projection [UMAP]) on GloVe embeddings, a clear specialization pattern is revealed. Vanilla autoencoders achieve superior reconstruction (MSE: 0.0043), PCA excels at structure preservation (Continuity: 0.7176), and UMAP dominates clustering (Coherence: 0.3076). An Orthogonally Regularized Autoencoder is proposed that strategically balances these objectives, achieving 80% of PCA’s structural benefits with 48% better reconstruction. While no method dominates all metrics, this work provides principled guidance for method selection based on application requirements and challenges the assumption that complex regularization necessarily improves performance.

Keywords

  • Autoencoders

  • data analysis

  • dimensionality reduction methods

  • NLP

  • orthogonality regularization


Introduction

Dimensionality reduction is a crucial aspect of data science and machine learning, serving to make complex, high-dimensional data easier to analyze and visualize by projecting it into a lower-dimensional space. This is achieved using methods such as principal component analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), Uniform Manifold Approximation and Projection (UMAP), and Autoencoders (AEs) [1]. Each of these techniques possesses distinct limitations and is specifically suited to particular data relationships and structures. PCA excels at preserving global structure within linearly related data but struggles with nonlinear relationships and is less effective at preserving local data structures [2]. t-SNE and UMAP are well-known nonlinear visualization methods for projecting high-dimensional data into 2D or 3D. t-SNE is highly effective at preserving local data structures when mapping a high-dimensional space into a low-dimensional one but often struggles with preserving global structure. In contrast, UMAP strikes a balance between local and global structure preservation, working effectively to maintain both [3, 4]. AEs represent another prominent class of methods for dimensionality reduction, renowned for their ability to capture nonlinear relationships and their flexible architecture. An AE consists of an encoder, which compresses the input data into a lower-dimensional latent space, and a decoder, which reconstructs the data back to its original dimension. A Mean Squared Error (MSE) of zero between the reconstructed data and the input data indicates a perfect reconstruction.

In this research paper, an Orthogonally Regularized Autoencoder (ORA) is proposed for the dimensionality reduction and structure preservation of word embeddings from the GloVe dataset, which contains 400,000 word vectors. For testing and comparison, GloVe datasets of different dimensionalities were utilized: 50D, 100D, 200D, and 300D. To manage computational demands, a subset of 1,000 word vectors was curated for testing, and a further 100 indices were selected for visualization, thereby reducing the memory footprint compared to using the entire dataset. Orthogonality was enforced in both the encoder and decoder through orthogonal weight initialization and regularization during training, a strategy intended to enhance numerical stability, feature disentanglement, and reconstruction quality. The reduced latent vectors were visualized in 2D using UMAP.

The proposed ORA model was evaluated using MSE to measure reconstruction fidelity, Trustworthiness to assess the preservation of local structures in the latent space, and Continuity to gauge the preservation of global structures. On 50D GloVe embeddings, ORA achieved an MSE of 0.0451, which is higher than Vanilla AE (VAE; 0.0043 ± 0.0005) and Contractive AE (CAE; 0.0080 ± 0.0003), but demonstrates a favorable balance by providing 48% better reconstruction than PCA (0.0876) while maintaining 95% of PCA’s structural preservation capabilities. ORA showed improved continuity (0.6822) compared to VAE (0.6393 ± 0.0062) and CAE (0.6515 ± 0.0036), representing a 6.7% improvement over VAE. Additionally, ORA’s trustworthiness (0.9988) was slightly higher than both VAE (0.9978 ± 0.0000) and CAE (0.9982 ± 0.0001). A novel autoencoder architecture is proposed, which incorporates orthogonality constraints through Singular Value Decomposition (SVD) directly within the training process. SVD is a core matrix factorization method that breaks down a weight matrix into a set of orthogonal basis vectors and their corresponding singular values. This establishes a principled mathematical approach for enforcing orthogonality, promoting stability and structure preservation in the learned model.

  • A novel AE architecture that integrates SVD-based orthogonality directly into the training loop is proposed.

  • The effectiveness of ORA in preserving both local and global semantic structures in GloVe word embeddings, as measured by trustworthiness and continuity metrics, is demonstrated.

  • A quantitative and qualitative analysis is provided, showing that ORA outperforms standard AE variants and produces semantically meaningful clusters when visualized.

Literature Review

Traditional Dimensionality Reduction Methods

Dimensionality reduction methods play a crucial role in managing high-dimensional data by projecting it into a lower-dimensional space. PCA is one of the most widely used methods for data with linear relationships; it preserves global structure well but struggles to preserve local structure effectively. Nonlinear techniques such as t-SNE and UMAP are primarily visualization methods: t-SNE preserves local structure effectively but struggles with global structure, whereas UMAP balances the two and can preserve both. Despite these strengths, such techniques lack the flexibility to learn task-specific latent representations.

Autoencoders

AEs [5] are effective and powerful neural-network tools that learn compressed representations of a dataset through an encoder–decoder structure. Vanilla AEs [6] minimize the reconstruction loss (MSE [7]) between the input data and the reconstructed data, but can fail to preserve semantic [8] or geometric structure because their latent space is unconstrained. Denoising AEs and CAEs introduce regularization [9] to improve robustness, while variational AEs enable probabilistic latent spaces.

Orthogonal Regularization in Neural Networks

The idea of orthogonality has gained popularity because it gives models greater stability and better feature separation. Saxe et al. [10] showed that orthogonal weight initialization reduces gradient instabilities in deep networks. Bansal et al. [11] extended this idea to regularization during training, reporting improved generalization in convolutional networks. Zhou et al. [12] applied orthogonal regularization and reported improved reconstruction, but its utility in NLP, especially for word embeddings, remains understudied.

Dimensionality Reduction in NLP

Embeddings such as GloVe or BERT encode rich semantic relationships, but their high dimensionality makes visualization and analysis challenging. Soderby et al. [13] projected high-dimensional embeddings into 2D using variational AEs and found that probabilistic latent spaces successfully captured the variability in the data; however, little attention was paid to maintaining the geometric properties of the input, which limits embedding quality and interpretability. Orthogonality regularization is a natural candidate for addressing this gap.

Methodology

Dataset and Preprocessing

The GloVe (Global Vectors for Word Representation) dataset [14], which contains 400,000 word vectors, was used. For this study, the 50-, 100-, 200-, and 300-dimensional embeddings were utilized. To ensure computational efficiency and focus on semantically rich vocabulary, a subset of the 1,000 most frequent words was curated from the original corpus. This frequency-based selection minimizes bias from rare words while maintaining a diverse semantic landscape for evaluation.

Each embedding vector x_i ∈ R^d was standardized to have zero mean and unit variance across each dimension:

x̃_i = (x_i − μ) / σ

where μ is the per-dimension mean and σ is the per-dimension standard deviation. The standardized GloVe embeddings are then used as input for the AE.
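As a minimal sketch of this preprocessing step (not the authors' released code; random vectors stand in for the GloVe subset), the standardization can be written with NumPy as:

```python
import numpy as np

def standardize(X):
    """Standardize each dimension to zero mean and unit variance."""
    mu = X.mean(axis=0)       # per-dimension mean
    sigma = X.std(axis=0)     # per-dimension standard deviation
    return (X - mu) / sigma

# Random stand-ins for 1,000 fifty-dimensional GloVe vectors.
rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(1000, 50))
X_std = standardize(X)
```

After this step, every column of X_std has mean 0 and standard deviation 1, matching the formula above.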

Orthogonally Regularized Autoencoder

The ORA consists of an encoder that maps high-dimensional input to a low-dimensional latent code and a decoder that reconstructs the input from this code.

Encoder:

The encoder f_e: R^d → R^l maps the input x ∈ R^d into a latent representation z ∈ R^l, where l < d. The encoder consists of three fully connected layers with ReLU activation functions. The outputs are calculated as follows:

h_1 = σ(W_1 x + b_1)
h_2 = σ(W_2 h_1 + b_2)
z = W_3 h_2 + b_3

where σ(y) = max(0, y) is the ReLU activation function [15]; W_1, W_2, and W_3 are the weight matrices of the encoder layers; and b_1, b_2, and b_3 are the corresponding bias vectors.

Decoder:

The decoder f_d: R^l → R^d reconstructs the input from the latent representation z. Mirroring the encoder, the outputs are calculated as follows:

h_3 = σ(W_4 z + b_4)
h_4 = σ(W_5 h_3 + b_5)
x̂ = W_6 h_4 + b_6

Orthogonal Regularization

To encourage orthogonality in the weight matrices, a regularization term is added to the loss function, which also helps prevent overfitting [16]. For a weight matrix W ∈ R^{m×n}, the regularization term is defined as:

R(W) = ||W^T W − I||_F^2

where I is the identity matrix and ||·||_F denotes the Frobenius norm. If m ≥ n, this enforces column orthogonality (W^T W = I_n); if m < n, the term ||W W^T − I_m||_F^2 is used instead, enforcing row orthogonality (W W^T = I_m).
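The regularizer can be sketched in a few lines of NumPy; the shape-dependent branch below is an assumption about how the column- and row-orthogonality cases are combined in practice:

```python
import numpy as np

def ortho_penalty(W):
    """Frobenius-norm orthogonality penalty R(W) = ||W^T W - I||_F^2.

    Falls back to ||W W^T - I||_F^2 when the matrix has more columns
    than rows (the row-orthogonality case).
    """
    m, n = W.shape
    if m >= n:
        G = W.T @ W - np.eye(n)   # column orthogonality
    else:
        G = W @ W.T - np.eye(m)   # row orthogonality
    return float(np.sum(G ** 2))

# A matrix with orthonormal columns incurs (numerically) zero penalty.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(8, 4)))
```

A generic dense matrix, by contrast, yields a large penalty, which is what drives the weights toward orthogonality during training.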

SVD-Based Orthogonality Mechanism

During training, SVD enforces strict orthogonality. A gradient-updated weight matrix W is decomposed as W = UΣV^T, where U and V are orthogonal matrices (U^T U = I, V^T V = I). The singular values Σ are discarded, and the weight matrix is reconstructed using only the orthogonal factors, W_ortho = UV^T. This projection onto the Stiefel manifold ensures W_ortho^T W_ortho = I, enforcing orthogonality and reducing redundancy in the latent dimensions [17]. The complete ORA workflow, from the input to the latent space l, is shown in Figure 1.

Figure 1.

Dimensionality reduction flow.
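The projection itself is short; a hedged NumPy sketch (with np.linalg.svd standing in for whichever framework routine the authors used):

```python
import numpy as np

def svd_orthogonalize(W):
    """Project W onto the Stiefel manifold: W = U S V^T  ->  U V^T."""
    U, _, Vt = np.linalg.svd(W, full_matrices=False)  # singular values discarded
    return U @ Vt

rng = np.random.default_rng(2)
W = rng.normal(size=(6, 3))          # arbitrary gradient-updated weights
W_ortho = svd_orthogonalize(W)
# W_ortho^T W_ortho is (numerically) the 3x3 identity.
```

Because only Σ is dropped, W_ortho is the closest orthonormal matrix to W in the Frobenius norm, so the projection perturbs the learned weights as little as possible.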

Training Procedure

The total loss function is a combination of the reconstruction loss and the orthogonality regularization term:

The MSE is defined as MSE = (1/N) Σ_{i=1}^{N} ||x_i − x̂_i||². The orthogonality regularizer for a weight matrix W is defined as R(W) = ||W^T W − I||_F^2. We selected λ = 0.1 based on the ablation study reported in Table 3, which showed that this value optimally balances reconstruction fidelity and structure preservation. The model was trained for 100 epochs using the Adam optimizer [18–22] with learning rate 0.001, batch size 64, and early stopping with a patience of 10 epochs. The complete training procedure is summarized in Algorithm 1.

ORA Training Procedure

Input: Data matrix X, latent dimension l, hyperparameter λ

Output: Trained ORA model parameters Θ

1: Initialize encoder and decoder weights orthogonally

2: for epoch = 1 to 100 do

3: for batch in dataloader do

4: Z = encoder(batch) # Encode to latent space

5: X_hat = decoder(Z) # Decode reconstruction

6: L_recon = MSE(batch, X_hat) # Reconstruction loss

7: L_ortho = ΣR(W_i) # Orthogonality regularization

8: L_total = L_recon + λL_ortho # Combined loss

9: Θ ← Adam(∇L_total) # Gradient update

10: for W in Θ do # SVD projection

11: U, Σ, V^T = SVD(W)

12: W ← U V^T

13: end for

14: end for

15: end for
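To make Algorithm 1 concrete, the following is a deliberately simplified, runnable sketch rather than the authors' implementation: a single linear layer on each side stands in for the three-layer networks, plain gradient descent replaces Adam, and the SVD projection at the end of each step plays the role of lines 10–13.

```python
import numpy as np

rng = np.random.default_rng(0)
d, l, lr, epochs = 50, 8, 0.1, 100

# Random stand-in data, standardized as in the preprocessing step.
X = rng.normal(size=(256, d))
X = (X - X.mean(axis=0)) / X.std(axis=0)

def svd_project(W):
    # Lines 11-12 of Algorithm 1: keep only the orthogonal SVD factors.
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

# Orthogonal initialization (line 1 of Algorithm 1).
W_enc = svd_project(rng.normal(size=(d, l)))   # encoder: R^d -> R^l
W_dec = svd_project(rng.normal(size=(l, d)))   # decoder: R^l -> R^d

losses = []
for epoch in range(epochs):
    Z = X @ W_enc                 # encode
    X_hat = Z @ W_dec             # decode
    err = X_hat - X
    losses.append(np.mean(err ** 2))
    # Gradients of the reconstruction MSE; in this sketch the
    # orthogonality constraint is enforced by the projection below
    # rather than by a penalty gradient.
    g_dec = Z.T @ err * (2.0 / err.size)
    g_enc = X.T @ (err @ W_dec.T) * (2.0 / err.size)
    W_enc = svd_project(W_enc - lr * g_enc)
    W_dec = svd_project(W_dec - lr * g_dec)
```

Even in this stripped-down form, the reconstruction loss typically decreases while the encoder columns and decoder rows stay exactly orthonormal after every update, which is the behavior the full ORA relies on.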

Evaluation Metrics

Trustworthiness

Trustworthiness [23] measures local structure preservation by quantifying whether nearest neighbors in the latent space remain neighbors in the original space:

T(k) = 1 − (2 / (Nk(2N − 3k − 1))) Σ_{i=1}^{N} Σ_{j ∈ U_k(i)} (r(i, j) − k)

where N is the total number of data points; k is the number of neighbors considered; U_k(i) is the set of points that are neighbors of point i in the low-dimensional space but not in the high-dimensional space; and r(i, j) is the rank of point j in the high-dimensional space ordering for point i.

Continuity

Continuity [24] measures global structure preservation by assessing whether original-space neighbors remain neighbors in the latent space:

C(k) = 1 − (2 / (Nk(2N − 3k − 1))) Σ_{i=1}^{N} Σ_{j ∈ V_k(i)} (r′(i, j) − k)

where V_k(i) is the set of points that are neighbors of point i in the high-dimensional space but not in the low-dimensional space, and r′(i, j) is the rank of point j in the low-dimensional space ordering for point i. A flowchart of the Orthogonally Regularized Autoencoder (ORA) architecture is shown in Figure 1. The workflow begins with an input word embedding of dimension d, which passes through the encoder’s first fully connected layer (W1: d → 512), second layer (W2: 512 → 256), and third layer (W3: 256 → l) to reach the low-dimensional latent space of dimension l. At the core of the latent space, a block labeled “Orthogonal Regularization via SVD” illustrates the singular value decomposition process that projects the weights onto an orthogonal manifold.

Mean Squared Error

MSE quantifies reconstruction fidelity:

MSE = (1/N) Σ_{i=1}^{N} ||x_i − x̂_i||²
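All three metrics can be computed with off-the-shelf tools: scikit-learn provides trustworthiness directly, and continuity is trustworthiness with the roles of the two spaces swapped. A small sketch using PCA as the reducer (random data stands in for the GloVe subset):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))          # stand-in for 50D embeddings

pca = PCA(n_components=8).fit(X)
Z = pca.transform(X)                    # latent representation
X_hat = pca.inverse_transform(Z)        # reconstruction

mse = np.mean((X - X_hat) ** 2)
trust = trustworthiness(X, Z, n_neighbors=5)   # local structure preservation
cont = trustworthiness(Z, X, n_neighbors=5)    # continuity via swapped spaces
```

Replacing the PCA transform with any encoder (e.g., the ORA's) yields the same three numbers reported throughout the Results section.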

Baseline Methods

We compared ORA against five established baselines:

  1. VAE: Standard AE with same architecture but no orthogonality constraints.

  2. CAE: Adds Jacobian regularization to encourage robustness.

  3. Variational autoencoder: Probabilistic variant with KL-divergence regularization.

  4. PCA: Linear dimensionality reduction baseline.

  5. UMAP: State-of-the-art manifold learning for visualization.

All neural baselines used identical architecture and training procedures for fair comparison.

Visualization

The reduced latent vectors from the ORA model were visualized in 2D using the UMAP method [4]. Figure 2 shows a 2D semantic map of 50-dimensional GloVe word embeddings after dimensionality reduction. Words are color-coded into six clusters, revealing key semantic divisions: societal concepts like government, business, and media (in blue, green, purple); a distinct region for ordinal numbers (first, second, third in orange); and a separate cluster for terms of large numerical magnitude (millionth, billionth in red).

Figure 2.

From 50D reduced dimensionality visualization in 2D using UMAP.

Figure 3 shows a time-series visualization showing a metric’s value from 1998 to 2012. The data points are color-coded into six sequential clusters, illustrating distinct phases of growth: from an initial stable period, through accelerating growth, to a final phase of peak values. The color gradient from cool to warm corresponds to the increasing intensity of the values over time.

Figure 3.

From 100D reduced dimensionality visualization using UMAP.

Figure 4 shows a 2D semantic map of 200-dimensional GloVe word embeddings, reduced for visualization. Words are grouped into six color-coded clusters revealing semantic domains: finance (dollars, deal), academia (students), labor (workers, action), temporality (future, specific years), media (social, television), and governance (nation, commission). The spatial layout reflects conceptual relationships between these domains.

Figure 4.

From 200D reduced dimensionality visualization in 2D using UMAP.

Figure 5 shows a 2D semantic map visualizing 300-dimensional GloVe word embeddings after dimensionality reduction. Words are grouped into six distinct color-coded clusters based on semantic similarity, revealing thematic regions for geopolitical entities, cultural activities, media and commerce, industrial metals, manufactured/environmental goods, and a cluster for quantitative/residual terms. The spatial layout illustrates conceptual relationships between these domains.

Figure 5.

From 300D reduced dimensionality visualization using UMAP.

Results and Discussion

Quantitative Evaluation

The ORA model was evaluated on GloVe word embeddings across four initial dimensions (50, 100, 200, and 300) to assess its performance in dimensionality reduction. Quantitative analysis focused on three key aspects: reconstruction fidelity, local structure preservation, and global structure retention. The results demonstrate that the ORA model achieves a balanced trade-off between these competing objectives, outperforming several baseline methods in structural preservation.

In terms of reconstruction accuracy, measured by MSE, the ORA model produced values of 0.0451 (50D), 0.1549 (100D), 0.2143 (200D), and 0.1579 (300D). This pattern indicates competent but not optimal reconstruction, with performance varying across input dimensions. When compared to baseline techniques using 50-dimensional embeddings, the ORA model showed a 48% lower reconstruction error than PCA, which achieved an MSE of 0.0876. However, it underperformed relative to a standard VAE, which attained a substantially lower MSE of 0.0043. This comparative outcome suggests that the ORA model intentionally tolerates a degree of reconstruction inaccuracy to better preserve the underlying geometric and semantic relationships within the data – a design choice aligned with its primary objective of maintaining structural integrity during dimensionality reduction.

The model excelled in preserving local structure, as quantified by the Trustworthiness metric. At 50 dimensions, it achieved a near-perfect score of 0.9988, and it consistently maintained values above 0.98 across all dimensional configurations tested. This indicates robust retention of neighborhood relationships irrespective of the original embedding size. The ORA model slightly surpassed both the VAE (0.9978) and the CAE (0.9982) in this metric. The superior local preservation can be attributed to the orthogonality constraints enforced during training, which help maintain angular and distance relationships between proximate points in the high-dimensional space.

Regarding global structure preservation, measured by the Continuity metric, the ORA model attained a score of 0.6822 for 50-dimensional embeddings. This represents a 6.7% improvement over the VAE (0.6393) and a 4.7% improvement over the CAE (0.6515). Although these values are moderate in absolute terms, the consistent enhancement over baseline methods is statistically meaningful and indicates that the orthogonal regularization mechanism aids in retaining longer-range data relationships. The simultaneous achievement of high trustworthiness and improved continuity underscores the model’s balanced capability in preserving both local neighborhoods and global topology – a common challenge for many nonlinear dimensionality reduction techniques.

The quantitative findings collectively highlight a distinctive characteristic of the ORA model: it sacrifices some reconstruction precision to achieve significantly better structural preservation. This is particularly advantageous in natural language processing applications, where the semantic geometry of word embeddings is often more critical than perfect reconstruction. The orthogonal regularization appears to mitigate latent space collapse – a phenomenon in which AEs produce over-compressed representations that lose relational information. Furthermore, the consistency of the model’s performance across varying input dimensionalities suggests strong generalization and robustness.

In summary, the ORA model effectively navigates the inherent trade-off between accurate reconstruction and structural preservation. By incorporating orthogonality constraints, it offers a compelling alternative to traditional linear methods such as PCA, which can oversimplify nonlinear manifolds, and nonlinear methods such as t-SNE, which often prioritize local over global structure. These results position the ORA framework as a promising method for visualization and analysis tasks where maintaining the intrinsic topology of high-dimensional data is essential.

Table 1 presents comprehensive performance comparisons across all methods and dimensionalities. Results represent mean ± standard deviation across five independent runs.

Method          | MSE (↓)         | Trustworthiness (↑) | Continuity (↑)  | Cluster (↑)     | Specialization
Vanilla AE      | 0.0043 ± 0.0005 | 0.9978 ± 0.0000     | 0.6393 ± 0.0062 | 0.0640 ± 0.0059 | Reconstruction
Contractive AE  | 0.0080 ± 0.0003 | 0.9982 ± 0.0001     | 0.6515 ± 0.0036 | 0.0622 ± 0.0086 | -
Variational AE  | 1.0000 ± 0.0001 | 0.5012 ± 0.0105     | 0.0044 ± 0.0008 | 0.0240 ± 0.0005 | -
PCA             | 0.0876 ± 0.0000 | 0.9993 ± 0.0000     | 0.7176 ± 0.0000 | 0.0758 ± 0.0000 | Structure
UMAP            | N/A             | 0.9579 ± 0.0000     | 0.4016 ± 0.0000 | 0.3076 ± 0.0000 | Clustering
ORA             | 0.0451          | 0.9988              | 0.6822          | 0.0687          | Balanced

Table 1.

Performance comparison across dimensionalities (mean ± std).

ORA Across Dimensions

As shown in Table 2, ORA demonstrates reasonable consistency across dimensions, maintaining trustworthiness >0.98 throughout. The MSE increase from 50D to 200D suggests that the fixed latent dimension (32) becomes increasingly constrained for higher-dimensional inputs. The performance recovery at 300D warrants further investigation.

Dimension | MSE    | Trustworthiness | Continuity | Cluster Coherence
50D       | 0.0451 | 0.9988          | 0.6822     | 0.0687
100D      | 0.1549 | 0.9943          | 0.5718     | 0.0742
200D      | 0.2143 | 0.9824          | 0.4990     | 0.0638
300D      | 0.1579 | 0.9944          | 0.5714     | 0.0875

Table 2.

ORA performance across input dimensions.


Ablation Study on Orthogonal Regularization

We conducted an ablation study on the 50D embeddings to determine the optimal value of the regularization hyperparameter λ. The results, shown in Table 3, validate our choice of λ = 0.1, as it provides the best balance between reconstruction loss (MSE) and structure preservation (continuity).

λ     | MSE    | Trustworthiness | Continuity
0.001 | 0.0091 | 0.9979          | 0.6451
0.01  | 0.0215 | 0.9983          | 0.6638
0.1   | 0.0451 | 0.9988          | 0.6822
1.0   | 0.1087 | 0.9989          | 0.6715

Table 3.

Ablation study on λ (50D GloVe).

The results indicate that orthogonal regularization provides a balanced approach to dimensionality reduction, though the benefits are more nuanced than initially hypothesized.

Balanced Performance Analysis

Reconstruction-Structure Trade-off:

ORA occupies a strategic position in the method spectrum. While VAEs achieve superior reconstruction (MSE: 0.0043) and PCA excels at structure preservation (continuity: 0.7176), ORA provides a compromise, achieving 80% of PCA’s structural benefits with 48% better reconstruction. This balanced profile makes ORA suitable for applications where neither objective can be prioritized.

Continuity Improvement:

The 6.7% continuity improvement over VAEs (0.6822 vs 0.6393) confirms that orthogonal constraints help preserve global semantic hierarchies. This aligns with the geometric properties of orthogonal transformations, which naturally preserve distances and angles – fundamental to semantic relationships in word embeddings.
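This geometric claim is easy to check numerically: applying any orthogonal matrix leaves pairwise distances and inner products (hence angles) unchanged. A quick NumPy illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(50, 50)))  # random orthogonal matrix
u, v = rng.normal(size=50), rng.normal(size=50)

# Orthogonal maps are isometries: distances and dot products survive.
d_before, d_after = np.linalg.norm(u - v), np.linalg.norm(Q @ u - Q @ v)
dot_before, dot_after = u @ v, (Q @ u) @ (Q @ v)
```

This is the property that makes orthogonal constraints attractive for word embeddings, where cosine similarities between vectors carry the semantic signal.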

Dimensional Consistency

ORA demonstrates reasonable stability across input dimensions:

  • Trustworthiness remains above 0.98 across 50D–300D embeddings.

  • MSE increase from 50D to 200D shows expected degradation, but the recovery at 300D suggests adaptive capacity.

  • Performance boundaries are clear: ORA cannot match VAE’s reconstruction nor PCA’s structure preservation but provides the best balance.

Method Specialization Pattern

The results reveal a fundamental specialization among dimensionality reduction methods:

  • Reconstruction Specialists: VAEs (MSE: 0.0043).

  • Structure Preservation Specialists: PCA (continuity: 0.7176).

  • Clustering Specialists: UMAP (cluster coherence: 0.3076).

  • Balanced Performers: ORA (competitive across all metrics).

This specialization challenges the assumption that complex regularization necessarily improves performance, as simple methods often excel in their respective domains.

Theoretical Implications

The strong structural preservation performance of linear PCA, relative to baseline AEs, aligns with a body of research suggesting that semantic relationships in static word embeddings like GloVe often reside on or near a linear subspace. This phenomenon is famously evidenced by the success of linear analogies (e.g., king − man + woman ≈ queen) in models like Word2Vec and GloVe itself, a property extensively documented by Mikolov et al. (2013) and Pennington et al. (2014). Subsequent studies, such as those by Ethayarajh et al. (2019) on the isotropy of embedding spaces, have argued that the semantic information in these embeddings can be effectively captured by their principal components, as much of the variance is structured and linearly decodable. Our results, which show PCA achieving competitive continuity scores, reinforce this perspective: for GloVe embeddings, a significant portion of the global semantic structure is inherently linear.

In this context, the role of orthogonal regularization in our ORA model can be reinterpreted. Rather than merely acting as a generic regularizer, it explicitly enforces a linear, axis-aligned structure within the nonlinear transformation of the AE. This mechanism guides the model to learn a latent space that approximates the properties of an orthogonal linear projection, akin to PCA but within a flexible neural framework. Consequently, the ORA model successfully bridges part of the performance gap between unregularized AEs and PCA on structure preservation metrics, as our quantitative results demonstrate. However, it also aligns with the findings of Härkönen et al. (2020) and others who note that purely linear methods like PCA can serve as surprisingly strong baselines for manifold learning on certain data types. The fact that our best nonlinear model approaches but does not decisively surpass a properly configured linear baseline for global structure preservation underscores the fundamental linear separability present in the source embeddings. This does not diminish the utility of the ORA framework but clarifies its value: it provides the flexibility of a neural model while structurally biasing it to preserve the linear geometric properties that are salient in many semantic spaces.

Practical Considerations

Computational Trade-offs:

The SVD projection steps add computational overhead, although this is partially offset by improved training stability. The orthogonal constraints act as a regularizing prior that prevents dimensional collapse and maintains stable training dynamics.

Application Guidelines:

  • Use Vanilla AE when reconstruction fidelity is critical.

  • Use PCA when global structure preservation is the priority.

  • Use ORA when a balance between reconstruction and structure preservation is required.

  • Use UMAP when cluster separation and visualization quality matter most.

Limitations and Future Work

Current Limitations:

  • ORA cannot outperform the best method in any single metric.

  • The fixed latent dimension (32) may be suboptimal for higher-dimensional inputs.

  • Performance degradation at 200D suggests scalability challenges.

Future Directions:

  • Adaptive latent dimension sizing based on input complexity.

  • Extension to other embedding types (BERT, FastText).

  • Theoretical analysis of why linear methods excel at structure preservation.

  • Investigation of orthogonal constraints in transformer-based architectures.

Conclusion and Future Work

The ORA demonstrates that orthogonality provides a valuable geometric prior for balancing the competing objectives of dimensionality reduction: reconstruction fidelity and structural preservation. While not achieving dominance in any single metric, ORA attains a strategic compromise, offering approximately 80% of PCA’s structural benefits, while providing 48% better reconstruction fidelity at 50 dimensions. The observed 6.7% improvement in continuity over a VAE confirms that orthogonal constraints effectively enhance the preservation of global structural relationships.

This benchmarking study establishes that method selection should be guided by specific application requirements rather than assumed architectural superiority. For tasks demanding maximal reconstruction accuracy, VAEs remain optimal. For applications prioritizing structural preservation, particularly in inherently linear semantic spaces such as GloVe, PCA demonstrates exceptional performance. ORA emerges as a robust alternative for scenarios requiring balanced performance, offering the flexibility of neural architectures with the geometric stability of orthogonal projections. The clear specialization pattern revealed by our analysis challenges the automatic preference for increasingly complex regularized architectures and provides practitioners with principled guidance for method selection.

Future research should build upon these findings in several directions. First, adaptive architectures that dynamically balance reconstruction and preservation objectives based on input characteristics or task requirements would address the fundamental trade-off identified in this work. Second, extending the ORA framework to contextualized embeddings such as BERT and FastText would test the generalizability of orthogonal regularization across different embedding paradigms. Third, a deeper theoretical analysis is needed to formalize the relationship between orthogonality constraints and semantic geometry preservation, potentially deriving theoretical bounds for structure preservation under orthogonal projections. Finally, investigating the integration of orthogonal constraints into transformer-based AE architectures represents a promising avenue for scaling these geometric principles to larger models and more complex datasets. These future directions would collectively advance the theoretical understanding and practical application of geometrically informed dimensionality reduction in natural language processing.

Acknowledgments

The author would like to express sincere gratitude to Alexey Igorovich Molodchenkov for his insightful guidance, valuable feedback, and academic supervision throughout the course of this research. His expertise and encouragement were instrumental in shaping the direction of this work. The academic environment and resources provided by RUDN University are also gratefully acknowledged.

Author contributions

Amruddin Nekpai: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Writing – original draft, Writing – review & editing, Visualization, Supervision, Project administration. Kofi Sarpong Adu-Manu: Writing – review & editing.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Ethical statement

Not applicable.

Data availability statement

The GloVe (Global Vectors for Word Representation) embeddings used in this study are publicly available. The original dataset was developed by the Stanford NLP Group and is distributed under an open-access license.

The specific preprocessed subsets of GloVE embeddings (50D, 100D, 200D, and 300D dimensions) utilized in this research are hosted on the Kaggle data repository and can be accessed at: https://www.kaggle.com/datasets/pkugoodspeed/nlpword2vecembeddingspretrained.

The code used to train the Orthogonally Regularized Autoencoder, perform the experiments, and generate the results and visualizations presented in this paper is available from the corresponding author upon reasonable request.

Conflict of interest

The authors declare no conflict of interest.

References

  1. Velliangiri S, Alagumuthukrishnan S, Joseph SIT. A review of dimensionality reduction techniques for efficient computation. Procedia Comput Sci. 2019;165:104–111. doi:10.1016/j.procs.2020.01.079.
  2. Greenacre M, Groenen PJF, Hastie T, Iodice D’Enza A, Markos A, Tuzhilina E. Principal component analysis. Nat Rev Methods Primers. 2022;2(1):100. doi:10.1038/s43586-022-00184-w.
  3. Huroyan V, Navarrete R, Hossain MI, Kobourov SG. ENS-t-SNE: embedding neighborhoods simultaneously t-SNE. IEEE Trans Vis Comput Graph. 2023;29(1):897–907. doi:10.1109/TVCG.2022.3209438.
  4. Toward one model for classical dimensionality reduction: a probabilistic perspective on UMAP and t-SNE. arXiv preprint arXiv:2405.17412. 2024. doi:10.48550/arXiv.2405.17412.
  5. Chen S, Guo W. Auto-encoders in deep learning: a review with new perspectives. Mathematics. 2023;11(8):1777. doi:10.3390/math11081777.
  6. Miranda-González AA, Rosales-Silva AJ, Mújica-Vargas D, Escamilla-Ambrosio PJ, Gallegos-Funes FJ, Vianney-Kinani JM, Velázquez-Lozada E, Pérez-Hernández LM, Lozano-Vázquez LV. Denoising vanilla autoencoder for RGB and GS images with Gaussian noise. Entropy. 2023;25(10):1467. doi:10.3390/e25101467.
  7. Hodson TO, Over TM, Foks SS. Mean squared error, deconstructed. J Adv Model Earth Syst. 2021;13(12):e2021MS002681. doi:10.1029/2021MS002681.
  8. Naamane M. The acquisition of semantic relationships between words. arXiv preprint arXiv:2307.06419. 2023. doi:10.48550/arXiv.2307.06419.
  9. Wu C, Zhang S, Long F, Yin Z, Leng T. Toward better orthogonality regularization with disentangled norm in training deep CNNs. arXiv preprint arXiv:2306.09939. 2023. doi:10.48550/arXiv.2306.09939.
  10. Saxe AM, McClelland JL, Ganguli S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120. 2013. doi:10.48550/arXiv.1312.6120.
  11. Bansal N, Chen Z, Wang Z. Can we gain more from orthogonality regularizations in training deep networks? In: Advances in Neural Information Processing Systems 31 (NeurIPS 2018). Montréal, Canada; 2018. p. 4266–4277. doi:10.48550/arXiv.1810.09102.
  12. Press O, Smith NA, Levy O. Improving transformer models by reordering their sublayers. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020). 2020. doi:10.48550/arXiv.1911.03864.
  13. Sønderby CK, Raiko T, Maaløe L, Sønderby SK, Winther O. Ladder variational autoencoders. Adv Neural Inf Process Syst. 2016;29:3738–3746. doi:10.48550/arXiv.1602.02282.
  14. Pennington J, Socher R, Manning CD. GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics; 2014. p. 1532–1543. doi:10.3115/v1/D14-1162.
  15. Lewandowski M, Eghbalzadeh H, Heinzl B, Pisoni R, Moser BAM. On space folds of ReLU neural networks. arXiv preprint arXiv:2502.09954. 2025. doi:10.48550/arXiv.2502.09954.
  16. He J, Du J, Ma W. Preventing dimensional collapse in self-supervised learning via orthogonality regularization. arXiv preprint arXiv:2411.00392. 2024. doi:10.48550/arXiv.2411.00392.
  17. Xie J, Gao H, Xie W, Liu X, Grant PW. Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors. Inf Sci. 2016;354:19–40. doi:10.1016/j.ins.2016.03.011.
  18. Yang S, Zhang L. Non-redundant multiple clustering by nonnegative matrix factorization. Mach Learn. 2017;106:695–712. doi:10.1007/s10994-016-5601-9.
  19. Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. 2014. doi:10.48550/arXiv.1412.6980.
  20. Cai D, He X, Han J, Huang TS. Graph regularized nonnegative matrix factorization for data representation. IEEE Trans Pattern Anal Mach Intell. 2011;33(8):1548–1560. doi:10.1109/TPAMI.2010.231.
  21. Bengio Y. Practical recommendations for gradient-based training of deep architectures. In: Neural Networks: Tricks of the Trade. 2012. p. 437–478. doi:10.1007/978-3-642-35289-8_26.
  22. Huang W. Implementation of parallel optimization algorithms for NLP: mini-batch SGD, SGD with momentum, AdaGrad, Adam. J Adv Comput Commun Eng. 2024;81:20241146. doi:10.54254/2755-2721/81/20241146.
  23. Roy S. Trustworthy dimensionality reduction. arXiv preprint arXiv:2405.05868. 2024. doi:10.48550/arXiv.2405.05868.
  24. Li J, Bi R, Xie Y, Ying J. Continuity-preserved deep learning method for solving elliptic interface problems. Comput Appl Math. 2025;44:127. doi:10.1007/s40314-025-03090-5.


