SSPP: Smiles School Projects Proceedings 2025

Selected papers
SSPP identifies and showcases research that drives innovation through pioneering explorations of emerging concepts, critical analysis of cutting-edge technology applications, and rigorous empirical studies that establish reliable foundations for future scholarly work.


Natural disasters, particularly floods, cause significant damage worldwide, exceeding $40 billion annually. Traditional forecasting methods are often inadequate due to reliance on limited datasets. This project investigates the development of an advanced generalized flood monitoring model using self-supervised learning (SSL) on multi-source geospatial data. The core of the project involves preparing a general model using a self-supervised approach on a combination of Sentinel-1 (S1), Sentinel-2 (S2), and Digital Elevation Model (DEM) data. Subsequently, this generalized model will be fine-tuned for the downstream task of water surface detection, creating three specialized models (S1+DEM, S2+DEM, S1+S2+DEM). The primary goal is to empirically test whether this SSL-based transfer learning approach produces more robust and accurate models than traditional direct training methods. This approach aims to address the critical challenge of annotated data scarcity in diverse geographic regions, potentially reducing annotation costs and enabling scalable flood monitoring in under-resourced areas.
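As a rough illustration of the fine-tuning stage, the sketch below (assuming PyTorch; the encoder, band counts, and tile sizes are placeholders rather than the project's actual architecture) shows how one SSL-pretrained backbone could be specialized into the three water-detection models by varying which modalities are stacked as input channels.

```python
# Minimal sketch (assumed PyTorch): fine-tuning an SSL-pretrained encoder for
# binary water-surface segmentation. `SSLEncoder` and the channel counts are
# illustrative placeholders, not the project's actual architecture.
import torch
import torch.nn as nn

class SSLEncoder(nn.Module):
    """Stand-in for the encoder pretrained with self-supervision on S1/S2/DEM stacks."""
    def __init__(self, in_channels: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
        )
    def forward(self, x):
        return self.net(x)

class WaterSegmenter(nn.Module):
    """Encoder + 1x1 head producing per-pixel water logits."""
    def __init__(self, encoder: SSLEncoder, hidden: int = 64):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Conv2d(hidden, 1, kernel_size=1)
    def forward(self, x):
        return self.head(self.encoder(x))

# The three specializations differ only in which modalities are stacked as channels:
# S1 (2 polarizations) + DEM (1), S2 (here 10 bands) + DEM, or all sources together.
configs = {"S1+DEM": 3, "S2+DEM": 11, "S1+S2+DEM": 13}
models = {name: WaterSegmenter(SSLEncoder(c)) for name, c in configs.items()}

x = torch.randn(4, 3, 128, 128)                        # a fake S1+DEM tile batch
logits = models["S1+DEM"](x)                           # (4, 1, 128, 128)
loss = nn.BCEWithLogitsLoss()(logits, torch.zeros_like(logits))
loss.backward()
```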

Authors: Ayrat Abdullin, Anna Korotkova, Dmitriy Ryazanov, Ruslan Dzharkinov, Maria Smirnova, Marat Saibodalov, Mariia Ulianova, Muhammad Awais
Curators: Ilya Novikov, Svetlana Illarionova
Link to pdf
This project investigates adversarial attacks on benchmark datasets used to evaluate large language models (LLMs). We aim to design subtle perturbations of MMLU benchmark questions that selectively degrade the performance of a target LLM while minimally affecting other models. Such attacks expose vulnerabilities in current evaluation pipelines and highlight the need for more robust benchmarks. Using TextAttack as the main framework, we will generate and validate adversarial examples, measuring their differential impact across multiple LLMs.
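A minimal sketch of the differential-impact measurement, assuming the TextAttack package; `answer_question` is a hypothetical helper for querying an LLM (not part of TextAttack), and the perturbation recipe shown is only one of many the framework provides.

```python
# Minimal sketch (assumes `textattack`): perturbing an MMLU-style question and
# measuring how differently each model is affected. The question and the stub
# `answer_question` helper are illustrative assumptions.
from textattack.augmentation import EmbeddingAugmenter

augmenter = EmbeddingAugmenter(pct_words_to_swap=0.1, transformations_per_example=1)

question = ("Which gas makes up the largest share of Earth's atmosphere? "
            "(A) oxygen (B) nitrogen (C) argon (D) carbon dioxide")
perturbed = augmenter.augment(question)[0]

def differential_impact(model_names, answer_question, correct="B"):
    """Per-model accuracy drop on the perturbed question (1 = this model breaks)."""
    impact = {}
    for name in model_names:
        before = answer_question(name, question) == correct
        after = answer_question(name, perturbed) == correct
        impact[name] = int(before) - int(after)
    return impact

def answer_question(model_name: str, prompt: str) -> str:
    """Hypothetical stand-in for querying an LLM; replace with real API calls."""
    return "B"

print(differential_impact(["target-llm", "reference-llm"], answer_question))
```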

Authors: Anastasiia Orlova, Nina Gubina, Ivan Dubrovsky, Illarion Iov, Jamilya Erkenova
Curators: Irena Gureeva, Alexey Zaytsev
Link to pdf
In this study, we present the first systematic evaluation of the effectiveness of Synolitic Graph Neural Networks (SGNNs), which transform high-dimensional biomedical data into sample-specific graphs via ensembles of pairwise classifiers. We enrich these graphs with topology-aware node descriptors (degree, strength, closeness, betweenness) and apply sparsification using either minimum-connected graphs or edge retention at a fixed probability. We train convolution-based (GINE) and attention-based (GATv2) graph neural networks for the tabular classification task under two regimes: a foundation setting formed by concatenating datasets, and dataset-specific training. In the foundation regime, GINE with minimum-connected sparsification and node features achieves the best ROC-AUC of 88.54. In the separate-datasets regime, GATv2 with p=0.8 edge retention and node features obtains the best ROC-AUC of 82.03. In both regimes, GNNs enhanced with node features consistently outperform the XGBoost baseline (ROC-AUC of 75.70 / 70.27), demonstrating that SGNN-induced structure combined with topology-aware augmentation and appropriate sparsification provides an effective recipe for biomedical classification.
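A minimal sketch of the node enrichment and probabilistic sparsification described above, assuming NetworkX; the graph here is synthetic, whereas in the study each sample-specific graph comes from the ensemble of pairwise classifiers.

```python
# Minimal sketch (assumed NetworkX/NumPy): topology-aware node descriptors and
# p = 0.8 edge retention for one sample-specific graph.
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)
n = 8
G = nx.gnp_random_graph(n, 0.5, seed=0)
for u, v in G.edges:
    G[u][v]["weight"] = float(rng.uniform(0.1, 1.0))     # pairwise-classifier scores

degree      = dict(G.degree())                           # number of neighbours
strength    = dict(G.degree(weight="weight"))            # sum of incident edge weights
closeness   = nx.closeness_centrality(G)
betweenness = nx.betweenness_centrality(G, weight="weight")

# Stack into an (n_nodes, 4) feature matrix attached to the graph sample.
node_features = np.array(
    [[degree[i], strength[i], closeness[i], betweenness[i]] for i in G.nodes]
)

# One of the two sparsification options: keep each edge independently with p = 0.8
# (the alternative, a minimum-connected graph, could use nx.minimum_spanning_tree).
p = 0.8
kept = [(u, v) for u, v in G.edges if rng.uniform() < p]
G_sparse = G.edge_subgraph(kept).copy()
print(node_features.shape, G_sparse.number_of_edges())
```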

Authors: Artem Sosedka, Ivan Sviridov, Anastasia Linich, Ernest Nasyrov
Curators: Alexey Zaikin, Daniil Vlasenko, Vadim Ushakov, Denis Zakharov
Link to pdf
Lymphoma is a diverse group of blood cancers that develop in the lymphatic system and require accurate histopathological diagnosis to guide treatment. Examining biopsy slides manually is both time-consuming and subject to differences in interpretation between pathologists. Although deep learning (DL) has shown promise in automating histopathology analysis, most prior studies have focused on binary classification or a small number of lymphoma subtypes, often using limited datasets. These models typically rely on hierarchical features extracted by convolutional neural networks (CNNs), which are effective at capturing local patterns but may miss broader structural characteristics of the tissue. To address these limitations, this project uses a privately collected dataset of approximately 190,000 histopathology image tiles across five categories: Classic Hodgkin Lymphoma (cHL), Diffuse Large B-Cell Lymphoma (DLBCL), Follicular Lymphoma (FL), Mantle Cell Lymphoma (MCL), and Reactive (non-malignant) tissue. We develop a topology-aware CNN model that combines features learned by a pretrained CNN with topological features derived from persistent homology. This fusion enables the model to capture both local image characteristics and global structural patterns within histopathology images. All experiments use patient-level stratified splits (80/10/10) to avoid data leakage across tiles from the same patient. The proposed approach demonstrates improved classification performance over baseline models. These results indicate that integrating topological data analysis (TDA) into DL frameworks not only enhances diagnostic accuracy but also offers clinically useful insights for supporting robust and reliable lymphoma diagnosis.
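The sketch below illustrates the fusion idea, assuming the `ripser` package for persistent homology; the point-cloud construction, summary statistics, and feature dimensions are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch (assumes `ripser` and PyTorch): extracting simple persistent-homology
# summaries from an image tile and concatenating them with CNN features.
import numpy as np
import torch
from ripser import ripser

def topo_summary(gray_tile: np.ndarray, n_points: int = 400) -> np.ndarray:
    """Sample dark pixels as a point cloud and summarise its H0/H1 diagrams."""
    ys, xs = np.nonzero(gray_tile < gray_tile.mean())      # crude nucleus mask
    idx = np.random.choice(len(xs), size=min(n_points, len(xs)), replace=False)
    cloud = np.stack([xs[idx], ys[idx]], axis=1).astype(float)
    dgms = ripser(cloud, maxdim=1)["dgms"]
    feats = []
    for dgm in dgms:                                        # H0, then H1
        finite = dgm[np.isfinite(dgm[:, 1])]
        pers = finite[:, 1] - finite[:, 0]
        feats += [len(finite), pers.sum(), pers.max(initial=0.0)]
    return np.array(feats, dtype=np.float32)                # 6 topological features

tile = np.random.rand(224, 224)                             # stand-in for an H&E tile
cnn_features = torch.randn(512)                             # e.g. pooled ResNet features
fused = torch.cat([cnn_features, torch.from_numpy(topo_summary(tile))])
print(fused.shape)
```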

Authors: Daniyal Asif, Daniil Sulimov
Curators: Svetlana Illarionova
Link to pdf
The first approach based on self-supervised learning (SSL) with parameter-efficient fine-tuning is proposed for the automated identification of tumor-associated macrophages (TAMs) on standard hematoxylin and eosin (H&E)-stained sections in diffuse large B-cell lymphoma (DLBCL). The approach uses a ResNet-18 architecture with SimCLR-based contrastive learning and the NT-Xent (Normalized Temperature-scaled Cross Entropy) loss, enhanced by Low-Rank Adaptation (LoRA) for efficient parameter optimization. Trained on 23 DLBCL whole-slide image (WSI) microphotographs (856 manually annotated TAMs) supplemented with the LyNSeC and Immunocto sets, the NT-Xent-trained ResNet-18 achieved accuracy exceeding that of traditional transfer learning. The proposed system successfully distinguishes TAMs on H&E slides, eliminating the need for immunohistochemical (IHC) staining. However, despite the promising results, larger clinical trials are needed to confirm the prognostic significance.
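A minimal sketch of the NT-Xent loss at the core of the SimCLR-style pretraining, assuming PyTorch; batch size, projection dimension, and temperature are illustrative.

```python
# Minimal sketch (assumed PyTorch) of the NT-Xent contrastive loss used in
# SimCLR-style pretraining.
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """z1, z2: (N, d) projections of two augmented views of the same N patches."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)       # (2N, d)
    sim = z @ z.T / tau                                       # scaled cosine similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))                # drop self-similarity
    # The positive pair for index i is i + N (and vice versa).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

z1 = torch.randn(32, 128, requires_grad=True)   # projections of view 1
z2 = torch.randn(32, 128, requires_grad=True)   # projections of view 2
loss = nt_xent(z1, z2)
loss.backward()
print(loss.item())
```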

Authors: Anastasiia Studenikina, Danil Sulimov, Dmitry Zvezdin, Arsenii Galimov, Olga Filimonova, Alibek Epkhiev, Ekaterina Anpilogova
Curators: Svetlana Illarionova
Learn more
This paper investigates token homogenization, the convergence of token representations toward uniformity across transformer layers, and its relationship to positional bias in large language models. We empirically examine whether homogenization occurs and how positional bias amplifies this effect. Through layer-wise similarity analysis and controlled experiments, we demonstrate that tokens systematically lose distinctiveness during processing, particularly when biased toward extremal positions. Our findings confirm both the existence of homogenization and its dependence on positional attention mechanisms.
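A minimal sketch of the layer-wise similarity analysis, assuming the Hugging Face `transformers` library; GPT-2 is used only as a small stand-in, not one of the models studied in the paper.

```python
# Minimal sketch (assumes `transformers`): tracking how similar token representations
# become across layers, the homogenization effect probed above.
import torch
from transformers import AutoModel, AutoTokenizer

name = "gpt2"                                   # small stand-in model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True).eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states    # (num_layers + 1) x (1, T, d)

for layer, h in enumerate(hidden_states):
    h = torch.nn.functional.normalize(h[0], dim=-1)          # (T, d)
    sim = h @ h.T                                             # pairwise cosine similarity
    off_diag = sim[~torch.eye(sim.size(0), dtype=torch.bool)]
    print(f"layer {layer:2d}: mean inter-token similarity = {off_diag.mean().item():.3f}")
```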

Authors: Viacheslav Yusupov, Danil Maksimov, Ameliia Alaeva, Tatiana Zaitceva, Anna Antipina, Anna Vasileva, Chenlin Liu, Rayuth Chheng, Danil Sazanakov, Andrey Chetvergov
Curators: Egor Shvetsov, Alina Ermilova
Link to pdf
Recent advances in diffusion-based large language models (dLLMs) position them as a promising alternative to autoregressive decoders for text generation. This study evaluates dLLMs on the task of translating in-code comments from Chinese into Russian and English. We benchmark two diffusion LLMs (DiffuCoder and Dream) against a size-matched autoregressive baseline (Qwen2.5). Using a bilingual corpus of paired code and comments, we analyze generation behavior and report BLEU and COMET scores together with wall-clock generation latency measured under identical hardware and decoding settings. Experimental results indicate that, under comparable conditions, diffusion models can better preserve in-code context and technical terminology and — in several configurations — achieve improvements in automatic translation metrics while offering favorable trade-offs between quality and inference time. This work provides a focused benchmark for code-comment translation and highlights the practical potential of diffusion-based methods for code intelligence and machine translation tasks.
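A minimal sketch of the quality and latency measurements, assuming the `sacrebleu` package; the `translate` callable and sample strings are illustrative, and COMET scoring (handled analogously with the `comet` package) is not shown.

```python
# Minimal sketch (assumes `sacrebleu`): scoring a candidate comment translation with
# BLEU and timing generation, two of the measurements reported above.
import time
import sacrebleu

references = ["Initialize the connection pool before the first request."]

def translate(comment_zh: str) -> str:
    """Stand-in for a diffusion or autoregressive decoder; replace with a real model."""
    return "Initialize the connection pool before the first request."

start = time.perf_counter()
hypotheses = [translate("在第一次请求之前初始化连接池。")]
latency = time.perf_counter() - start

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}, latency = {latency * 1e3:.1f} ms")
```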

Authors: Alexander Dikov, Vladimir Zvorygin
Curators: Valentin Malykh
Link to pdf
Transporting between arbitrary distributions poses a fundamental challenge in generative modelling. Although diffusion bridges and flow matching methods offer elegant solutions that have been widely applied to unpaired domain translation problems through Schrödinger bridge (SB) or optimal transport (OT) formulations in continuous domains, their application to discrete state spaces (e.g., text or graphs) remains limited. We address this issue by introducing a novel framework called Categorical Iterative Proportional Fitting (C-IPF) that extends the SB method to discrete settings using discrete diffusion models. Our approach constructs a sequence of distributions that converges to the SB solution, enabling principled transport between high-dimensional categorical distributions. We demonstrate the effectiveness of C-IPF on the Swiss-roll dataset by qualitatively evaluating sample fidelity and diversity. Initial experiments on Colored MNIST using Discrete Flow Matching (DFM) reveal the framework's potential for unpaired translation, while also highlighting current limitations in feature preservation, particularly with regard to color consistency. This work establishes a foundation for discrete Schrödinger bridges, thereby expanding the scope of generative modelling for categorical data.

Authors: Nikita Ligostaev, Vladimir Latypov, Viacheslav Iablochnikov, Maria Nesterova, Ramil Khafizov
Curators: Sergei Kholkin, Alexander Korotin
Link to pdf
Central banks face significant challenges in communicating effectively with the general public, which is essential for influencing economic decision-making and fostering trust that can lead to lower inflation expectations. Traditional focus groups, while valuable for capturing subjective perceptions, have logistical limitations, making the use of Large Language Models (LLMs) for pre-testing communications a promising solution to enhance understanding and improve the effectiveness of central bank messages.
This project develops a synthetic focus group (SFG) composed of LLM agents representing individuals with varying sentiments towards the central bank, based on their interviews. The LLM avatars were created using the Deepseek model, calibrated with existing materials from the bank, and compared against interview perceptions. To facilitate the SFG discussion, several prominent LLMs, including Deepseek-r1, Mistral-7b-Instruct, and Gemini-2.5-Pro via OpenRouter, were evaluated. Feedback from the Deepseek-r1 model was used to enhance the central bank's report, which was then re-submitted for further discussion.
Experimental results demonstrate that SFGs are a viable alternative to traditional focus groups. Revisions to a Bank of Russia document, informed by SFG feedback, were statistically validated using two new SFGs (identical agents, no prior discussion context). The statistical analysis revealed a significant improvement in the perception of the enhanced central bank communication compared to the original text. The mean aggregate score increased from 2.5 (SD = 3.0) for the original text to 12.7 (SD = 2.5) for the revised version, demonstrating a substantial positive shift in sentiment.
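A minimal sketch of one synthetic focus-group round, assuming the `openai` Python client pointed at OpenRouter; the personas, prompt, and model identifier are illustrative assumptions rather than the project's calibrated avatars.

```python
# Minimal sketch (assumes the `openai` client and an OpenRouter API key): persona
# agents reacting to a central-bank text. The model slug is an assumed identifier.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_OPENROUTER_KEY")

personas = {
    "sceptic":   "You distrust official inflation statistics and bank statements.",
    "neutral":   "You have no strong opinion about the central bank.",
    "supporter": "You generally trust the central bank's communication.",
}

report_excerpt = "The key rate decision aims to return inflation to the 4% target."

for name, persona in personas.items():
    reply = client.chat.completions.create(
        model="deepseek/deepseek-r1",             # assumed OpenRouter model identifier
        messages=[
            {"role": "system", "content": persona},
            {"role": "user", "content": "React to this statement in two sentences and "
                                        f"rate its clarity from -5 to 5: {report_excerpt}"},
        ],
    )
    print(name, "->", reply.choices[0].message.content)
```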

Authors: Darya Dubinina, Fernando Leon, Timur Zakarin, Lyudmila Zavadskaya
Curators: Alina Evstigneeva
Link to pdf
Retrieval-Augmented Generation (RAG) chatbots are transforming banking by providing real-time, accurate assistance through dynamic retrieval from trusted financial sources. However, these systems face significant challenges with hallucinations (plausible but factually incorrect outputs), which pose serious risks in financial contexts, including regulatory violations and erosion of customer trust. To address these challenges, we present an optimized RAG pipeline that evaluates both lexical (BM25) and semantic (Chroma Vector Store) retrieval methods, rigorously tested using METEOR, Factual Correctness, Faithfulness, and Context Precision metrics. Our best-performing configuration, Recursive Text Splitting with Semantic Retrieval (ChromaDB at k = 8), achieves a Factual Correctness score of 0.48 and a METEOR score of 0.39, while significantly outperforming hybrid and lexical approaches in faithfulness (0.85) and context precision (0.59). These results demonstrate that semantic retrieval, when paired with recursive text splitting and manual text processing, offers the most reliable balance of accuracy and coherence for financial RAG-based LLM chatbots.
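A minimal sketch contrasting the two retrieval paths, assuming the `rank_bm25` and `chromadb` packages; the toy corpus, chunking, and k are illustrative, and the full pipeline additionally applies recursive text splitting and the RAG metrics listed above.

```python
# Minimal sketch (assumes `rank_bm25` and `chromadb`): lexical vs. semantic retrieval
# over a toy banking corpus.
import chromadb
from rank_bm25 import BM25Okapi

docs = [
    "The key policy rate was raised to curb inflation expectations.",
    "Card payments abroad are subject to currency-control limits.",
    "Deposit insurance covers retail accounts up to the statutory limit.",
]
query = "Why did the central bank raise the key rate?"

# Lexical path: BM25 over whitespace tokens.
bm25 = BM25Okapi([d.lower().split() for d in docs])
bm25_scores = bm25.get_scores(query.lower().split())
print("BM25 top doc:", docs[int(bm25_scores.argmax())])

# Semantic path: Chroma with its default embedding function.
collection = chromadb.Client().create_collection("bank_docs")
collection.add(documents=docs, ids=[f"d{i}" for i in range(len(docs))])
hits = collection.query(query_texts=[query], n_results=2)
print("Chroma top doc:", hits["documents"][0][0])
```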

Authors: Rustam Taktashov
Curators: Alexander Senin
Link to pdf
Automatically detecting harmful actions (such as violence, theft, or unsafe behavior) in security videos remains challenging. The difficulty stems from three main issues: the meaning of an action can change depending on the situation, such events are rare, and current systems trigger too many false alarms. At the same time, multimodal models show promise in understanding and interpreting videos; however, they may have critical weaknesses in security applications. In this project, we aim to address these potential problems and adapt VLMs to the task.

Authors: Kirill Borodin, Kirill Kondrashov, Inna Larina, Kseniia Gladkova
Curators: Anastasia Yaschenko
Link to pdf
Recent advances in automated discovery, exemplified by Google’s AlphaEvolve framework, demonstrate the effectiveness of integrating large language models (LLMs) with evolutionary search for complex optimization tasks. AlphaEvolve sets a new standard in this domain through candidate code generation and rigorous evaluation mechanisms. In this study, we run and analyze the OpenEvolve agent, an open-source implementation of AlphaEvolve. We further extend our analysis to the applied Computer-Aided Design (CAD) reconstruction problem, demonstrating the approach’s versatility and establishing a comprehensive experimental benchmark for this task. In addition, we explore applying AlphaEvolve to combinatorial geometry problems. Our source code is available on GitHub: https://github.com/crogs-foundation/openevolve-smiles25.

Authors: Dmitrii Beresnev, Roman Khalikov, Ivan Ulitin, Aleksandr Tolmachev, Ainura Zakirova
Curators: Vladimir Makharev, Petr Anokhin

Link to pdf
We investigate whether large language models (LLMs) encode clinically meaningful ICD-10 disease–disease structure or primarily mirror textual similarity. At the nosology level, we construct and compare multiple similarity matrices: (i) statistical co-occurrence estimated from MIMIC-IV, (ii) MedBERT embeddings trained on a vast amount of electronic health record (EHR) data, (iii) BERT embeddings of ICD-10 descriptions, (iv) Yandex Doc Search model embeddings, (v) a sequence-only masked-language model (MLM) trained on ICD sequences from MIMIC-IV, and (vi) adjacency scores elicited from several LLMs via standardized prompts. LLM-derived similarities align most with document/text embeddings (Spearman ρ ≈ 0.09–0.14) and only weakly with EHR-based signals (Spearman ρ ≈ 0.00–0.07). Taking into account the obtained p-values < 0.05 for these correlations, we conclude that there is no monotonic relationship between the LLMs’ results and real EHRs, but there is a slight monotonic dependence between the LLMs’ results and texts. These results indicate that without domain grounding, current LLM judgments largely capture surface text associations rather than population co-occurrence. We release a simple, reproducible validation protocol for benchmarking disease–disease structure across text, EHR-trained models, and LLMs, providing practical guidance for safely using LLM signals in clinical analytics and a foundation for future alignment with representative EHR data.
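A minimal sketch of the matrix-level comparison, assuming SciPy; the similarity matrices here are random stand-ins for the EHR-based and LLM-elicited matrices described above.

```python
# Minimal sketch (assumes SciPy/NumPy): Spearman correlation between the upper
# triangles of two disease-disease similarity matrices.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 50                                                   # number of ICD-10 codes
ehr_sim = rng.random((n, n)); ehr_sim = (ehr_sim + ehr_sim.T) / 2   # e.g. co-occurrence
llm_sim = rng.random((n, n)); llm_sim = (llm_sim + llm_sim.T) / 2   # LLM-elicited scores

iu = np.triu_indices(n, k=1)                             # compare only distinct pairs
rho, p_value = spearmanr(ehr_sim[iu], llm_sim[iu])
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3g}")
```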

Authors: Dmitrii Kornilov, Sofiia Samoilova, Sofia Senotrusova
Curators: Alina Ermilova
Link to pdf
This work compares dynamic activation sparsity and unstructured weight sparsity during the training of a transformer language model. Although weight pruning is widely used for model compression, activation sparsity, which operates similarly to a mixture-of-experts, remains understudied despite its potential for faster training and improved model quality. To address this gap, we implement both sparsity types using iterative magnitude-based pruning, with the main goal of training GPT2-124M on 1B tokens of a RedPajama subset. Comparing these approaches allows us to identify the most effective way to improve generation quality and accelerate training.
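A minimal sketch of the two mechanisms being compared, assuming PyTorch; sparsity levels, layer sizes, and the top-k activation rule are illustrative choices rather than the exact training setup.

```python
# Minimal sketch (assumed PyTorch): one step of unstructured magnitude pruning of
# weights vs. dynamic top-k activation sparsity.
import torch
import torch.nn as nn

def magnitude_prune_(weight: torch.Tensor, sparsity: float) -> None:
    """Zero the smallest-magnitude weights in place (one iterative-pruning step)."""
    k = int(sparsity * weight.numel())
    if k == 0:
        return
    threshold = weight.abs().flatten().kthvalue(k).values
    weight.mul_((weight.abs() > threshold).float())

class TopKActivation(nn.Module):
    """Keep only the k largest activations per token, zeroing the rest."""
    def __init__(self, k: int):
        super().__init__()
        self.k = k
    def forward(self, x):
        vals, idx = x.topk(self.k, dim=-1)
        return torch.zeros_like(x).scatter(-1, idx, vals)

layer = nn.Linear(768, 3072)
with torch.no_grad():
    magnitude_prune_(layer.weight, sparsity=0.5)           # weight sparsity

mlp = nn.Sequential(layer, TopKActivation(k=512))           # activation sparsity
out = mlp(torch.randn(2, 16, 768))
print("weight zeros:", (layer.weight == 0).float().mean().item(),
      "activation zeros:", (out == 0).float().mean().item())
```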

Authors: Elizaveta Shlychkova, Alexandr Serkov, Maksim Borisov, Alexandr Andreev, Lev Shepelev
Curators: Dmitry Redko, Vladislav Goloshchapov, Maxim Zhelnin, Egor Shvetsov
Link to pdf
Large Language Models (LLMs) are increasingly utilized across diverse domains. This work investigates their potential as assessment tools for student selection through a case study of applicants to a Machine Learning Summer School (MLS). Analyzing anonymized applications, human assessor scores from the selection phase, and longitudinal performance metrics collected during and after the program, we demonstrate that LLMs not only match but can outperform human evaluators in predicting long-term student success. However, we also identify a critical dependence of LLM performance on the evaluation criteria provided. Establishing objective criteria a priori, before performance data becomes available, remains a key challenge for future research. In particular, systematic analysis of biases and their mitigation strategies remains unexplored.

Authors: Ekaterina Andrianova, Alexander Vavilkin, Makar Korchagin, Kseniia Shkuleva, Daniil Musin
Curators: Alina Ermilova, Egor Shvetsov, Irena Gureeva
Link to pdf
This work performs a foundational preliminary study of semi-structured activation sparsification in pretrained large language models (LLMs). We systematically analyze patterns of performance degradation and develop mitigation techniques to inform future efficiency research. While activation pruning offers theoretical efficiency benefits, its practical application to off-the-shelf models remains limited by uncharacterized degradation profiles and insufficient recovery methods. To address these challenges, we conduct comprehensive experiments across the Llama2-7B, Llama3.1-8B, Qwen2.5-7B, and Gemma3-4B models.
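A minimal sketch of a 2:4-style semi-structured sparsification applied to activations, assuming PyTorch; the group size and keep count are illustrative, and the study's recovery techniques are not shown.

```python
# Minimal sketch (assumed PyTorch): within every contiguous group of 4 activation
# values, only the 2 largest in magnitude are kept.
import torch

def semi_structured_sparsify(x: torch.Tensor, group: int = 4, keep: int = 2) -> torch.Tensor:
    """x: (..., d) with d divisible by `group`."""
    *lead, d = x.shape
    g = x.reshape(*lead, d // group, group)
    idx = g.abs().topk(keep, dim=-1).indices
    mask = torch.zeros_like(g, dtype=torch.bool).scatter(-1, idx, True)
    return (g * mask).reshape(*lead, d)

acts = torch.randn(2, 8, 16)                         # (batch, tokens, hidden)
sparse_acts = semi_structured_sparsify(acts)
# At most 2 of every 4 values survive, i.e. at most d/2 = 8 nonzeros per token.
assert (sparse_acts != 0).sum(dim=-1).max() <= 8
```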

Authors: Alina Kostromina, Shirin Alanova, Alexey Dontsov, Ekaterina Galaeva, Kristina Kazistova, Vladimir Smirnov, Anastasia Chernysheva, Petr Mikhailov
Curators: Dmitry Redko, Vladislav Goloshchapov, Maxim Zhelnin, Egor Shvetsov
Link to pdf
While modern Vision-Language Models (VLMs) excel in general image understanding, adapting them to specialized domains like historical handwriting recognition remains challenging. Naive fine-tuning risks catastrophic forgetting, degrading pre-trained reasoning and generation capabilities. This project explores Parameter-Efficient Fine-Tuning (PEFT), particularly Low-Rank Adaptation (LoRA), to adapt VLMs for multilingual historical OCR while preserving foundational skills. Using LLaMA-Factory, we fine-tune Qwen2.5-VL on historical manuscripts (Digital Peter, IAM, VMLHD). Our contribution is a dual evaluation protocol assessing task-specific OCR metrics (WER, CER, BLEU-4) and general knowledge retention (MMLU). We hypothesize that optimized LoRA tuning can achieve competitive OCR performance while retaining baseline capabilities, enabling scalable foundation model adaptation for digital humanities.
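A minimal sketch of the task-specific half of the dual evaluation protocol, assuming the `jiwer` package; the sample transcriptions are invented, and BLEU-4 and MMLU scoring are not shown.

```python
# Minimal sketch (assumes `jiwer`): word and character error rates for OCR output.
import jiwer

references = ["писано въ лѣто 1709", "the quick brown fox"]
hypotheses = ["писано в лето 1709", "the quick brown fax"]

wer = jiwer.wer(references, hypotheses)
cer = jiwer.cer(references, hypotheses)
print(f"WER = {wer:.3f}, CER = {cer:.3f}")
```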

Authors: Ilya Trofimenko, Artem Vorozhtsov, Egor Ushakov, Irina Maltseva
Curators: Maxim Novopoltzev, Ruslan Murtazin, Alexandr Tulenkov
Link to pdf
While modern Vision-Language Models (VLMs) excel in general image understanding, adapting them to specialized domains like historical handwriting recognition remains challenging. Naive fine-tuning risks catastrophic forgetting, degrading pre-trained reasoning and generation capabilities. This project explores Parameter-Efficient Fine-Tuning (PEFT), particularly Low-Rank Adaptation (LoRA), to adapt VLMs for multilingual historical OCR while preserving foundational skills. Using LLaMA-Factory, we fine-tune Qwen2.5-VL on historical manuscripts (Digital Peter, IAM, VMLHD). Our contribution is a dual evaluation protocol assessing task-specific OCR metrics (WER, CER, BLEU-4) and general knowledge retention (MMLU). We hypothesize that optimized LoRA tuning can achieve competitive OCR performance while retaining baseline capabilities, enabling scalable foundation model adaptation for digital humanities.

Authors: Elizaveta Slavinskaia, Ivan Korzun, Alexandra Shuvaeva, Vadim Porvatov
Curators: Konstantin Egorov
Link to pdf
Traditional sequential recommender systems model each item as a unique embedding and rely primarily on collaborative signals. At the same time, rich item metadata and multimodal content such as text, images, and audio remain underused. A common practice is to initialize item embeddings with external content vectors or feed them as inputs, but this does not fully exploit available information. Our work on semantic item representations offers an alternative by compressing high-dimensional features into discrete codes that integrate naturally with transformer architectures. In this project, we implement and evaluate several content-aware representations for sequential recommendation, including semantic IDs generated by a residual quantized variational autoencoder (RQ-VAE), simple content embeddings, and fusion schemes. We test these approaches in a realistic evaluation scenario to assess whether they are worth implementing in real pipelines to improve the quality of recommendations.
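A minimal sketch of the residual quantization step behind semantic IDs, assuming NumPy; the codebooks here are random, whereas an RQ-VAE learns them jointly with an encoder and decoder.

```python
# Minimal sketch (assumed NumPy): greedy residual quantization turning a content
# embedding into a short tuple of discrete codes (a "semantic ID").
import numpy as np

rng = np.random.default_rng(0)
dim, levels, codebook_size = 32, 3, 256
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(levels)]

def semantic_id(item_embedding: np.ndarray) -> list[int]:
    """Each level quantizes the residual left by the previous codebook."""
    residual, codes = item_embedding, []
    for book in codebooks:
        idx = int(np.argmin(np.linalg.norm(book - residual, axis=1)))
        codes.append(idx)
        residual = residual - book[idx]
    return codes

item_embedding = rng.normal(size=dim)            # e.g. a content embedding of an item
print("semantic ID:", semantic_id(item_embedding))
```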

Authors: Mark Filatov, Danil Gusak, Ekaterina Mozhegova
Curators: Anna Volodkevich
Link to pdf
In this work we investigate the capabilities of large language models (LLMs) in solving proof-based mathematical tasks. We construct a dataset of theorem-proof pairs and generate corrupted versions of proofs using LLMs to simulate common logical errors. These paired examples are then used to fine-tune a compact model, Qwen2.5-Math, using Direct Preference Optimization (DPO). We evaluate model improvements through LLM-based assessments. Median correctness improved from 65% (before training) to 90%, which shows the benefits of preference-based fine-tuning.
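A minimal sketch of the DPO objective used for fine-tuning, assuming PyTorch; the log-probabilities are fabricated scalars standing in for sequence log-likelihoods under the policy and a frozen reference model.

```python
# Minimal sketch (assumed PyTorch): the DPO loss for a (correct proof, corrupted proof)
# preference pair.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """All arguments are summed log-probs of full sequences under each model."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# One preference pair with fabricated log-probabilities.
loss = dpo_loss(torch.tensor([-42.0]), torch.tensor([-57.0]),
                torch.tensor([-45.0]), torch.tensor([-55.0]))
print(loss.item())
```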

Authors: German Roev, Yana Fitkovskaya, Albina Klepach, Pavel Borisenko
Curators: Irina Piontkovskaya
Link to pdf
Analyzing large-scale medical records can reveal valuable insights into disease mechanisms and their latent interconnections. However, manual analysis is infeasible due to data volume and asymptomatic disease progression, which often leads to severe outcomes (e.g., cancer). Machine learning (ML) offers scalable solutions trained on data of different types, such as patients' histories in the form of sequences of codes from the International Classification of Diseases (ICD), text descriptions of diseases, etc. However, existing methods suffer from critical limitations: they yield inconsistent results and lack interpretability, making it unclear whether predictions derive from genuine clinical patterns or superficial lexical correlations (e.g., similarities in diagnosis names).

To address this, we aim to compare disease interconnections derived from real data using several methods with those derived from ICD codes and their text descriptions. Additionally, we compare these with disease interconnections obtained from a large language model (LLM) drawing on its own world knowledge.

Authors: Ekaterina Laptenkova, Anastasia Kolesnikova, Ekaterina Podplutova
Curators: Alina Ermilova, Dmitrii Kornilov
Link to pdf
We propose Regularized Alignment Loss (RAL), a framework for learning compact and predictive embeddings from time series by directly aligning present and future states. Unlike contrastive or reconstruction-based methods, RAL formulates the objective in a CCA-inspired way that extends naturally to linear, kernel, and neural variants. Our experiments on synthetic and real-world datasets demonstrate that RAL improves multi-horizon forecasting and consistently outperforms classical dimensionality reduction methods such as PCA and CCA. Moreover, in anomaly detection, RAL achieves state-of-the-art performance across multiple benchmarks, highlighting its effectiveness for both predictive and diagnostic tasks. These results point to predictive latent embeddings as a simple and versatile foundation for temporal representation learning.

Authors: Aleksandr Yugay, Hang Cui, Semyon Yakushov, Anatoly Kiryushin, Timur Sibagatullin
Curators: Alexey Zaytsev
Link to pdf
Reinforcement Learning (RL) has achieved significant progress in mathematical reasoning tasks. However, empirical evidence suggests that current RL approaches primarily sharpen the output answer distribution—favoring high-confidence responses—rather than encouraging genuine model exploration. To address this limitation, this paper proposes a novel reward framework that combines internal self-consistency signals with external verification rewards, thereby enhancing both the accuracy and exploratory capacity of LLMs in mathematical reasoning.
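A minimal sketch of a combined reward of this kind, in plain Python; the weighting scheme and the exact-match verifier are illustrative assumptions rather than the paper's formulation.

```python
# Minimal sketch (plain Python): mixing an internal self-consistency signal
# (agreement among sampled answers) with an external verification reward.
from collections import Counter

def combined_reward(sampled_answers: list[str], reference: str, alpha: float = 0.5) -> float:
    """alpha * self-consistency + (1 - alpha) * external verification."""
    majority_answer, majority_count = Counter(sampled_answers).most_common(1)[0]
    self_consistency = majority_count / len(sampled_answers)
    external = 1.0 if majority_answer.strip() == reference.strip() else 0.0
    return alpha * self_consistency + (1 - alpha) * external

samples = ["12", "12", "15", "12"]                 # answers sampled from the policy
print(combined_reward(samples, reference="12"))    # 0.5 * 0.75 + 0.5 * 1.0 = 0.875
```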

Authors: Daria Voronkova, Pengyi Li, Matvei Skripkin, Bo Li, Ekaterina Boyarina
Curators: Irina Piontkovskaya
Link to pdf