Projects

Explore my data science and development projects

MET Art Display Predictor

ML-powered prediction of artwork exhibition likelihood - using data from the Metropolitan Museum of Art

Gradient Boosting, MLOps, Data Analysis

Analysis of public data from the Metropolitan Museum of Art, with a gradient boosting model that predicts whether a given artwork is on view

This project uses data made available by The Metropolitan Museum of Art under the Open Access program.

MET Open Access Dataset: https://github.com/metmuseum/openaccess

Dataset Analysis

Class Distribution

On View: 10.8%
Not On View: 89.2%
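
With roughly one artwork in ten on view, plain accuracy is a weak yardstick: a model that always answers "not on view" already scores about 89%. The sketch below illustrates this on a hypothetical 1,000-item test set that mirrors the class split; the counts are invented for illustration and are not results from this project. The Matthews correlation coefficient (MCC) exposes the trivial model as uninformative.

```python
import math


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of correct predictions."""
    return (tp + tn) / (tp + fp + fn + tn)


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is 0 when a row or column of the matrix is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical test set: 108 on view (positive), 892 not on view (negative).
# A trivial model that always predicts "not on view":
always_negative = dict(tp=0, fp=0, fn=108, tn=892)

baseline_accuracy = accuracy(**always_negative)  # 0.892
baseline_mcc = mcc(**always_negative)            # 0.0
```

Any trained model should therefore be judged against this 89.2% accuracy floor, or with an imbalance-aware metric such as MCC.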

Master's Thesis - Modality fusion in enzyme interaction prediction

Modality-fusion strategies for a transformer predicting enzyme-substrate interactions

Bioinformatics, Deep Learning, Modality Fusion
Objective
Improve state-of-the-art prediction of protein-small molecule interactions by incorporating 3D protein structural data into a pre-trained multimodal transformer, enhancing model generalizability and robustness.
Key Responsibilities & Actions
  • Constructed a novel dataset of ~185,000 data points by mapping protein sequences to 3D structures from AlphaFold using the Graphein library.
  • Represented protein structures as graphs and generated node embeddings using the Node2Vec algorithm.
  • Designed & implemented novel modality fusion techniques (additive & multiplicative) to merge protein sequence and structural embeddings before feeding them into a pre-trained transformer, avoiding costly retraining from scratch.
  • Trained and evaluated model performance end-to-end, including a gradient boosting (XGBoost) ensemble step, across various data split scenarios (random, cold protein, cold SMILES).
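
As a rough illustration of the additive and multiplicative fusion mentioned above, the sketch below combines a sequence embedding and a structure embedding element-wise. It assumes both embeddings have already been projected to the same dimension; the function names are hypothetical, and the thesis's actual fusion layers may differ (for example, by using learned projections or gating).

```python
from typing import List, Sequence


def _check_same_length(a: Sequence[float], b: Sequence[float]) -> None:
    if len(a) != len(b):
        raise ValueError("embeddings must share the same dimension")


def additive_fusion(seq_emb: Sequence[float], struct_emb: Sequence[float]) -> List[float]:
    """Element-wise sum of sequence and structure embeddings."""
    _check_same_length(seq_emb, struct_emb)
    return [s + g for s, g in zip(seq_emb, struct_emb)]


def multiplicative_fusion(seq_emb: Sequence[float], struct_emb: Sequence[float]) -> List[float]:
    """Element-wise (Hadamard) product of sequence and structure embeddings."""
    _check_same_length(seq_emb, struct_emb)
    return [s * g for s, g in zip(seq_emb, struct_emb)]


# Toy 2-dimensional embeddings, for illustration only.
fused_add = additive_fusion([1.0, 2.0], [0.5, -1.0])        # [1.5, 1.0]
fused_mul = multiplicative_fusion([1.0, 2.0], [0.5, -1.0])  # [0.5, -2.0]
```

Both operations keep the embedding dimension unchanged, which is what allows the fused vector to be fed to a pre-trained transformer without changing its input size.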
Key Results
  • Validated that the model maintains high performance (~97% accuracy) on standard random splits.
  • Demonstrated improved generalization, achieving a ~1% accuracy increase on challenging "cold start" splits with unseen small molecules—the primary failure mode of the original model.
  • Showcased significantly better performance (via MCC score) for rare, underrepresented substrates, highlighting the model's potential for novel discovery.
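
A "cold" split holds out whole entities rather than individual data points, so every test molecule (or protein) is unseen during training. The following is a minimal sketch of that idea, assuming each data point is a `(protein_id, smiles)` pair; the function name, toy data, and 20% test fraction are illustrative assumptions, not the thesis's actual splitting code.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]  # (protein_id, smiles)


def cold_split(
    pairs: List[Pair], key_index: int, test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Pair], List[Pair]]:
    """Split pairs so that no key (protein or SMILES) occurs in both sets.

    key_index=0 gives a cold-protein split, key_index=1 a cold-SMILES split.
    """
    keys = sorted({pair[key_index] for pair in pairs})
    random.Random(seed).shuffle(keys)
    n_test = max(1, round(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])
    train = [p for p in pairs if p[key_index] not in test_keys]
    test = [p for p in pairs if p[key_index] in test_keys]
    return train, test


# Toy data, for illustration only.
toy_pairs = [
    ("P1", "CCO"), ("P1", "CC(=O)O"), ("P2", "CCO"),
    ("P3", "c1ccccc1"), ("P2", "c1ccccc1"),
]
train_pairs, test_pairs = cold_split(toy_pairs, key_index=1)
```

Because the held-out keys are chosen before the pairs are assigned, a molecule that appears with several proteins lands entirely in one set.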

Transformer-based Lane Detection

Transformer-based lane detection that leverages spatiotemporal information from sequential video frames to robustly handle dynamic road conditions

Computer Vision, Deep Learning, Transformers
Objective
Improve the robustness of lane detection by leveraging spatiotemporal context from video sequences with a Transformer-based architecture.
Personal Responsibilities & Actions
  • Curated and preprocessed large-scale driving datasets (CULane: 55h / 133k frames, TuSimple), converting lane point annotations into segmentation masks.
  • Designed and applied a data augmentation pipeline (random flips, rotations ±30°, crops, and normalization) to improve robustness to camera pose changes, occlusions, and viewpoint shifts.
  • Built a transformer-based segmentation pipeline that leverages temporal information across video frames to improve lane detection consistency.
  • Evaluated and replaced a ResNet34 backbone with a self-supervised MAE-ViT, validating superior lane-specific feature extraction via custom image-reconstruction experiments.
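
Converting lane point annotations into segmentation masks amounts to drawing a polyline through the annotated points on an empty image. The sketch below shows one way to do this in plain Python, assuming points are given as integer `(x, y)` pixel coordinates; the function name and the `thickness` parameter are illustrative, and a real pipeline would more likely use an image library's polyline drawing routine.

```python
from typing import List, Tuple

Point = Tuple[int, int]  # (x, y) in pixel coordinates


def rasterize_lane(
    points: List[Point], width: int, height: int, thickness: int = 1
) -> List[List[int]]:
    """Return a binary mask (list of rows) with the lane polyline set to 1."""
    mask = [[0] * width for _ in range(height)]
    half = thickness // 2
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        # Sample enough points that consecutive samples touch each other.
        steps = max(abs(x1 - x0), abs(y1 - y0), 1)
        for i in range(steps + 1):
            x = round(x0 + (x1 - x0) * i / steps)
            y = round(y0 + (y1 - y0) * i / steps)
            # Widen the line horizontally; skip pixels outside the image.
            for dx in range(-half, half + 1):
                if 0 <= x + dx < width and 0 <= y < height:
                    mask[y][x + dx] = 1
    return mask


# Toy lane with three annotated points on a 5x5 image.
lane_mask = rasterize_lane([(1, 0), (1, 2), (3, 4)], width=5, height=5)
for row in lane_mask:
    print("".join("#" if v else "." for v in row))
```

For multiple lanes, the same routine can be applied per lane and the masks merged, or each lane can be written with its own label value to obtain instance-style masks.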
Key Results
  • Demonstrated MAE-ViT's advantage over CNN backbones for capturing long, thin, and partially occluded lane structures.
  • Showed that temporal modeling with transformers stabilizes training and improves qualitative lane predictions under challenging conditions.
  • Delivered a validated proof-of-concept under strict hardware constraints.

Lindel

Reproducing, Evaluating and Interpreting the Lindel Model for CRISPR-Cas9 Repair Outcome Prediction

Bioinformatics, Machine Learning, Model Interpretation

🧬 Project Overview

In a collaborative academic project, our team reproduced and validated key findings from Chen et al.’s seminal work on predicting CRISPR-Cas9 DNA repair outcomes. We successfully replicated the core functionality of their machine learning model, Lindel, confirming both its superiority over a baseline model and the high reproducibility of mutational outcomes across experiments.


🔍 Individual Research Focus

My individual contribution centered on interpreting the model’s internal logic to critically evaluate the authors’ claims regarding sequence context importance.


🎯 Objective

Assess whether the logistic regression model used to predict the insertion–deletion (indel) ratio learns the same sequence context importance as reported in the original paper.


🛠️ Methods

To interpret and validate the model’s behavior, I applied a complementary set of interpretability techniques:

  • Analysis of the learned weights of the logistic regression model
  • SHAP (SHapley Additive exPlanations) for feature importance
  • Multiple Correspondence Analysis (MCA) to examine relationships in categorical sequence data
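
For a linear model such as logistic regression, SHAP values have a closed form in logit space when features are treated as independent: each feature's contribution is its weight times the feature's deviation from its background mean. The sketch below illustrates that relationship with made-up weights and data; it is not the analysis code from this project, and the names are hypothetical.

```python
from typing import List, Sequence


def linear_shap(
    weights: Sequence[float], x: Sequence[float], background: Sequence[Sequence[float]]
) -> List[float]:
    """SHAP values (in logit space) for a linear model, assuming independent features.

    phi_i = w_i * (x_i - mean of feature i over the background data)
    """
    n = len(background)
    means = [sum(row[i] for row in background) / n for i in range(len(weights))]
    return [w * (xi - m) for w, xi, m in zip(weights, x, means)]


# Toy example with three binary (one-hot style) features; values are invented.
toy_weights = [2.0, -1.0, 0.5]
toy_background = [[1, 0, 0], [0, 1, 0], [1, 0, 1], [0, 1, 1]]
toy_sample = [1, 0, 1]

shap_values = linear_shap(toy_weights, toy_sample, toy_background)
# shap_values == [1.0, 0.5, 0.25]
```

The values sum to the difference between the sample's logit and the average logit over the background data, which is why weight analysis and SHAP can rank features differently: SHAP also accounts for how often each feature is active.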

✅ Key Findings

  • Confirmed the strong influence of a single nucleotide (‘T’) at the cut site (position 17),
    though its impact on model predictions differed from the paper’s biochemical interpretation.
  • Questioned the reported importance of specific dinucleotides (e.g., TG, CG, GA),
    as their effects were not consistently supported across interpretability analyses.
  • Concluded that the model relies more heavily on single-nucleotide features than originally proposed,
    offering a more nuanced and critical understanding of the model’s decision-making process.
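
To make the distinction between single-nucleotide and dinucleotide features concrete, the sketch below one-hot encodes both feature types for a 20-nt target sequence. The feature naming (e.g. `T17`, `TG17`) and 1-based positions are assumptions chosen for illustration and may not match the original model's exact encoding.

```python
from itertools import product
from typing import Dict

NUCLEOTIDES = "ACGT"
DINUCLEOTIDES = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]


def encode_sequence(seq: str) -> Dict[str, int]:
    """One-hot encode single nucleotides and adjacent dinucleotides by position."""
    features: Dict[str, int] = {}
    # Single-nucleotide features: 4 per position.
    for pos, nt in enumerate(seq, start=1):
        for candidate in NUCLEOTIDES:
            features[f"{candidate}{pos}"] = int(nt == candidate)
    # Dinucleotide features: 16 per pair of adjacent positions.
    for pos in range(1, len(seq)):
        pair = seq[pos - 1 : pos + 1]
        for candidate in DINUCLEOTIDES:
            features[f"{candidate}{pos}"] = int(pair == candidate)
    return features


# Hypothetical 20-nt sequence with 'T' at position 17, followed by 'G'.
example_features = encode_sequence("ACGTACGTACGTACGTTGCA")
n_features = len(example_features)  # 20 * 4 + 19 * 16 = 384
```

With an encoding like this, the weight or SHAP value attached to `T17` can be compared directly with those of dinucleotide features such as `TG17` that overlap the same position.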