Publications
(*) denotes equal contribution
2024
- [arXiv] EnsLoss: Stochastic Calibrated Loss Ensembles for Preventing Overfitting in Classification. Ben Dai. arXiv, 2024.
Empirical risk minimization (ERM) with a computationally feasible surrogate loss is a widely accepted approach for classification. Notably, the convexity and calibration (CC) properties of a loss function ensure consistency of ERM in maximizing accuracy, thereby offering a wide range of options for surrogate losses. In this article, we propose a novel ensemble method, namely EnsLoss, which extends the ensemble learning concept to combine loss functions within the ERM framework. A key feature of our method is that it preserves the "legitimacy" of the combined losses, i.e., ensures the CC properties. Specifically, we first transform the CC conditions on losses into conditions on loss-derivatives, thereby bypassing the need for explicit loss functions and directly generating calibrated loss-derivatives. Inspired by Dropout, EnsLoss then enables loss ensembles within a single training process via doubly stochastic gradient descent (i.e., random batch samples and random calibrated loss-derivatives). We theoretically establish the statistical consistency of our approach and provide insights into its benefits. The numerical effectiveness of EnsLoss compared to fixed-loss methods is demonstrated through experiments on 14 OpenML tabular datasets and 46 image datasets with various deep learning architectures. The Python repository and source code are available on GitHub.
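The core idea, namely that each SGD step pairs a random mini-batch with a randomly drawn calibrated loss-derivative, can be conveyed with a toy sketch. This is a minimal illustration rather than the EnsLoss implementation: EnsLoss generates calibrated derivatives directly, while the sketch below merely samples from a fixed list of derivatives of known CC losses for a linear classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Derivatives phi'(z) of a few convex, classification-calibrated losses at
# margin z = y * f(x); each is non-decreasing and negative at z = 0.
CALIBRATED_DERIVS = [
    lambda z: -1.0 / (1.0 + np.exp(z)),            # logistic
    lambda z: -np.exp(-np.clip(z, -30.0, 30.0)),   # exponential
    lambda z: -2.0 * np.maximum(1.0 - z, 0.0),     # squared hinge
]

def doubly_stochastic_sgd(X, y, steps=500, lr=0.1, batch=32):
    """Each step draws a random mini-batch AND a random calibrated
    loss-derivative; y is assumed to take values in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        idx = rng.choice(n, size=min(batch, n), replace=False)
        dphi = CALIBRATED_DERIVS[rng.integers(len(CALIBRATED_DERIVS))]
        z = y[idx] * (X[idx] @ w)                  # batch margins
        w -= lr * ((dphi(z) * y[idx]) @ X[idx]) / len(idx)
    return w
```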
- [AoAS] A bootstrap model comparison test for identifying genes with context-specific patterns of genetic regulation. Mykhaylo M. Malakhov, Ben Dai, Xiaotong Shen, and Wei Pan. Annals of Applied Statistics, 2024.
Understanding how genetic variation affects gene expression is essential for a complete picture of the functional pathways that give rise to complex traits. Although numerous studies have established that many genes are differentially expressed in distinct human tissues and cell types, no tools exist for identifying the genes whose expression is differentially regulated. Here we introduce DRAB (Differential Regulation Analysis by Bootstrapping), a gene-based method for testing whether patterns of genetic regulation are significantly different between tissues or other biological contexts. DRAB first leverages the elastic net to learn context-specific models of local genetic regulation and then applies a novel bootstrap-based model comparison test to check their equivalency. Unlike previous model comparison tests, our proposed approach can determine whether population-level models have equal predictive performance by accounting for the variability of feature selection and model training. We validated DRAB on mRNA expression data from a variety of human tissues in the Genotype-Tissue Expression (GTEx) Project. DRAB yielded biologically reasonable results and had sufficient power to detect genes with tissue-specific regulatory profiles while effectively controlling false positives. By providing a framework that facilitates the prioritization of differentially regulated genes, our study enables future discoveries on the genetic architecture of molecular phenotypes.
- [CLeaR] Inference of Nonlinear Causal Effects with Application to TWAS with GWAS Summary Data. Ben Dai*, Chunlin Li*, Haoran Xue, Wei Pan, and Xiaotong Shen. In Proceedings of the Third Conference on Causal Learning and Reasoning, PMLR, 2024.
Large-scale genome-wide association studies (GWAS) have offered an exciting opportunity to discover putative causal genes or risk factors associated with diseases by using SNPs as instrumental variables (IVs). However, conventional approaches assume linear causal relations partly for simplicity and partly for the availability of GWAS summary data. In this work, we propose a novel model for transcriptome-wide association studies (TWAS) to incorporate nonlinear relationships across IVs, an exposure/gene, and an outcome, which is robust against violations of the valid IV assumptions, permits the use of GWAS summary data, and covers two-stage least squares as a special case. We decouple the estimation of a marginal causal effect and a nonlinear transformation, where the former is estimated via sliced inverse regression and a sparse instrumental variable regression, and the latter is estimated by a ratio-adjusted inverse regression. On this ground, we propose an inferential procedure. An application of the proposed method to the ADNI gene expression data and the IGAP GWAS summary data identifies 18 causal genes associated with Alzheimer’s disease, including APOE and TOMM40, in addition to 7 other genes missed by two-stage least squares considering only linear relationships. Our findings suggest that nonlinear modeling is required to unleash the power of IV regression for identifying potentially nonlinear gene-trait associations. Accompanying this paper is our Python library nonlinear-causal that implements the proposed method.
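The first stage of the decoupling, estimating the marginal causal-effect direction by sliced inverse regression, can be sketched compactly. The NumPy snippet below implements textbook SIR for the leading direction only; the paper's sparse instrumental variable regression, ratio-adjusted inverse regression, and summary-data machinery are all omitted, and this is not the nonlinear-causal API.

```python
import numpy as np

def sir_first_direction(X, y, n_slices=10):
    """Textbook sliced inverse regression (leading direction only)."""
    n, d = X.shape
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    # Whitening transform cov^{-1/2} via the eigendecomposition of cov.
    evals, evecs = np.linalg.eigh(cov)
    evals = np.clip(evals, 1e-12, None)
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = Xc @ W
    # Slice observations by the sorted response and average Z within slices.
    order = np.argsort(y)
    M = np.zeros((d, d))
    for idx in np.array_split(order, n_slices):
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)
    # Leading eigenvector of the sliced covariance, mapped back to X-scale.
    w = np.linalg.eigh(M)[1][:, -1]
    beta = W @ w
    return beta / np.linalg.norm(beta)
```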
2023
- [NeurIPS] ReHLine: Regularized Composite ReLU-ReHU Loss Minimization with Linear Computation and Linear Convergence. Ben Dai* and Yixuan Qiu*. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
Empirical risk minimization (ERM) is a crucial framework that offers a general approach to handling a broad range of machine learning tasks. In this paper, we propose a novel algorithm, called ReHLine, for minimizing a set of regularized ERMs with convex piecewise linear-quadratic loss functions and optional linear constraints. The proposed algorithm can effectively handle diverse combinations of loss functions, regularizations, and constraints, making it particularly well-suited for complex domain-specific problems. Examples of such problems include FairSVM, elastic net regularized quantile regression, and Huber minimization. In addition, ReHLine enjoys a provable linear convergence rate and exhibits a per-iteration computational complexity that scales linearly with the sample size. The algorithm is implemented with both Python and R interfaces, and its performance is benchmarked on various tasks and datasets. Our experimental results demonstrate that ReHLine significantly surpasses generic optimization solvers in terms of computational efficiency on large-scale datasets. Moreover, it also outperforms specialized solvers such as liblinear for SVM, hqreg for Huber minimization, and lightning (SAGA, SAG, SDCA, SVRG) for smooth SVM, exhibiting exceptional flexibility and efficiency.
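The loss class in the title admits a concrete form: a composite ReLU-ReHU loss decomposes into ReLU and ReHU building blocks. Below is a minimal NumPy sketch of that decomposition; the parameter names U, V, S, T, Tau are illustrative and this is not the rehline package API.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def rehu(z, tau):
    # ReHU_tau(z) = 0 if z <= 0; z^2/2 if 0 < z <= tau; tau*(z - tau/2) if z > tau
    z = np.asarray(z, dtype=float)
    return np.where(z <= 0.0, 0.0,
                    np.where(z <= tau, 0.5 * z ** 2, tau * (z - 0.5 * tau)))

def composite_loss(z, U=(), V=(), S=(), T=(), Tau=()):
    """Composite ReLU-ReHU loss:
    L(z) = sum_l ReLU(u_l * z + v_l) + sum_h ReHU_{tau_h}(s_h * z + t_h)."""
    z = np.asarray(z, dtype=float)
    total = sum((relu(u * z + v) for u, v in zip(U, V)), np.zeros_like(z))
    total = total + sum((rehu(s * z + t, tau) for s, t, tau in zip(S, T, Tau)),
                        np.zeros_like(z))
    return total

margins = np.linspace(-2, 2, 5)
print(composite_loss(margins, U=[-1.0], V=[1.0]))          # hinge: ReLU(1 - z)
print(composite_loss(margins, S=[-1.0], T=[1.0], Tau=[1.0]))  # smoothed hinge via one ReHU piece
```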
- [JMLR] RankSEG: A Consistent Ranking-based Framework for Segmentation. Ben Dai and Chunlin Li. Journal of Machine Learning Research, 2023.
Segmentation has emerged as a fundamental field of computer vision and natural language processing, which assigns a label to every pixel/feature to extract regions of interest from an image/text. To evaluate the performance of segmentation, the Dice and IoU metrics are used to measure the degree of overlap between the ground truth and the predicted segmentation. In this paper, we establish a theoretical foundation of segmentation with respect to the Dice/IoU metrics, including the Bayes rule and Dice-/IoU-calibration, analogous to classification-calibration or Fisher consistency in classification. We prove that the existing thresholding-based framework with most operating losses is not consistent with respect to the Dice/IoU metrics, and thus may lead to a suboptimal solution. To address this pitfall, we propose a novel consistent ranking-based framework, namely RankDice/RankIoU, inspired by plug-in rules of the Bayes segmentation rule. Three numerical algorithms with GPU parallel execution are developed to implement the proposed framework in large-scale and high-dimensional segmentation. We study the statistical properties of the proposed framework: we show that it is Dice-/IoU-calibrated and provide its excess risk bounds and rate of convergence. The numerical effectiveness of RankDice/mRankDice is demonstrated in various simulated examples and the fine-annotated CityScapes, Pascal VOC and Kvasir-SEG datasets with state-of-the-art deep learning architectures. The Python module and source code are available on GitHub (https://github.com/statmlben/rankseg).
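The plug-in idea behind RankDice can be conveyed with a simplified sketch: rank pixels by predicted probability, then choose the segmentation size that maximizes an approximation of the expected Dice. The sketch below uses a crude mean-field approximation of that expectation; RankDice evaluates it exactly via Poisson-binomial arguments, so this is an illustration only.

```python
import numpy as np

def rankdice_simplified(probs):
    """Rank pixels by predicted probability p_i, then pick the top-tau set
    maximizing the mean-field approximation
        E[Dice] ~ 2 * sum_{i in top-tau} p_i / (sum_i p_i + tau)."""
    flat = probs.ravel()
    order = np.argsort(flat)[::-1]              # pixels, most probable first
    p_sorted = flat[order]
    taus = np.arange(1, flat.size + 1)
    approx_dice = 2.0 * np.cumsum(p_sorted) / (flat.sum() + taus)
    tau_star = int(taus[np.argmax(approx_dice)])
    mask = np.zeros(flat.size, dtype=bool)
    mask[order[:tau_star]] = True
    return mask.reshape(probs.shape), tau_star
```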
- [AoAS] Data-adaptive discriminative feature localization with statistically guaranteed interpretation. Ben Dai, Xiaotong Shen, Lin Yee Chen, Chunlin Li, and Wei Pan. The Annals of Applied Statistics, 2023.
In explainable artificial intelligence, discriminative feature localization is critical to reveal a black-box model's decision-making process from raw data to prediction. In this article, we use two real datasets, the MNIST handwritten digits and MIT-BIH Electrocardiogram (ECG) signals, to motivate key characteristics of discriminative features, namely adaptiveness, predictive importance and effectiveness. Then, we develop a localization framework based on adversarial attacks to effectively localize discriminative features. In contrast to existing heuristic methods, we also provide a statistically guaranteed interpretability of the localized features by measuring a generalized partial R2. We apply the proposed method to the MNIST dataset and the MIT-BIH dataset with a convolutional auto-encoder. In the first, the compact image regions localized by the proposed method are visually appealing. Similarly, in the second, the identified ECG features are biologically plausible and consistent with cardiac electrophysiological principles while locating subtle anomalies in a QRS complex that may not be discernible by the naked eye. Overall, the proposed method compares favorably with state-of-the-art competitors. Accompanying this paper is a Python library, dnn-locate, that implements the proposed approach.
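A generic adversarial-perturbation sketch conveys the gist: learn a norm-constrained perturbation that maximally degrades the model's fit, then read off the features with the largest perturbation magnitudes. This is a toy PyTorch sketch, not the dnn-locate API, and the norm budget below is only a stand-in for the paper's generalized partial R2 control.

```python
import torch

def localize(model, x, y, budget=0.1, steps=200, lr=0.05):
    """Learn a norm-constrained perturbation delta that maximally degrades the
    classifier's fit; features with large |delta| are flagged as discriminative.
    `model` is any torch.nn.Module classifier, `y` integer class labels."""
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    limit = budget * x.norm()                        # crude stand-in for an R2 budget
    for _ in range(steps):
        opt.zero_grad()
        (-loss_fn(model(x + delta), y)).backward()   # gradient ascent on the loss
        opt.step()
        with torch.no_grad():                        # project back onto the budget
            norm = delta.norm()
            if norm > limit:
                delta.mul_(limit / norm)
    return delta.abs().detach()
```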
- [CVPR] ImbaGCD: Imbalanced Generalized Category Discovery. Ziyun Li, Ben Dai, Furkan Simsek, Christoph Meinel, and Haojin Yang. In The 2nd Workshop on Computer Vision in the Wild at CVPR, 2023.
2022
- [TNNLS] Significance tests of feature relevance for a black-box learner. Ben Dai, Xiaotong Shen, and Wei Pan. IEEE Transactions on Neural Networks and Learning Systems, 2022.
An exciting recent development is the uptake of deep neural networks in many scientific fields, where the main objective is outcome prediction with a black-box nature. Significance testing is promising to address the black-box issue and explore novel scientific insights and interpretations of the decision-making process based on a deep learning model. However, testing for a neural network poses a challenge because of its black-box nature and the unknown limiting distributions of parameter estimates, while existing methods require strong assumptions or excessive computation. In this article, we derive one-split and two-split tests that relax the assumptions and computational complexity of existing black-box tests and extend to examine the significance of a collection of features of interest in a dataset of possibly a complex type, such as an image. The one-split test estimates and evaluates a black-box model based on estimation and inference subsets through sample splitting and data perturbation. The two-split test further splits the inference subset into two but requires no perturbation. Also, we develop their combined versions by aggregating the p-values based on repeated sample splitting. By deflating the bias-sd-ratio, we establish asymptotic null distributions of the test statistics and consistency in terms of Type II error. Numerically, we demonstrate the utility of the proposed tests on seven simulated examples and six real datasets. Accompanying this article is our Python library dnn-inference (https://dnn-inference.readthedocs.io/en/latest/) that implements the proposed tests.
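The skeleton of the one-split test can be sketched with off-the-shelf tools: fit the model with and without the hypothesized features on an estimation subset, compare per-sample losses on a held-out inference subset, and form a z-statistic. The sketch below omits the paper's data perturbation and bias-sd-ratio deflation, and it is not the dnn-inference API.

```python
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def one_split_test_sketch(X, y, feats, seed=0):
    """Compare per-sample log-losses of models fit with and without the
    features of interest; assumes integer class labels 0..K-1."""
    X_est, X_inf, y_est, y_inf = train_test_split(
        X, y, test_size=0.4, random_state=seed)
    X_est0, X_inf0 = X_est.copy(), X_inf.copy()
    X_est0[:, feats] = 0.0                      # mask the features under the null
    X_inf0[:, feats] = 0.0
    full = MLPClassifier(max_iter=500, random_state=seed).fit(X_est, y_est)
    null = MLPClassifier(max_iter=500, random_state=seed).fit(X_est0, y_est)
    rows, eps = np.arange(len(y_inf)), 1e-12
    ll_full = -np.log(np.clip(full.predict_proba(X_inf)[rows, y_inf], eps, 1.0))
    ll_null = -np.log(np.clip(null.predict_proba(X_inf0)[rows, y_inf], eps, 1.0))
    diff = ll_null - ll_full                    # positive if the features matter
    z = diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))
    return float(1.0 - stats.norm.cdf(z))       # one-sided p-value
```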
- [AP] Revealing complex optical phenomena through vectorial metrics. Chao He, Jintao Chang, Patrick S. Salter, Yuanxing Shen, Ben Dai, Pengcheng Li, Yihan Jin, Samlan Chandran Thodika, Mengmeng Li, Aziz Tariq, and 1 more author. Advanced Photonics, 2022.
- [JASA] Coupled Generation. Ben Dai, Xiaotong Shen, and Wing Wong. Journal of the American Statistical Association, 2022.
Instance generation creates representative examples to interpret a learning model, as in regression and classification. For example, representative sentences of a topic of interest describe the topic specifically for sentence categorization. In such a situation, a large number of unlabeled observations may be available in addition to labeled data; for example, many unclassified text corpora (unlabeled instances) are available with only a few classified sentences (labeled instances). In this article, we introduce a novel generative method, called a coupled generator, producing instances given a specific learning outcome, based on indirect and direct generators. The indirect generator uses the inverse principle to yield the corresponding inverse probability, enabling instance generation by leveraging unlabeled data. The direct generator learns the distribution of an instance given its learning outcome. Then, the coupled generator selects the better of the indirect and direct generators, and is designed to enjoy the benefits of both and deliver higher generation accuracy. For sentence generation given a topic, we develop an embedding-based regression/classification in conjunction with an unconditional recurrent neural network for the indirect generator, whereas a conditional recurrent neural network is natural for the corresponding direct generator. Moreover, we derive finite-sample generation error bounds for the indirect and direct generators to reveal the generative aspects of both methods, thus explaining the benefits of the coupled generator. Finally, we apply the proposed methods to a real benchmark of abstract classification and demonstrate that the coupled generator composes reasonably good sentences from a dictionary to describe a specific topic of interest. Supplementary materials for this article are available online.
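The inverse principle behind the indirect generator, p(x | y) proportional to p(y | x) p(x), admits a very small sketch: score an unlabeled pool with a trained classifier and sample representatives in proportion to the predicted class probability. This toy sketch assumes a hypothetical clf_proba callable and is not the paper's recurrent-network construction.

```python
import numpy as np

def indirect_generate(pool, clf_proba, topic, n=5, seed=0):
    """Score an unlabeled pool by p(y = topic | x) and sample representatives
    in proportion to it, with the pool acting as draws from the marginal p(x).
    `clf_proba(x)` is assumed to return a vector of class probabilities."""
    rng = np.random.default_rng(seed)
    scores = np.array([clf_proba(x)[topic] for x in pool])
    probs = scores / scores.sum()
    idx = rng.choice(len(pool), size=min(n, len(pool)), replace=False, p=probs)
    return [pool[i] for i in idx]
```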
- [JASA] Embedding learning. Ben Dai, Xiaotong Shen, and Junhui Wang. Journal of the American Statistical Association, 2022.
Numerical embedding has become a standard technique for processing and analyzing unstructured data that cannot be expressed in a predefined fashion. It stores the main characteristics of data by mapping it onto a numerical vector. An embedding is often unsupervised and constructed by transfer learning from large-scale unannotated data. Given an embedding, a downstream learning method, referred to as a two-stage method, is applicable to unstructured data. In this article, we introduce a novel framework of embedding learning to deliver a higher learning accuracy than the two-stage method while identifying an optimal learning-adaptive embedding. In particular, we propose a concept of U-minimal sufficient learning-adaptive embeddings, based on which we seek an optimal one to maximize the learning accuracy subject to an embedding constraint. Moreover, when specializing the general framework to classification, we derive a graph embedding classifier based on a hyperlink tensor representing multiple hypergraphs, directed or undirected, characterizing multi-way relations of unstructured data. Numerically, we design algorithms based on blockwise coordinate descent and projected gradient descent to implement linear and feed-forward neural network classifiers, respectively. Theoretically, we establish a learning theory to quantify the generalization error of the proposed method. Moreover, we show, in linear regression, that the one-hot encoder is preferable among two-stage methods, yet its dimension restriction hinders its predictive performance. For a graph embedding classifier, the generalization error matches up to the standard fast rate or the parametric rate for linear or nonlinear classification. Finally, we demonstrate the utility of the classifiers on two benchmarks in grammatical classification and sentiment analysis. Supplementary materials for this article are available online.
2021
- [JASA] Scalable collaborative ranking for personalized prediction. Ben Dai, Xiaotong Shen, Junhui Wang, and Annie Qu. Journal of the American Statistical Association, 2021.
Personalized prediction presents an important yet challenging task, which predicts user-specific preferences on a large number of items given limited information. It is often modeled via recommender systems focusing on ordinal or continuous ratings, as in collaborative filtering and content-based filtering. In this article, we propose a new collaborative ranking system to predict the most-preferred items for each user given search queries. Particularly, we propose a ψ-ranker based on ranking functions incorporating information on users, items, and search queries through latent factor models. Moreover, we show that the proposed nonconvex surrogate pairwise ψ-loss performs well under four popular bipartite ranking losses, namely the sum loss, pairwise zero-one loss, discounted cumulative gain, and mean average precision. We develop a parallel computing strategy to optimize the intractable loss of two levels of nonconvex components through difference of convex programming and block successive upper-bound minimization. Theoretically, we establish a probabilistic error bound for the ψ-ranker and show that its ranking error has a sharp rate of convergence in the general framework of bipartite ranking, even when the dimension of the model parameters diverges with the sample size. Consequently, this result also indicates that the ψ-ranker performs better than two major approaches in bipartite ranking: pairwise ranking and scoring. Finally, we demonstrate the utility of the ψ-ranker by comparing it with some strong competitors in the literature through simulated examples as well as Expedia booking data. Supplementary materials for this article are available online.
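Two of the four bipartite ranking criteria named above, discounted cumulative gain and (mean) average precision, have short standard reference implementations; the sketch below gives only those textbook definitions and is not the ψ-ranker itself.

```python
import numpy as np

def dcg_at_k(rel, scores, k=10):
    """Discounted cumulative gain of the top-k items ranked by score."""
    top = np.argsort(scores)[::-1][:k]
    gains = 2.0 ** rel[top] - 1.0
    discounts = np.log2(np.arange(2, len(top) + 2))
    return float(np.sum(gains / discounts))

def average_precision(rel, scores):
    """Average precision for one user/query with binary relevance labels."""
    order = np.argsort(scores)[::-1]
    hits = rel[order].astype(bool)
    if not hits.any():
        return 0.0
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision[hits].mean())
```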
- [EJS] Two-level monotonic multistage recommender systems. Ben Dai, Xiaotong Shen, and Wei Pan. Electronic Journal of Statistics, 2021.
- [EJS] Sentiment analysis with covariate-assisted word embeddings. Shirong Xu, Ben Dai, and Junhui Wang. Electronic Journal of Statistics, 2021.
2019
- [EJS] Query-dependent ranking and its asymptotic properties. Ben Dai and Junhui Wang. Electronic Journal of Statistics, 2019.
- [NC] Complex vectorial optics through gradient index lens cascades. Chao He, Jintao Chang, Qi Hu, Jingyu Wang, Jacopo Antonello, Honghui He, Shaoxiong Liu, Jianyu Lin, Ben Dai, Daniel S. Elson, and 1 more author. Nature Communications, 2019.