Grant | Myoungkyu Song

Intellectual Merit

My contributions centered on advancing ML-driven software assurance techniques to improve the reliability, transparency, and maintainability of data-intensive scientific software. The work addressed fundamental challenges in ML-based systems, including debugging black-box models, validating data-driven behavior, and supporting developers in understanding evolving ML programs and large text corpora.

Specifically, this project led to:

ML-based debugging and validation frameworks for identifying data and model anomalies in ML programs.
Tool-supported software assurance techniques that integrate program analysis, topic modeling, and human-in-the-loop visualization.
Scalable analysis of large bioengineering text corpora, enabling systematic document retrieval, topic discovery, and knowledge exploration.
Deep learning–assisted code review and change inspection tools that detect systematic and inconsistent changes in evolving software.

These contributions form a coherent software engineering (SE) foundation for trustworthy ML systems, directly supporting the center’s data-driven bioengineering mission while advancing generalizable SE research.

Broader Impacts

The project had strong workforce and educational impacts. I contributed to:

Workforce development and training, including a hands-on research workshop involving 10 undergraduate and 13 graduate students.
Mentoring students at the intersection of software engineering, machine learning, and data science, preparing them for careers in ML-enabled scientific software development.
Dissemination of open research tools and IDE-based prototypes that lower barriers for students and researchers to adopt ML-based software assurance techniques.

Publications Resulting from the Project

The SE and ML assurance thrust produced peer-reviewed conference papers and a journal article, including work on:

Topic modeling and visualization for bioengineering corpora
Learning-to-rank for scientific document retrieval
Topic mining for bug report analysis
Deep learning–based code change inspection
Debugging and validation of ML programs

This completed project provides a strong foundation for extending ML-based techniques to secure and trustworthy software development. The tools, methods, and insights developed for software assurance in ML-driven scientific systems directly inform future work on secure software engineering, AI-assisted programming, and dependable ML-enabled infrastructure.

Tool Support for Improving Software Quality in Machine Learning Programs (Information 2023)

Kwok Sun Cheng, Pei-Chi Huang, Tae-Hyuk Ahn, and Myoungkyu Song

This paper introduces MLVAL, an interactive quality validation and debugging framework designed to improve the software quality of machine learning (ML) programs. MLVAL addresses challenges in validating data-driven and black-box ML systems by enabling developers to inspect training data, learned features, and model evolution through an Eclipse IDE plug-in. The approach integrates human-in-the-loop visualization, data version diffing, and model behavior comparison to diagnose ML-specific bugs and anomalies. Evaluation on 23,500 bioengineering documents demonstrates improved model reliability, transparency, and maintainability.
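As an illustration of what data version diffing can look like in practice, the following minimal sketch compares two versions of a labeled training set. It is not MLVAL's implementation: the function name diff_versions, the {doc_id: label} dictionary representation, and the example identifiers are assumptions made for illustration only.

```python
def diff_versions(old, new):
    """Compare two dataset versions given as {doc_id: label} dictionaries.

    Hypothetical helper for illustration; not part of MLVAL.
    """
    # Documents present only in the newer version.
    added = sorted(new.keys() - old.keys())
    # Documents dropped from the older version.
    removed = sorted(old.keys() - new.keys())
    # Documents present in both versions whose label changed.
    relabeled = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return {"added": added, "removed": removed, "relabeled": relabeled}


# Toy example with made-up document identifiers and labels.
v1 = {"doc1": "relevant", "doc2": "irrelevant", "doc3": "relevant"}
v2 = {"doc1": "relevant", "doc2": "relevant", "doc4": "irrelevant"}
report = diff_versions(v1, v2)
```

A report of this kind gives a developer a starting point for asking whether a change in model behavior between two training runs traces back to added, removed, or relabeled documents.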

Debugging Support for Machine Learning Applications in Bioengineering Text Corpora (COMPSAC 2022)

Kwok Sun Cheng, Tae-Hyuk Ahn, and Myoungkyu Song

This paper presents MLDBUG, an interactive debugging framework for machine learning applications that helps developers identify data and model anomalies in black-box ML systems. MLDBUG supports feature inspection, data version diffing, and model behavior comparison through a human-in-the-loop visualization environment. Implemented as an Eclipse IDE plug-in, the approach enables effective model tuning and bug diagnosis. Experiments on 23,500 bioengineering documents show that MLDBUG improves debugging efficiency and enhances ML software reliability.

Learning to Rank Relevant Documents for Information Retrieval in Bioengineering Text Corpora (COMPSAC 2021)

Kwok Sun Cheng and Myoungkyu Song

This paper introduces LTREXPLORER, a learning-to-rank–based information retrieval framework for efficiently identifying relevant documents in large-scale bioengineering text corpora. The approach combines BM25, vector space models, and citation-aware features extracted from document sections and reference networks. Using 29 domain-specific features, LTREXPLORER achieves up to 97% recall and approximately 85% ranking accuracy on a corpus of 23,500 documents, supporting scalable literature exploration and improving research productivity.
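For readers unfamiliar with BM25, one of the ranking signals named above, the sketch below shows the standard Okapi BM25 scoring formula on a toy corpus. It is not LTREXPLORER's implementation: the function names, the parameter defaults k1=1.5 and b=0.75, and the example documents are assumptions chosen for illustration, and the other features the paper describes are not modeled here.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in docs against a tokenized query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed inverse document frequency (always non-negative).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank(query, docs):
    """Return document indices ordered from highest to lowest BM25 score."""
    scores = bm25_scores(query, docs)
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Toy corpus with made-up tokens.
docs = [
    ["tissue", "scaffold", "design"],
    ["protein", "folding", "simulation"],
    ["scaffold", "tissue", "tissue", "engineering"],
]
query = ["tissue", "scaffold"]
scores = bm25_scores(query, docs)
order = rank(query, docs)
```

In a learning-to-rank setting, a score like this would typically serve as one input feature among many rather than as the final ranking.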

Analyzing Bug Reports by Topic Mining in Software Evolution (COMPSAC 2021)

Uy Nguyen, Kwok Sun Cheng, Samuel Sungmin Cho, and Myoungkyu Song

This paper presents BUGEXPLORER, a topic mining–based framework for analyzing bug reports during software evolution. Using LDA, Hierarchical LDA, and topical phrase mining, the approach extracts latent semantic topics from noisy bug report text. BUGEXPLORER supports bug triage, similarity analysis, and reviewer recommendation. Evaluation on five large open-source projects shows improved understanding of defect trends and topic evolution.

Tool Support for Code Change Inspection with Deep Learning in Evolving Software (EIT 2020)

Krishna Teja Ayinala, Kwok Sun Cheng, Kwangsung Oh, and Myoungkyu Song

This paper introduces SIL (Similar Changes Inspection with Deep Learning), a code review tool that summarizes and inspects systematic and recurring code changes in evolving software systems. SIL combines AST-based edit scripts, data and control dependence analysis, and a deep learning classifier trained on four clone types mined from 25,000 open-source programs. Implemented as an Eclipse plug-in, SIL detects inconsistent or missing changes, reducing review effort and improving code review accuracy.

TopExplorer: Tool Support for Extracting and Visualizing Topic Models in Bioengineering Text Corpora (EIT 2020)

Kwok Sun Cheng, Zhipeng Wang, Pei-Chi Huang, Parvathi Chundi, and Myoungkyu Song

This paper presents TopExplorer, an interactive topic modeling and visualization tool for exploring large-scale bioengineering document collections. TopExplorer integrates LDA, Hierarchical LDA, and phrase mining to uncover latent thematic structures and relationships across documents. The tool provides interactive visual analytics, including topic distributions, hierarchical topic trees, and document-level views. Evaluation on 600 bioengineering articles shows improved knowledge discovery, trend analysis, and decision-making.