Survival analysis and prediction of lung cancer in patients based on clinical and image features using machine learning

Date

2023-01-15

Publisher

Laurentian University Library & Archives

Abstract

Lung cancer develops in lung tissues, most commonly in the cells that line the airways. It is the leading cause of death from cancer in both men and women. Estimating the prevalence of lung cancer in the coming years requires diagnosing it in its early stages. This thesis proposes a reliable diagnosis of patients with lung cancer. The goal of this research is to identify, based on p-values, the important variables affecting lung cancer using both image features and clinical data, with a focus on quality analysis. Further, to enable efficient early diagnosis, this work proposes classifying patients' images for cancer using a Convolutional Neural Network (CNN). The thesis discusses the dataset, data pre-processing steps, survival rate risk analysis, classification, and performance evaluation of the process. This study used two kinds of data, clinical and image data. The Genomic Data Commons (GDC) Data Portal and The Cancer Imaging Archive (TCIA) were used as the data sources. The Random Forest regression estimation method was used to fill in the missing values. It first imputes all missing data with the mean/mode, then, for each variable with missing values, fits a random forest on the observed part and predicts the missing part. Three models are used to test the significance of variables on cancer survival rates: Kaplan-Meier (KM), Cox Proportional Hazards (CPH), and Accelerated Failure Time (AFT). The analysis took into account three types of data: clinical only, image only, and combined clinical and image data. All three models were applied effectively, and the results revealed the most robust data type and the crucial variable to focus on in further experimentation. For classification, a CNN with low computational cost and time overhead is used.
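The imputation procedure described above (mean/mode initialisation, then a random forest fitted per variable on the observed rows) can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's implementation: it handles numeric columns only (so the mode step for categorical variables is omitted), assumes every column has at least one observed value, and uses scikit-learn's `RandomForestRegressor`. The function name `rf_impute`, the parameter `n_iter`, and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def rf_impute(X, n_iter=3, random_state=0):
    """Fill NaNs in a numeric 2-D array with random forest predictions."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()

    # Step 1: initial guess, the column mean of the observed entries.
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])

    # Step 2: for each column with gaps, fit a forest on the rows where the
    # column was observed and predict the rows where it was missing.
    for _ in range(n_iter):
        for j in np.where(missing.any(axis=0))[0]:
            obs = ~missing[:, j]
            others = np.delete(filled, j, axis=1)
            model = RandomForestRegressor(n_estimators=50, random_state=random_state)
            model.fit(others[obs], filled[obs, j])
            filled[~obs, j] = model.predict(others[~obs])
    return filled


# Synthetic (hypothetical) data: column 1 depends on column 0.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
data = np.column_stack([a, 2 * a + rng.normal(scale=0.1, size=200), rng.normal(size=200)])
data_missing = data.copy()
data_missing[::10, 1] = np.nan  # blank out every tenth value of column 1

completed = rf_impute(data_missing)
```

Repeating the per-variable fit for a few passes lets later predictions benefit from earlier ones; the number of passes here is an arbitrary choice.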
The output of the statistical models shows that image data is the most robust of the three data types, as it is the least likely to produce false results. Image data, which is common in clinical data collection, is less prone to human error. Because of this robustness, image feature data was preferred over the clinical and combined data in the next step, the classification of images for cancer prediction. The CNN results were compared on accuracy with two ensemble approaches, Random Forest (RF) and XGBoost. The CNN achieved an accuracy of 99% in image classification, higher than the 95.83% achieved by both RF and XGBoost. As a result, the CNN model can be applied to new Computerized Tomography (CT) scan images for lung cancer diagnosis, both to support further research and to assist clinicians.
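For readers unfamiliar with the Kaplan-Meier model named in the abstract, the estimator multiplies, at each distinct event time, the fraction of at-risk patients who survive that time: S(t) is the product over event times t_i <= t of (1 - d_i / n_i), where d_i is the number of deaths and n_i the number at risk. A minimal pure-Python sketch follows; the function name `kaplan_meier` and the follow-up data are hypothetical and are not taken from the thesis.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time per patient
    events:    1 if death was observed at that time, 0 if censored
    """
    pairs = sorted(zip(durations, events))
    n_at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        removed = 0
        # Group all patients who share the same follow-up time.
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set.
        n_at_risk -= removed
    return curve


# Hypothetical follow-up times in months; 1 = death observed, 0 = censored.
times = [5, 8, 8, 12, 15, 20]
observed = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, observed)
```

Censored patients do not lower the curve directly, but they shrink the risk set used at later event times, which is how the estimator uses incomplete follow-up.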

Keywords

Lung Cancer Survival Analysis, Random Forest, XGBoost, Convolutional Neural Network, Genomic Data, The Cancer Imaging Archive, Kaplan-Meier, Cox Proportional Hazards Model, Accelerated Failure Time Model