Assignment 6: Data Science Insight No. 2 - Survival Analysis

Expiration has always been of interest to humans. Whether it concerns the spoiling of food, how long a customer might provide consistent profit to a company, a student’s probability of continued enrollment, the lifespan of a virus, or even the death of another being, the calculation of “time-to-event” (i.e., how long a subject under observation lasts before an event of interest occurs) can be found in virtually every realm of industry and development. This process is commonly known today as survival analysis, and its modern applications range from the mundane to the extraordinary. First developed in 1669 by Dutch mathematician Christiaan Huygens as a simple method of estimating the probability of surviving past a certain age, survival analysis has since evolved to map the complex relationship between any given subject of interest (often described mathematically through multiple quantified covariates) and the hazards that may threaten its continued existence over a specific period of time. Huygens originally represented his model of survivability through a basic retention curve, but today, statistical software and machine-learning programs can estimate and plot survival functions in fine detail. Rather than settling for simple illustrations of the proportion of subjects still alive at a given age, current forms of time-to-event analysis use parametric, non-parametric, and semi-parametric methods to estimate the probability of an observed subject’s retention of life, continued presence in a dataset, or continued engagement with a service, such as a mailing list or business subscription. Examples of these methods include Weibull models (parametric), Kaplan-Meier estimators (non-parametric), and Cox regression (semi-parametric), typically in combination with log-rank or chi-square tests to check whether survival differs significantly between groups.
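Most contemporary applications of survival analysis must also account for censoring, which occurs when a subject’s exact event time is unknown, for example because the study ended or the subject dropped out before the event was observed. Censoring can take any of three forms (right, left, or interval), and rather than discarding these incomplete records, survival methods retain the partial information they carry, such as the fact that a subject survived at least until the last observation. The same principle underlies machine-learning extensions of survival analysis, including survival support vector machines and neural networks. Additionally, a hazard function, which gives the instantaneous rate at which the event occurs at a given time among subjects who have survived up to that time, can be applied to examine the risk of death within an infinitesimally narrow time frame.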
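
To make the Kaplan-Meier estimator and the role of censoring more concrete, the following is a minimal sketch in plain Python, not a reference implementation. It treats a censored record as a subject whose event time is only known to exceed the last observation, so the subject still counts toward the group at risk until that point. The function name `kaplan_meier` and the subscription data are hypothetical and chosen purely for illustration.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t, S)] where S is the estimated probability of surviving past t.

    durations: how long each subject was followed
    observed:  True if the event happened at that time, False if censored
    """
    events = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)  # events and censorings both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk subjects who survive time t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # Censored subjects leave here without lowering the survival estimate.
        at_risk -= exits[t]
    return curve


# Hypothetical data: months until a customer cancels a subscription.
durations = [2, 3, 3, 5, 6, 8, 8, 10]
observed = [True, True, False, True, False, True, True, False]

for t, s in kaplan_meier(durations, observed):
    print(f"month {t}: estimated survival {s:.3f}")
```

In this sketch, the customers censored at months 3, 6, and 10 never reduce the estimate directly, but their departure shrinks the at-risk group, which makes each later cancellation weigh more heavily.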

Survival analysis can be understood succinctly through the survival function S(t) = P(T ≥ t), where P denotes probability, T is a random variable representing the time at which the event occurs, and t is a specific point in time. In other words, S(t) is the probability that a subject survives at least until time t. Describing the risk of the event over a very short time frame requires a hazard function, represented by the equation:

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}$$
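
As a rough numerical check on the idea of a hazard function, the sketch below approximates the conditional probability of the event in a very short window, P(t ≤ T < t + Δt | T ≥ t) / Δt, using only a survival function S(t) = P(T ≥ t), and compares it with the known closed-form hazard of a Weibull distribution. The shape and scale values are arbitrary assumptions chosen for illustration, and the helper names are hypothetical.

```python
import math


def weibull_survival(t, shape, scale):
    """S(t) = exp(-(t / scale) ** shape) for a Weibull-distributed event time."""
    return math.exp(-((t / scale) ** shape))


def weibull_hazard(t, shape, scale):
    """Closed-form Weibull hazard: (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def approx_hazard(survival, t, dt=1e-6):
    """Approximate P(t <= T < t + dt | T >= t) / dt with a small, finite dt."""
    s_now = survival(t)
    # For a continuous T, P(t <= T < t + dt) equals S(t) - S(t + dt).
    return (s_now - survival(t + dt)) / (dt * s_now)


shape, scale = 1.5, 10.0  # illustrative values, not taken from any study


def surv(t):
    return weibull_survival(t, shape, scale)


for t in (1.0, 5.0, 20.0):
    print(f"t={t}: approx={approx_hazard(surv, t):.6f}, "
          f"exact={weibull_hazard(t, shape, scale):.6f}")
```

With a shape value above 1, the hazard grows with t, so the risk of the event rises the longer a subject has already survived.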

(The variables in the hazard function have the same meaning as in the survival function outlined above.) Applied creatively, this concise function can generate powerful results. For example, during the early months of the coronavirus pandemic, a group of researchers used survival analysis to test the impacts of sex, age, disease history, and location on the average recovery rate of 1,182 COVID-19 patients. Aided by seven different machine-learning techniques, the authors of the study were able to assemble a concrete set of a priori data to use in the calculation of “time-to-hospital discharge.” The results of this project serve as an important and relevant example of the power of data: beyond predicting survivability, the information produced by the methodical application of a single function could inform the allocation of hospital funds and supplies, direct the treatment regimens of thousands of patients, and provide early indicators of overcrowding (as well as the dangers it poses in the form of increased infection and transmission rates).

In the past, survival analysis has been viewed as a tool of businesses and fiscally focused data mining endeavors. However, there is clearly great opportunity for this data science technique to aid development experts in the future. The ability to succinctly describe connections between covariates like age, gender, sexuality, location, disease history, and stratified social status could be applied to predict time-to-event outcomes beyond the economic realm, such as the probability of abuse occurring in domestic environments or of civil unrest evolving into violence, before they occur. All in all, survival analysis has considerable potential as a tool for developmental planning. Hopefully, this predictive method of analysis will find a way not only to survive, but to thrive in the future of data science.

Works Cited

Nemati, Mohammadreza, et al. “Machine-Learning Approaches in COVID-19 Survival Analysis and Discharge-Time Likelihood Prediction Using Clinical Data.” Patterns, vol. 1, no. 5, 14 Aug 2020. ScienceDirect, https://www.sciencedirect.com/science/article/pii/S2666389920300945?via%3Dihub.

Pölsterl, Sebastian. “Survival Analysis for Deep Learning.” 29 Jul 2019. https://k-d-w.org/blog/2019/07/survival-analysis-for-deep-learning/.

Sawarker, Kunal. “Survival Analysis – What is it? (Part 1 – And how can it solve my business problems?).” Inside Machine Learning, 17 Jul 2019. Medium, https://medium.com/inside-machine-learning/survival-analysis-cb5832ffcd78.

Wang, Ping, et al. “Machine Learning for Survival Analysis: A Survey.” ACM Computing Surveys, vol. 51, no. 6, Feb 2019. https://dl.acm.org/doi/10.1145/3214306.