# Analyzing Employee Aspirations in Data Science Training
Chapter 1: Introduction to Data Science Aspirations
In today's competitive job market, many individuals aspire to become data scientists. This analysis focuses on a Kaggle dataset that evaluates which employees are inclined to pursue further training in data science.
I used this dataset in my capstone project in applied linear modeling to illustrate how logistic regression can inform business strategy. With the cost of higher education escalating over the past four decades, it is vital for employers to invest in their workforce to ensure a steady talent pipeline and gain a competitive advantage. Data science, a rapidly expanding field in the U.S., combines elements of computer science and statistics to generate valuable business insights. While traditional educational institutions offer courses that prepare students for careers in data science, many companies may find that relying on those institutions does not effectively broaden their talent pool. Labor shortages are also evident in related fields such as data analytics and GIS. Consequently, numerous organizations have started in-house training programs to upskill existing employees as data scientists.
As companies aim to transition their personnel into new roles, it is essential for management to craft targeted training messages for the right employee demographics. Understanding which employees benefit most from training initiatives is crucial for equitable job training. A dataset derived from companies that invested in internal data science training has been compiled, anonymized, and made available on Kaggle, a platform for data science projects. This study investigates how prior experiences influence whether a training recipient seeks a position in the data science field post-training. Key variables such as gender, highest level of education completed, years of experience, and participation in STEM degree paths will serve as control variables in the analysis. A logistic regression model will estimate the probability that a trainee will pursue additional training in the industry based on their previous experiences.
Examining the Dataset:
The dataset comprises 19,158 records of employees who underwent training. After excluding records with missing information, 8,955 entries remained. Of those, 7,452 trained employees did not seek new opportunities, while 1,483 expressed a desire for further training. Gender distribution revealed that 8,973 employees identified as male, 804 as female, and 84 did not disclose their gender.
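The complete-case filtering described above can be sketched in a few lines. This is a minimal illustration on five toy rows, not the Kaggle file itself; the column names and the `drop_incomplete` helper are assumptions made for the example.

```python
import csv
import io

# Toy rows standing in for the Kaggle file; column names are assumptions.
RAW = """enrollee_id,gender,education_level,major_discipline,experience,target
1,Male,Graduate,STEM,5,0
2,,Graduate,STEM,3,1
3,Female,Masters,STEM,,0
4,Female,Phd,Humanities,12,1
5,Male,Graduate,STEM,0,0
"""

def drop_incomplete(rows):
    """Keep only rows in which every field is non-empty (hypothetical helper)."""
    return [r for r in rows if all(v.strip() for v in r.values())]

rows = list(csv.DictReader(io.StringIO(RAW)))
complete = drop_incomplete(rows)

# Tally the outcome among complete cases (1 = wants further training).
n_total = len(rows)
n_complete = len(complete)
n_positive = sum(int(r["target"]) for r in complete)
print(n_total, n_complete, n_positive)
```

Dropping every row with any missing field is the simplest approach, but it discards a lot of data, as the fall from 19,158 to 8,955 records shows.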
Regarding educational attainment, 70% of the employees in the dataset held a bachelor's degree, 27% held a master's degree, and 3% had a doctorate. A striking 90% of the trainees came from STEM backgrounds, compared with only 18% of graduates nationally who earn STEM degrees. This suggests that the participating firms are disproportionately hiring from STEM talent pools.
In terms of experience, the average participant had around 7 years of work experience, with a standard deviation of 5.57 years. The dataset captured experience ranging from 0 to 20 years, with 22% of individuals having no prior experience before training.
Dataset Insights:
The data were notably imbalanced with respect to gender, educational attainment, and major discipline. The sample was predominantly male, and most participants held bachelor's degrees rather than advanced degrees such as a master's or Ph.D. Furthermore, the share of women in this dataset (9%) is lower than their presence in the broader workforce, where women occupy 25% of computer science roles and 15% of engineering positions. This indicates a potential underrepresentation of women in the data science training pool.
Model Analysis:
To assess how previous experience affects the likelihood of pursuing data science roles, I employed a logistic regression model. The dependent variable was coded as 1 for those who wished to continue training and 0 for those who did not. Two critical assumptions were tested: the absence of multicollinearity among the predictors, and a linear relationship between each continuous variable and the log odds of the outcome. The only continuous variable in this case was experience.
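For readers who want to see the mechanics, here is a minimal sketch of fitting a logistic regression with a 0/1 outcome. It uses Newton-Raphson in NumPy on synthetic data; the variable names, the `fit_logit` helper, and the "true" coefficients are assumptions for the demonstration, not values from the study or the software used in it.

```python
import numpy as np

def fit_logit(X, y, n_iter=25):
    """Fit a logistic regression by Newton-Raphson; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # predicted probabilities
        w = p * (1.0 - p)                     # Bernoulli variances
        grad = X.T @ (y - p)                  # gradient of the log-likelihood
        hess = X.T @ (X * w[:, None])         # information matrix
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Synthetic data only: the coefficients below are made up for the demo.
rng = np.random.default_rng(0)
n = 5000
experience = rng.integers(0, 21, size=n).astype(float)  # years, 0 to 20
stem = rng.integers(0, 2, size=n).astype(float)         # 1 = STEM degree
true_beta = np.array([-1.5, -0.05, 0.6])                # intercept, experience, stem

X = np.column_stack([np.ones(n), experience, stem])
p_true = 1.0 / (1.0 + np.exp(-X @ true_beta))
y = (rng.random(n) < p_true).astype(float)              # 1 = wants further training

beta_hat = fit_logit(X, y)
print(dict(zip(["intercept", "experience", "stem"], beta_hat.round(3))))
```

In practice a statistics package would also report standard errors and p-values; the sketch only recovers the point estimates.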
The variance inflation factor (VIF) analysis indicated no multicollinearity among the variables. Additionally, a linear relationship was established between the log odds of the dependent variable and the experience variable.
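As a rough sketch of the multicollinearity check, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The example below runs on synthetic columns whose names are assumptions. It contrasts independent predictors, which give VIFs near 1, with a deliberately near-duplicate column (`tenure`, a hypothetical variable) that inflates them.

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column of X (X has no intercept column)."""
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()   # R^2 of column j on the rest
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic predictors, generated independently of each other.
rng = np.random.default_rng(1)
n = 2000
experience = rng.integers(0, 21, size=n).astype(float)
stem = rng.integers(0, 2, size=n).astype(float)
masters = rng.integers(0, 2, size=n).astype(float)
vif_ok = vif(np.column_stack([experience, stem, masters]))

# A near-duplicate of experience, added to show what collinearity looks like.
tenure = experience + rng.normal(0.0, 0.5, size=n)
vif_bad = vif(np.column_stack([experience, stem, tenure]))

print(vif_ok.round(2), vif_bad.round(2))
```

A common rule of thumb treats VIFs above roughly 5 to 10 as a warning sign, though the threshold is a convention rather than a formal test.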
Three models were analyzed to investigate the impact of relevant experience on the likelihood of trainees seeking further training. Model 1 examined experience's direct effect on the target variable. Model 2 incorporated the control variables, while Model 3 included both prior experience and control variables. Results indicated that individuals without data science experience are approximately 5% more likely to pursue additional training at a 99% confidence level. Conversely, those with a master's degree were found to be 3% less likely, and Ph.D. holders showed a 6% reduction in likelihood at the 95% confidence level. Holding a STEM degree correlated with an 11% increase in the desire for continued training at a 99% confidence level. Interestingly, each additional year of experience was associated with a 0.03% decrease in the likelihood of wanting further training at a 99% confidence level.
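Percentage statements like these depend on how a logit coefficient is translated into a probability. The sketch below shows one such translation for a 0/1 predictor, under stated assumptions: the baseline is the share of trainees wanting further training from the counts reported earlier (1,483 of 8,955), and the coefficient is a hypothetical value, not one of the fitted estimates.

```python
import math

def logistic(z):
    """Inverse of the logit: maps log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1.0 - p))

def probability_shift(baseline_p, coef):
    """Change in predicted probability when a 0/1 predictor switches on."""
    return logistic(logit(baseline_p) + coef) - baseline_p

baseline = 1483 / 8955   # share wanting further training, from the reported counts
coef = 0.6               # hypothetical log-odds coefficient, not a fitted value
odds_ratio = math.exp(coef)
shift = probability_shift(baseline, coef)
print(f"odds ratio {odds_ratio:.2f}, probability shift {shift:+.3f}")
```

Because the logistic curve is nonlinear, the same coefficient implies a different probability shift at a different baseline, so it matters whether an effect is reported as an odds ratio or as a change in probability.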
Discussion:
The analysis suggests that lacking prior data science experience and holding a STEM degree are each associated with a greater desire to continue training in the field. However, it is also crucial to note the gender disparity within the dataset, which may indicate barriers to entry for women in data science training programs. This disparity could reflect broader issues of gender discrimination in hiring practices, where women in STEM fields often face higher qualification expectations than their male counterparts.
As a strategic approach, I would recommend focusing training efforts on employees who hold less than a master's degree and have experience in STEM fields but not specifically in data science. Future research could benefit from incorporating variables related to racial identity and compensation, as these factors significantly impact representation in STEM and the financial burdens of education.
In conclusion, if an employer were to inquire, "Who wants to be a data scientist?" the dataset suggests that the most likely candidates are STEM graduates seeking a career transition. It is equally important for employers to consider who stands to gain the most from apprenticeship programs and ensure that these opportunities are equitable.