Discussion:
Short Description
Data science is one of the fastest-growing jobs of the 21st century, and it is a popular and lucrative profession. As a college student who wants to pursue a career in data science or analytics, I expect that analyzing this dataset will help me prepare for my future job.
Dataset link: https://www.kaggle.com/nikhilbhathi/data-scientist-salary-us-glassdoor
The dataset was obtained from Kaggle. It was created by Nikhil Bhathi by scraping job postings for the position of “Data Scientist” in the USA from www.glassdoor.com. The original dataset has 742 rows and 42 columns. Each row represents a data science related job posting, and each column holds a different piece of information about that job.
Dataset Limitation
One limitation of the dataset is that it contains only 742 data science related job postings, which may be too small a sample to provide useful insight. Another limitation is that the author did not document how the data was scraped from the website, so some of the information could be misleading.
library(tidyverse)  # read_csv(), separate(), pivot_longer(), dplyr verbs, ggplot2

ds_jobs = read_csv('data_cleaned_2021.csv')
# Normalize column names: spaces become underscores, everything lowercase
names(ds_jobs) = gsub(' ', '_', names(ds_jobs))
names(ds_jobs) = tolower(names(ds_jobs))
# 0/1 indicator columns, one per skill
skill_vector = c('python', 'spark', 'aws', 'excel', 'sql', 'sas', 'keras', 'pytorch', 'scikit', 'tensor', 'hadoop', 'tableau', 'bi', 'flink', 'mongo', 'google_an')
ds_jobs = ds_jobs %>%
  separate('location', into = c('job_city', 'job_state'), sep = ', ', convert = TRUE) %>%
  separate('headquarters', into = c('headquarters_city', 'headquarters_state'), sep = ', ', convert = TRUE) %>%
  # Long format: one row per job posting per skill
  pivot_longer(all_of(skill_vector), names_to = 'skill', values_to = 'required')
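# After pivot_longer(), ds_jobs has one row per posting per skill, so keep the
# rows for a single skill to count each posting exactly once
job_count = ds_jobs %>%
  filter(skill == skill_vector[1]) %>%
  group_by(job_location) %>%
  summarise(cnt = n()) %>%
  mutate(perc = paste0(sprintf("%4.1f", cnt / sum(cnt) * 100), "%")) %>%
  arrange(desc(cnt)) %>%
  slice(1:10)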
job_count %>%
  mutate(perc = if_else(row_number() == 1, paste(perc, "of all data science jobs in U.S."), perc)) %>%
  ggplot(aes(x = cnt, y = reorder(job_location, cnt))) +
  geom_bar(stat = 'identity', fill = "skyblue2") +
  geom_text(aes(label = perc), hjust = 1.01, vjust = .3, nudge_x = -.1, size = 3.5) +
  ggtitle("Top 10 States by Number of Data Science Related Jobs") +
  labs(x = 'Number of Data Science Related Jobs', y = 'Top 10 States') +
  scale_x_continuous(expand = c(0, 0, 0.1, 0.1)) +
  # theme_minimal() replaces the whole theme, so apply it before the tweaks
  theme_minimal() +
  theme(axis.ticks = element_blank(),
        panel.grid.major.y = element_blank())
For the first plot, I am curious to see which state has the highest total number of data science related jobs.
To create this plot, I first group the jobs by location (state) and make two new columns that record the total number of jobs in each state and its percentage of all jobs. Since I want to show both the count and the percentage in a bar plot, I use mutate() to add “of all data science jobs in US” to the first row of the percentage column so that the visualization is clearer. Before making the plot, I sort the data and select the top 10 states with the most data science jobs. Finally, I use geom_bar() to create the bar graph.
Unsurprisingly, California has the most data science related jobs. One unexpected finding from the graph is that Massachusetts also has a large number of data science jobs.
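The grouping and percentage steps described above can be sketched on a small made-up table. This is only an illustration under stated assumptions: the table toy_jobs and its posting_id column are hypothetical and not part of the Kaggle dataset. Because the skill columns were pivoted into long format earlier, each posting appears once per skill, so the sketch first reduces the table to one row per posting before counting.

```r
library(dplyr)

# Hypothetical long-format data: 3 postings x 2 skills = 6 rows
toy_jobs <- tibble(
  posting_id   = rep(1:3, each = 2),
  job_location = rep(c("CA", "CA", "MA"), each = 2),
  skill        = rep(c("python", "sql"), times = 3),
  required     = c(1, 1, 0, 1, 1, 0)
)

# Keep one row per posting, count postings per state, then add a percentage label
toy_count <- toy_jobs %>%
  distinct(posting_id, job_location) %>%
  count(job_location, name = "cnt", sort = TRUE) %>%
  mutate(perc = paste0(sprintf("%4.1f", cnt / sum(cnt) * 100), "%"))

print(toy_count)
```

In this toy table, CA holds 2 of the 3 postings (66.7%) and MA holds 1 (33.3%). Without the distinct() step, each count would be multiplied by the number of skills, although the percentages would stay the same.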
top_state = ds_jobs %>%
  group_by(job_location) %>%
  summarise(cnt = n()) %>%
  ungroup() %>%
  slice_max(cnt, n = 10) %>%
  pull(job_location)
ds_jobs = ds_jobs %>%
  filter(job_title_sim != 'na') %>%
  filter(job_title_sim != 'director')
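job_salary = ds_jobs %>%
  filter(job_location %in% top_state) %>%
  group_by(job_title_sim, job_location) %>%
  # n_title counts rows (one per posting per skill); the mean is unaffected by that repetition
  summarise(n_title = n(), ave_salary = mean(`avg_salary(k)`), .groups = 'drop')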
For the second plot, I want to analyze whether different job positions lead to different salaries in each state.
After visualizing this graph, I am surprised to see that machine learning engineers have the highest average salary overall. In California, a machine learning engineer can earn around 175k per year. Therefore, I should probably take more machine learning classes in college and learn more machine learning algorithms.
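The heatmap rests on a grouped mean of the average-salary column. A minimal sketch of that aggregation, using made-up numbers, is shown here; the table toy_salary and the column name avg_salary_k are hypothetical stand-ins for the real data.

```r
library(dplyr)

# Hypothetical postings with average salary in thousands of dollars
toy_salary <- tibble(
  job_title_sim = c("data scientist", "data scientist", "analyst", "analyst"),
  job_location  = c("CA", "CA", "CA", "MA"),
  avg_salary_k  = c(150, 130, 80, 70)
)

# One row per (job title, state) pair with its row count and mean salary
toy_avg <- toy_salary %>%
  group_by(job_title_sim, job_location) %>%
  summarise(n_title = n(), ave_salary = mean(avg_salary_k), .groups = "drop")

print(toy_avg)
```

The CA data scientist cell averages 150 and 130 to get 140. The n_title column shows how many rows support each cell, which helps spot averages that rest on only one or two postings.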
ggplot(job_salary) +
  geom_tile(aes(job_location, reorder(job_title_sim, ave_salary), fill = ave_salary), col = "white", lwd = 0.5) +
  scale_fill_viridis_c(option = "magma") +
  ggtitle("Average Salary (K) for Different Data Science Related Jobs") +
  labs(x = 'State', y = 'Job Title', fill = "Average Salary (K)") +
  coord_equal() +
  theme(panel.background = element_rect(fill = "white", colour = "white"))
# The five skills that are flagged as required most often
top_5_skills = ds_jobs %>%
  filter(required == 1) %>%
  count(skill, sort = TRUE) %>%
  head(5) %>%
  pull(skill)
# For each top skill and job title, the number of postings that require the skill
skill_count = ds_jobs %>%
  filter(skill %in% top_5_skills, required == 1) %>%
  count(skill, job_title_sim, sort = TRUE)
For the third plot, I want to analyze whether different job positions require different skills.
ggplot(skill_count, aes(x = n, y = reorder(job_title_sim, n), fill = reorder(skill, n))) +
  geom_bar(stat = 'identity') +
  ggtitle("Skills Required for Each Data Science Related Job") +
  labs(x = 'Count', y = 'Job Title', fill = 'Skill Required (Top 5)') +
  scale_x_continuous(expand = c(0, 0, 0.1, 0.1)) +
  # theme_minimal() replaces the whole theme, so apply it before the tweaks
  theme_minimal() +
  theme(axis.ticks = element_blank(),
        panel.grid.major.y = element_blank())
Lastly, I decided to extend my first portfolio by adding a plot that explores the skills required for each data science job. I first selected the five most frequently required skills and stored them in a vector. Then, for each data science related job title, I counted how many postings require each of those five skills. Finally, I used geom_bar() to show, for each job title, how often each of those skills is required.
As we can see, a large number of data scientist jobs require applicants to know Python, Excel, and SQL, whereas analyst jobs mostly require Excel and SQL.
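The two counting steps behind this plot can be sketched on a small made-up table. The data in toy_skills is hypothetical, and the sketch keeps the top 2 skills rather than the top 5 so that the result is easy to check by hand.

```r
library(dplyr)

# Hypothetical long-format data: one row per posting per skill
toy_skills <- tibble(
  job_title_sim = c(rep("data scientist", 4), rep("analyst", 4)),
  skill         = c("python", "python", "sql", "excel",
                    "sql", "sql", "excel", "python"),
  required      = c(1, 1, 1, 0, 1, 1, 1, 0)
)

# Step 1: the most frequently required skills overall
top_skills <- toy_skills %>%
  filter(required == 1) %>%
  count(skill, sort = TRUE) %>%
  head(2) %>%
  pull(skill)

# Step 2: how often each top skill is required within each job title
toy_skill_count <- toy_skills %>%
  filter(skill %in% top_skills, required == 1) %>%
  count(skill, job_title_sim, sort = TRUE)

print(top_skills)
print(toy_skill_count)
```

In this toy table, sql is required three times and python twice, so those two skills are kept and excel is dropped. The second step then yields one row per skill and job title pair, which is the shape a stacked bar chart needs.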