First of all, given the quarantine, I want to give a huge shoutout to all of the nurses, doctors, grocery clerks, public administrators, and anyone else who is putting their life at risk to serve their community.
Let’s not take this for granted. Take this time in isolation to learn new skills, read books, and improve yourself. For those interested in data, data analytics, or data science, I’m providing a list of fourteen data science projects that you can do during your spare time!
There are three types of projects:
- Visualization projects
- Exploratory data analysis (EDA) projects
- Prediction modeling
Perhaps the quickest projects to complete are data visualizations! Below are three interesting datasets that you can use to create some intriguing visualizations to add to your portfolio.
Learn how to build dynamic visualizations with Plotly, such as an animated choropleth map, to show how the coronavirus has spread globally over time! Plotly is an amazing library that makes data visualizations dynamic, appealing, and simple.
Australian Wildfire Visualizations
The 2019–2020 bushfire season, also known as the Black Summer, consisted of several extreme wildfires starting in June 2019. The fires burnt an estimated 18.6 million hectares and destroyed over 5,900 buildings, according to Wikipedia.
This makes for an interesting project! Leverage your data visualization skills using Plotly or Matplotlib to show the magnitude and geographical impact of the wildfires.
Earth Surface Temperature Visualization
Know any climate change deniers? Create some data visualizations to show how the Earth’s surface temperatures have changed over time. You can do this by creating a line graph or another animated choropleth map!
Bonus: create a prediction model that shows what Earth’s temperatures are expected to be in fifty years.
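If you want a starting point for the bonus, here is a minimal sketch that fits a straight line to yearly averages and extends it fifty years. The numbers below are made up for illustration, and a linear trend is a simplifying assumption of mine, not a claim about how the climate actually behaves, so treat it as a baseline to improve on.

```python
# Made-up yearly average temperatures (degrees Celsius) for illustration only.
# Replace these with values aggregated from the real dataset.
years = [1970, 1980, 1990, 2000, 2010, 2020]
temps = [14.0, 14.2, 14.4, 14.6, 14.8, 15.0]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(years, temps)

# Extrapolate the fitted line fifty years past the last observation.
target_year = years[-1] + 50
forecast = slope * target_year + intercept

print(f"Trend: {slope * 10:.2f} degrees C per decade")
print(f"Naive linear forecast for {target_year}: {forecast:.1f} degrees C")
```

Once this works, try swapping the straight line for something richer and compare how far the forecasts drift apart.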
Exploratory Data Analysis Projects
Exploratory data analysis (EDA), also known as data exploration, is a step in the data analysis process in which a number of techniques are used to better understand the dataset being used.
New York Airbnb Data Exploration
Since 2008, guests and hosts have used Airbnb to expand on traveling possibilities and present more personalized ways of experiencing the world. This dataset contains information on 2019 listings in New York, including their geographical information, prices, number of reviews, and more.
Some questions that you can try to answer are as follows:
- Which hosts are the busiest and why?
- What areas have more traffic than others and why is that the case?
- Are there any relationships between prices, number of reviews, and the number of days that a given listing is booked?
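For the last question, a quick first pass is to compute pairwise Pearson correlations. The sketch below uses a handful of made-up listings, and the column names (`price`, `number_of_reviews`, `days_booked`) are assumptions for illustration, so check them against the actual file before reusing it.

```python
from math import sqrt

# Made-up listings for illustration; the column names are assumptions.
listings = [
    {"price": 50, "number_of_reviews": 120, "days_booked": 300},
    {"price": 80, "number_of_reviews": 90, "days_booked": 250},
    {"price": 120, "number_of_reviews": 60, "days_booked": 200},
    {"price": 200, "number_of_reviews": 25, "days_booked": 120},
    {"price": 350, "number_of_reviews": 5, "days_booked": 40},
]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


columns = ["price", "number_of_reviews", "days_booked"]
correlations = {}
for i, first in enumerate(columns):
    for second in columns[i + 1:]:
        xs = [row[first] for row in listings]
        ys = [row[second] for row in listings]
        correlations[(first, second)] = pearson(xs, ys)

for (first, second), r in correlations.items():
    print(f"{first} vs {second}: r = {r:+.2f}")
```

Keep in mind that a correlation only hints at a relationship. It does not explain why it exists, which is the more interesting part of the exploration.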
Most Important Factors related to Employee Attrition and Performance
IBM created a synthetic dataset that you can use to understand how various factors affect employee attrition and satisfaction. Some of the variables include education, job involvement, performance rating, and work-life balance.
Explore this dataset and see if there are any variables that significantly affect employee satisfaction. Take it a step further and see if you can rank the variables from most to least important.
World University Rankings
Do you think your country has the best university in the world? What does it mean to be the ‘best’ university to start with? This dataset contains three global university rankings. Using this data, see if you can answer the following questions:
- What countries are the top universities in?
- What are the main factors that determine one’s world ranking?
Alcohol and school success
Does alcohol affect students’ grades? If not, what does? This data was obtained from a survey of students in math and Portuguese language courses in secondary school. It contains several variables, such as alcohol consumption, family size, and involvement in extracurriculars.
Using this, explore the relationship between school performance and various factors. As a bonus, see if you can predict a student’s final grade based on other variables!
Pokemon Data Exploration
For all of you gamers out there, here’s a dataset that contains information on all 802 Pokemon from all seven generations. Here are several questions that you can try to answer!
- Which generation has the strongest Pokemon? Which has the weakest?
- What Pokemon type is the strongest? The weakest?
- Is it possible to build a classifier to identify a legendary Pokemon?
- Are there any correlations between physical traits and strength stats (attack, defense, speed, etc.)?
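The first question boils down to a group-by: average a strength measure per generation and compare. Here is a minimal sketch on made-up rows. The field names (`generation`, `base_total`) and the choice of total base stats as the definition of "strongest" are my assumptions, so feel free to define strength differently.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows for illustration; names and stats are not real Pokemon data.
pokemon = [
    {"name": "mon_a", "generation": 1, "base_total": 300},
    {"name": "mon_b", "generation": 1, "base_total": 500},
    {"name": "mon_c", "generation": 2, "base_total": 320},
    {"name": "mon_d", "generation": 2, "base_total": 600},
    {"name": "mon_e", "generation": 3, "base_total": 280},
    {"name": "mon_f", "generation": 3, "base_total": 400},
]

# Collect the strength measure for each generation.
totals_by_generation = defaultdict(list)
for row in pokemon:
    totals_by_generation[row["generation"]].append(row["base_total"])

# Average strength per generation.
average_by_generation = {
    generation: mean(totals)
    for generation, totals in totals_by_generation.items()
}

strongest = max(average_by_generation, key=average_by_generation.get)
weakest = min(average_by_generation, key=average_by_generation.get)

for generation, average in sorted(average_by_generation.items()):
    print(f"Generation {generation}: average base total {average}")
print(f"Strongest: generation {strongest}, weakest: generation {weakest}")
```

The same pattern answers the type question if you group by type instead of generation.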
Exploring Factors of Life Expectancy
The WHO created a dataset on the health status of all countries over time, which includes statistics on life expectancy, adult mortality, and more. Using this dataset, explore the relationships between various variables. What has the biggest impact on life expectancy?
This dataset was created to answer the following questions:
- Do the predicting factors that were chosen initially really affect life expectancy? Which predicting variables actually affect life expectancy?
- Should a country with a lower life expectancy value (<65) increase its healthcare expenditure in order to improve its average lifespan?
- How do infant and adult mortality rates affect life expectancy?
- Does life expectancy have a positive or negative correlation with eating habits, lifestyle, exercise, smoking, drinking alcohol, etc.?
- What is the impact of schooling on the lifespan of humans?
- Does life expectancy have a positive or negative relationship with drinking alcohol?
- Do densely populated countries tend to have a lower life expectancy?
- What is the impact of immunization coverage on life expectancy?
Time Series Forecast on Energy Consumption
This dataset is composed of power consumption data from PJM’s website. PJM is a regional transmission organization in the United States. Using this dataset, see if you can build a time series model to predict energy consumption. In addition to that, see if you can find trends around hours of the day, holiday energy usage, and long-term trends!
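Before reaching for a fancy model, it helps to have an hour-of-day profile and a baseline forecast to beat. The sketch below runs on synthetic hourly readings that I generated for illustration (not PJM data) and uses a seasonal naive forecast, which simply repeats the last observed day. That baseline is my choice here, not something the dataset prescribes.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean

# Synthetic hourly load in megawatts: three days, higher during the daytime.
start = datetime(2020, 1, 1)
readings = []
for offset in range(72):
    timestamp = start + timedelta(hours=offset)
    load = 1200 if 8 <= timestamp.hour < 20 else 1000
    readings.append((timestamp, load))

# Average load for each hour of the day across all days.
loads_by_hour = defaultdict(list)
for timestamp, load in readings:
    loads_by_hour[timestamp.hour].append(load)
hourly_profile = {hour: mean(loads) for hour, loads in sorted(loads_by_hour.items())}


def seasonal_naive(history, horizon, season=24):
    """Forecast each future step with the value seen one season earlier."""
    values = [load for _, load in history]
    last_season = values[-season:]
    return [last_season[step % season] for step in range(horizon)]


# Forecast the next 24 hours by repeating the most recent day.
forecast = seasonal_naive(readings, horizon=24)

print("Average load at noon:", hourly_profile[12])
print("Average load at 3 am:", hourly_profile[3])
print("Forecast for the next day:", forecast)
```

Any model you build afterwards should at least beat this baseline on held-out data, otherwise the extra complexity is not paying off.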
Loan Prediction Forecast
Taken from Analytics Vidhya, this dataset has 615 rows and 13 columns on past loans that have and haven’t been approved. See if you can create a model that predicts whether a loan will get approved or not.
Used Car Price Estimator
Craigslist is the world’s largest collection of used vehicles for sale. This dataset is composed of data scraped from Craigslist and is updated every few months. Using this dataset, see if you can create a model that predicts whether a car listing is overpriced or underpriced.
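One simple way to get started is to compare each listing to the typical price of comparable cars and flag the ones that stray too far. The sketch below is a rough baseline under my own assumptions: the rows are made up, the field names (`model`, `year`, `price`) are hypothetical, and the 15% tolerance is arbitrary.

```python
from collections import defaultdict
from statistics import median

# Made-up listings for illustration; field names are assumptions.
listings = [
    {"id": 1, "model": "sedan_x", "year": 2015, "price": 9000},
    {"id": 2, "model": "sedan_x", "year": 2015, "price": 10000},
    {"id": 3, "model": "sedan_x", "year": 2015, "price": 11000},
    {"id": 4, "model": "sedan_x", "year": 2015, "price": 15000},
    {"id": 5, "model": "truck_y", "year": 2018, "price": 30000},
    {"id": 6, "model": "truck_y", "year": 2018, "price": 31000},
    {"id": 7, "model": "truck_y", "year": 2018, "price": 22000},
]


def label_listings(rows, tolerance=0.15):
    """Label each listing relative to the median price of its (model, year) group."""
    prices_by_group = defaultdict(list)
    for row in rows:
        prices_by_group[(row["model"], row["year"])].append(row["price"])

    labels = {}
    for row in rows:
        typical_price = median(prices_by_group[(row["model"], row["year"])])
        ratio = row["price"] / typical_price
        if ratio > 1 + tolerance:
            labels[row["id"]] = "overpriced"
        elif ratio < 1 - tolerance:
            labels[row["id"]] = "underpriced"
        else:
            labels[row["id"]] = "fair"
    return labels


labels = label_listings(listings)
for listing_id, label in labels.items():
    print(listing_id, label)
```

A real estimator would replace the group median with a trained regression model that accounts for mileage, condition, and location, then apply the same comparison to the model's predicted price.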
Detecting Credit Card Fraud
This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions. Learn how to work with unbalanced datasets and build a credit card fraud detection model.
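To see why the imbalance matters, here is a small sketch using the counts quoted above. It shows that a model which never flags fraud still looks very accurate, which is why precision and recall are more useful here. The confusion-matrix numbers passed to `precision_recall` are hypothetical, and the class weights follow the common "balanced" heuristic (the same formula scikit-learn uses for `class_weight="balanced"`).

```python
# Counts taken from the dataset description above.
total_transactions = 284_807
fraud_transactions = 492
legit_transactions = total_transactions - fraud_transactions

fraud_rate = fraud_transactions / total_transactions

# Accuracy of a useless model that always predicts "not fraud".
baseline_accuracy = legit_transactions / total_transactions


def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall), using 0.0 when a ratio is undefined."""
    predicted_positive = true_positives + false_positives
    actual_positive = true_positives + false_negatives
    precision = true_positives / predicted_positive if predicted_positive else 0.0
    recall = true_positives / actual_positive if actual_positive else 0.0
    return precision, recall


# The always-"not fraud" model catches nothing, so its recall is zero.
baseline_precision, baseline_recall = precision_recall(0, 0, fraud_transactions)

# Hypothetical model: catches 400 frauds and raises 100 false alarms.
model_precision, model_recall = precision_recall(400, 100, 92)

# "Balanced" class weights: n_samples / (n_classes * n_samples_in_class).
class_weights = {
    "legit": total_transactions / (2 * legit_transactions),
    "fraud": total_transactions / (2 * fraud_transactions),
}

print(f"Fraud rate: {fraud_rate:.5f}")
print(f"Baseline accuracy: {baseline_accuracy:.5f}, recall: {baseline_recall}")
print(f"Hypothetical model precision: {model_precision:.2f}, recall: {model_recall:.2f}")
print(f"Class weights: {class_weights}")
```

Resampling techniques such as undersampling the majority class or oversampling the minority class are other options worth comparing against class weighting.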
Originally published at: https://towardsdatascience.com/