Feature Engineering with Hitters Dataset

M. Akif Bıyıklı
Published in Analytics Vidhya · 5 min read · Jul 10, 2021


An end-to-end machine learning project.

Hello,
In this article, I will walk through the feature engineering steps I applied to the Hitters dataset. I will explain what information we took into account when creating features and which techniques we preferred. Because some of the decisions I made may vary from project to project, I do not share the code for every stage; I share only the code related to feature engineering, which is our main focus. The goal of this project is to build an end-to-end machine learning pipeline that predicts player salaries after feature engineering and preprocessing.

The Hitters dataset includes various statistics of players who played in the 1986–1987 season. The dataset consists of 20 variables and 322 observations; only the Salary variable has missing observations. The variables are defined as follows:

AtBat: Number of times at bat during the 1986–1987 season

Hits: Number of hits made in the 1986–1987 season

HmRun: Number of home runs in the 1986–1987 season

Runs: Number of runs he scored for his team in the 1986–1987 season

RBI: Number of runs batted in during the 1986–1987 season, i.e. runs that scored as a result of the batter's play

Walks: Number of walks in the 1986–1987 season (a batter is awarded first base after the opposing pitcher throws four balls)

Years: Number of years the player has played in the major leagues

CAtBat: Number of times at bat during his career

CHits: Number of hits made in the career

CHmRun: Number of home runs during his career

CRuns: Number of runs he scored for his team during his career

CRBI: Number of runs batted in during his career

CWalks: Number of walks during his career

League: A factor with levels A and N indicating the player's league at the end of 1986

Division: A factor with levels E and W indicating the player's division at the end of 1986

PutOuts: Number of putouts in the 1986–1987 season (plays in which the fielder is credited with getting an opposing player out)

Assists: Number of assists made by the player in the 1986–1987 season

Errors: Player’s errors in the 1986–1987 season

Salary: The salary of the player in the 1986–1987 season (in thousands of dollars)

NewLeague: A factor with A and N levels showing the player’s league at the start of the 1987 season

First, I will talk about the features obtained by relating the numerical variables to one another. One point to consider here is avoiding NaN or inf values: we add 1 to every numeric variable whose minimum value is 0. After this step, we can safely divide the numeric variables by each other.
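The shift described above can be sketched as follows. This is a minimal illustration on a tiny made-up sample with Hitters-style column names, not the original project code; the variable name `zero_min_cols` is an assumption.

```python
import pandas as pd

# Tiny made-up sample with Hitters-style column names (values are illustrative only)
df = pd.DataFrame({
    "AtBat": [293, 315, 0],
    "Hits": [66, 81, 0],
    "CAtBat": [293, 3449, 120],
    "League": ["A", "N", "A"],
})

# Numeric columns whose minimum is 0 would yield inf or NaN when used as a divisor
num_cols = df.select_dtypes(include="number").columns
zero_min_cols = [col for col in num_cols if df[col].min() == 0]

# Shift only those columns by 1 so that later divisions are well defined
df[zero_min_cols] = df[zero_min_cols] + 1
```

Columns that never reach 0 (here `CAtBat`) are left untouched.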

Quantiles of numeric variables

The first group of features captures the ratio of the relevant season to the entire career. We obtained these by dividing the season variables defined above by their career counterparts. We also tried to define new player types. The Runner type lets his teammates on base score by hitting accurately so that they can complete their runs. The Hit and Run type, on the other hand, represents players who can advance to the next base themselves after an accurate hit.
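A season-to-career ratio of this kind might look like the sketch below. The sample values and the `NEW_*_RATIO` feature names are assumptions for illustration; the exact formulas behind the Runner and Hit and Run types are not given in the text, so they are not reproduced here.

```python
import pandas as pd

# Made-up sample: season statistics and their career counterparts (prefix "C")
df = pd.DataFrame({
    "AtBat": [293, 315], "Hits": [66, 81], "HmRun": [1, 7],
    "CAtBat": [293, 3449], "CHits": [66, 835], "CHmRun": [1, 69],
})

season_cols = ["AtBat", "Hits", "HmRun"]
for col in season_cols:
    # Share of the career total that was produced in the season (hypothetical names)
    df[f"NEW_{col}_RATIO"] = df[col] / df[f"C{col}"]
```

A first-year player has a ratio of 1 for every statistic, while a veteran's ratios are much smaller.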

Players must complete the run by passing all the bases

The second group of features consists of career statistics divided by the number of years played. We expect players with high annual averages to receive high salaries.
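As a rough sketch, the per-year averages could be computed like this. The sample rows and the `NEW_*_PER_YEAR` names are assumptions made for the example.

```python
import pandas as pd

# Made-up sample with career totals and years played
df = pd.DataFrame({
    "Years": [1, 14],
    "CAtBat": [293, 3449],
    "CHits": [66, 835],
    "CRuns": [30, 321],
})

career_cols = ["CAtBat", "CHits", "CRuns"]
for col in career_cols:
    # Average career output per year in the major leagues (hypothetical names)
    df[f"NEW_{col}_PER_YEAR"] = df[col] / df["Years"]
```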

The third group of features assigns each player a level based on the years they have played. We labeled players with 2 years or less as Junior, those between 2 and 5 years as Mid, those between 5 and 10 years as Senior, and those with more than 10 years as Expert.
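One way to express these levels is with `pd.cut`. In this sketch the column name `NEW_LEVEL` is hypothetical, and the handling of the boundaries (exactly 5 years counted as Mid, exactly 10 as Senior) is an assumption, since the text does not say which side they fall on.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Years": [1, 2, 3, 5, 6, 10, 11, 24]})

# Right-closed bins: (0, 2], (2, 5], (5, 10], (10, inf)
df["NEW_LEVEL"] = pd.cut(
    df["Years"],
    bins=[0, 2, 5, 10, np.inf],
    labels=["Junior", "Mid", "Senior", "Expert"],
)
```

The result is an ordered categorical column, which is convenient if the levels are later ordinal-encoded.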

The fourth feature group combines the division a player plays in with his years of experience. The E division stands for East and the W division stands for West.
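A combined division-and-experience label could be built as in the sketch below. The label format (`East_Junior`, `West_Mid`, and so on) and the column name `NEW_DIV_LEVEL` are assumptions; the article does not state the exact names it used.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Division": ["E", "W", "E"], "Years": [1, 4, 14]})

# Experience level from years played (bin edges follow the levels described earlier)
level = pd.cut(
    df["Years"],
    bins=[0, 2, 5, 10, np.inf],
    labels=["Junior", "Mid", "Senior", "Expert"],
).astype(str)

# Spell out the division and join it with the level (hypothetical label format)
division_name = df["Division"].map({"E": "East", "W": "West"})
df["NEW_DIV_LEVEL"] = division_name + "_" + level
```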

The fifth feature group concerns which league the players play in the following season. The American League, represented by A, is also known as the Junior Circuit. The National League, represented by N, is known as the Senior Circuit. The champions of the two leagues meet in a final series. In this group, we labeled players who played in league A in the first season and in league N the following season as Ascend, and players who made the opposite move as Descend.
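The league-change feature can be sketched with `np.select`. The column name `NEW_LEAGUE_MOVE` and the `Stay` label for players who did not change league are assumptions added for the example; only Ascend and Descend come from the text.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "League": ["A", "N", "A", "N"],
    "NewLeague": ["N", "A", "A", "N"],
})

conditions = [
    (df["League"] == "A") & (df["NewLeague"] == "N"),  # Junior Circuit to Senior Circuit
    (df["League"] == "N") & (df["NewLeague"] == "A"),  # Senior Circuit to Junior Circuit
]

# "Stay" is a hypothetical label for players whose league did not change
df["NEW_LEAGUE_MOVE"] = np.select(conditions, ["Ascend", "Descend"], default="Stay")
```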

In the last group of features, there are various ratios between related variables. The features in this group are derived from the variables of the relevant season.
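The article does not list which ratios were used, so the following is only an illustration of the idea with ratios chosen for the example. The feature names and the sample values are assumptions, and the zero-valued columns are assumed to have been shifted by 1 beforehand.

```python
import pandas as pd

# Made-up season statistics (illustrative values only)
df = pd.DataFrame({
    "AtBat": [293, 315],
    "Hits": [66, 81],
    "Runs": [30, 24],
    "Walks": [14, 39],
})

# Illustrative within-season ratios; not necessarily the ones used in the project
df["NEW_HITS_PER_ATBAT"] = df["Hits"] / df["AtBat"]
df["NEW_RUNS_PER_HIT"] = df["Runs"] / df["Hits"]
df["NEW_WALKS_PER_ATBAT"] = df["Walks"] / df["AtBat"]
```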

In the preprocessing stage, we suppressed outliers. We kept values between the 1st and 99th percentiles and capped values outside this range at those thresholds. We suppressed only these extreme tails because the number of observations in our dataset was small. We applied label, ordinal, and one-hot encoding to the categorical variables. We filled the missing values with KNNImputer and scaled the data with RobustScaler. We did not scale our target variable, Salary, because we wanted to compare results by RMSE on the original salary scale. Lastly, we fit a Multiple Linear Regression model. On the test dataset we obtained R-squared: 0.6506 and RMSE: 268.4507.
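The preprocessing and modeling steps above might be chained roughly as follows. This is a sketch on synthetic data, not the original project code: the data, the number of neighbors, the split ratio, and the order of the steps (taken from the order in which the text mentions them) are all assumptions, and the scores it produces have nothing to do with the results reported above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the engineered dataset (not the real Hitters data)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Hits": rng.integers(1, 200, n).astype(float),
    "Years": rng.integers(1, 20, n).astype(float),
    "Division": rng.choice(["E", "W"], n),
})
df["Salary"] = 3 * df["Hits"] + 20 * df["Years"] + rng.normal(0, 50, n)
df.loc[df.sample(frac=0.1, random_state=0).index, "Salary"] = np.nan

# 1) Suppress outliers: cap numeric columns at their 1st and 99th percentiles
num_cols = ["Hits", "Years", "Salary"]
low, high = df[num_cols].quantile(0.01), df[num_cols].quantile(0.99)
df[num_cols] = df[num_cols].clip(lower=low, upper=high, axis=1)

# 2) One-hot encode the categorical variable
df = pd.get_dummies(df, columns=["Division"], drop_first=True, dtype=float)

# 3) Fill missing values (only Salary has any) from the nearest neighbors
imputed = KNNImputer(n_neighbors=5).fit_transform(df)
df = pd.DataFrame(imputed, columns=df.columns, index=df.index)

# 4) Scale the features with RobustScaler; the target stays in salary units
feature_cols = [c for c in df.columns if c != "Salary"]
df[feature_cols] = RobustScaler().fit_transform(df[feature_cols])

# 5) Fit a multiple linear regression and score it on the held-out test set
X_train, X_test, y_train, y_test = train_test_split(
    df[feature_cols], df["Salary"], test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)
```

For simplicity the imputer and scaler are fit on the full dataset before the split; fitting them on the training set only would avoid leaking information from the test set.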

Finally, some of the engineered features turned out to be more important than the original variables. You can find the importance graph below.

Feature Importance Graph.
