- Cleaning Data
- Check Missing Values
- Numerical Imputation
- Categorical Imputation
- Handling Outliers
- Outlier Detection with Standard Deviation
- Outlier Detection with Percentiles
- Binning
- Drop Duplicates
- Data Manipulations
- Add String to Each Value in Column
- Drop Rows that Contain a Specific Value
- Guide to Data Preprocessing for Data Science
- Data Preprocessing: 6 Necessary Steps for Data Scientists
- What Is Data Preprocessing in ML?
- Guide to Data Cleaning for Data Science
- Data Cleaning and Preprocessing for Beginners
The simplest solution to missing values is to drop the affected rows or entire columns. There is no optimal threshold for dropping, but you can use 70% as an example value and drop the rows and columns whose missing-value rate is higher than that threshold.
threshold = 0.7
#Dropping columns with missing value rate higher than threshold
data = data[data.columns[data.isnull().mean() < threshold]]
#Dropping rows with missing value rate higher than threshold
data = data.loc[data.isnull().mean(axis=1) < threshold]
Another option is to consider a possible default value for the missing values in a column. For example, if a column contains only 1 and NA, the NA rows most likely correspond to 0.
Apart from the case where such a default value exists, the best imputation approach is usually the median of the column: the mean is sensitive to outlier values, while the median is much more robust in this respect.
#Filling all missing values with 0
data = data.fillna(0)
#Filling missing values with the medians of the numeric columns
data = data.fillna(data.median(numeric_only=True))
Replacing the missing values with the most frequently occurring value (the mode) of a column is a good option for categorical columns. But if the values in the column are distributed roughly uniformly and there is no dominant value, imputing a category like “Other” may be more sensible, because in that case a mode-based imputation amounts to a nearly random selection.
#Filling missing values with the most frequent (max count) value of the column
data['column_name'] = data['column_name'].fillna(data['column_name'].value_counts().idxmax())
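If there is no dominant value, a minimal sketch of the “Other” imputation mentioned above could look like this (it assumes the same data DataFrame and a placeholder column name):
```python
#Filling missing values of a categorical column with a generic "Other" label
data['column_name'] = data['column_name'].fillna('Other')
```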
If a value's distance from the mean is greater than x * standard deviation, it can be treated as an outlier. Then what should x be?
There is no universal answer for x, but usually a value between 2 and 4 is practical.
```python
#Dropping the outlier rows with standard deviation
factor = 3
upper_lim = data['column'].mean() + data['column'].std() * factor
lower_lim = data['column'].mean() - data['column'].std() * factor
data = data[(data['column'] < upper_lim) & (data['column'] > lower_lim)]
```
In addition, z-score can be used instead of the formula above. Z-score (or standard score) standardizes the distance between a value and the mean using the standard deviation.
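As a minimal sketch, the same filter can be expressed with z-scores, assuming the data and factor variables from the block above (this is an alternative to the mean/std formula, not an addition to it):
```python
#Dropping the outlier rows with z-score
z_scores = (data['column'] - data['column'].mean()) / data['column'].std()
data = data[z_scores.abs() < factor]
```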
Another mathematical method to detect outliers is to use percentiles. You can treat a certain percentage of values from the top or the bottom as outliers. The key point, once again, is choosing that percentage, and it depends on the distribution of your data as mentioned earlier.
Additionally, a common mistake is to take the percentage with respect to the range of the data. In other words, if your data ranges from 0 to 100, the top 5% is not simply the values between 96 and 100; it is the values that lie above the 95th percentile of the data.
```python
#Dropping the outlier rows with Percentiles
upper_lim = data['column'].quantile(.95)
lower_lim = data['column'].quantile(.05)
data = data[(data['column'] < upper_lim) & (data['column'] > lower_lim)]
```
Another option for handling outliers is to cap them instead of dropping them. This keeps the data size, which may ultimately be better for the final model performance.
On the other hand, capping affects the distribution of the data, so it is better not to overdo it.
```python
# Capping the outlier rows with percentiles
upper_lim = data['column'].quantile(.95)
lower_lim = data['column'].quantile(.05)
data.loc[data['column'] > upper_lim, 'column'] = upper_lim
data.loc[data['column'] < lower_lim, 'column'] = lower_lim
```
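A more compact way to do the same capping is pandas' clip; a minimal sketch with the limits computed above:
```python
#Capping outliers with clip instead of .loc assignments
data['column'] = data['column'].clip(lower=lower_lim, upper=upper_lim)
```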
Binning can be applied on both categorical and numerical data:
# Numerical Binning Example
Value Bin
0-30 -> Low
31-70 -> Mid
71-100 -> High
# Categorical Binning Example
Value Bin
Spain -> Europe
Italy -> Europe
Chile -> South America
Brazil -> South America
The main motivation of binning is to make the model more robust and prevent overfitting; however, it comes at a cost in performance. Every time you bin something, you sacrifice information and make your data more regularized.
The trade-off between performance and overfitting is the key point of the binning process. In my opinion, for numerical columns, except for some obvious overfitting cases, binning may be redundant for some kinds of algorithms because of its effect on model performance.
However, for categorical columns, labels with low frequencies are likely to hurt the robustness of statistical models. Thus, assigning a general category to these infrequent values helps preserve the robustness of the model. For example, if your data has 100,000 rows, it might be a good option to group the labels that occur fewer than 100 times into a new category like “Other”, as sketched after the binning examples below.
# Numerical Binning Example
data['bin'] = pd.cut(data['value'], bins=[0,30,70,100], labels=["Low", "Mid", "High"])
value bin
0 2 Low
1 45 Mid
2 7 Low
3 85 High
4 28 Low
# Categorical Binning Example
Country
0 Spain
1 Chile
2 Australia
3 Italy
4 Brazil
#Mapping each country to a continent with np.select
conditions = [
    data['Country'].str.contains('Spain'),
    data['Country'].str.contains('Italy'),
    data['Country'].str.contains('Chile'),
    data['Country'].str.contains('Brazil')]
choices = ['Europe', 'Europe', 'South America', 'South America']
data['Continent'] = np.select(conditions, choices, default='Other')
Country Continent
0 Spain Europe
1 Chile South America
2 Australia Other
3 Italy Europe
4 Brazil South America
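For the frequency-based grouping described earlier, a minimal sketch might look like the following (the Country column and the threshold of 100 are only illustrative):
```python
#Grouping labels that occur fewer than 100 times into "Other"
counts = data['Country'].value_counts()
rare_labels = counts[counts < 100].index
data.loc[data['Country'].isin(rare_labels), 'Country'] = 'Other'
```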
- pandas.DataFrame.duplicated
Show all duplicated rows
df[df.duplicated(keep=False)]
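To remove the duplicates rather than just display them, drop_duplicates can be used; a minimal sketch keeping the first occurrence:
```python
#Drop duplicate rows, keeping the first occurrence of each
df = df.drop_duplicates()
```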
- Dropping Rows And Columns In pandas Dataframe
#Add a string prefix to each value in 'my_column'
df['my_column'] = 'some_string' + df['my_column'].astype(str)
#drop rows that contain specific 'value' in 'column_name'
df = df[df.column_name != value]
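A variant for dropping rows that match any of several values uses isin; a minimal sketch with placeholder values:
```python
#drop rows where 'column_name' matches any value in a list (values are placeholders)
df = df[~df.column_name.isin([value_1, value_2])]
```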
- Choose specific rows and columns
subset_df = df.iloc[[0,5,7],[1,2,3,4,5]]
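A label-based counterpart with loc, sketched with hypothetical column names:
```python
# Select rows by index label and columns by name (column names are hypothetical)
subset_df = df.loc[[0, 5, 7], ['col_a', 'col_b', 'col_c']]
```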