Removes lines whose length in characters is outside the given range
- min (int) : Minimum length (inclusive)
- max (int) : Maximum length (inclusive)
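A minimal sketch of such a length check follows. The function name within_length is hypothetical, and applying the bounds to both source and target is an assumption, since the description does not say which side is measured.

```python
def within_length(src: str, tgt: str, min_len: int, max_len: int) -> bool:
    """Return True when the pair should be kept.

    Assumption: both source and target must fall inside
    [min_len, max_len], with both bounds inclusive.
    """
    return all(min_len <= len(s) <= max_len for s in (src, tgt))


pairs = [("hello", "hola"), ("hi", "x" * 500)]
kept = [p for p in pairs if within_length(*p, min_len=1, max_len=100)]
```

Only the first pair survives here, because the second target exceeds the upper bound.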
Removes lines when the total count of the given characters differs between source and target.
- chars (str) : Characters to check (()[]?!:."“”{})
Removes lines that contain these words
- words (list(str)) : List of words
Removes lines when digits appear in source but not in target, or vice versa
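A short sketch of this presence check, using a hypothetical function name. It compares only whether each side contains at least one digit, not how many or which ones.

```python
def _has_digit(line: str) -> bool:
    # str.isdigit() also accepts non-ASCII digit characters.
    return any(c.isdigit() for c in line)


def digits_mismatch(src: str, tgt: str) -> bool:
    """Return True when the pair should be removed."""
    return _has_digit(src) != _has_digit(tgt)
```

A pair such as ("3 cats", "tre gatti") would be removed, while ("3 cats", "3 gatti") would be kept.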
Removes lines when the ratio of numerical characters to the total length of the line is greater than max.
- max (float) : Maximum ratio (0.4)
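The ratio can be sketched as follows. The names are hypothetical, and removing the pair when either side exceeds the threshold is an assumption, since the description does not say whether source, target, or both are checked.

```python
def numeric_ratio(line: str) -> float:
    # Share of digit characters in the line; an empty line counts as 0.
    if not line:
        return 0.0
    return sum(c.isdigit() for c in line) / len(line)


def too_numeric(src: str, tgt: str, max_ratio: float = 0.4) -> bool:
    """Return True when the pair should be removed.

    Assumption: the pair is removed if either side exceeds max_ratio.
    """
    return any(numeric_ratio(s) > max_ratio for s in (src, tgt))
```

The comparison is strict, so a line sitting exactly at the threshold is kept.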
Removes lines when source is the same as target
Selects the part of a dataset located between the top and bottom percentiles (useful with very large datasets).
- top_percentile (float) : Percentile where data collection begins
- bottom_percentile (float) : Percentile where data collection ends
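The selection can be illustrated with a slice over an in-memory list. This is only a sketch: the function name is hypothetical, and it assumes percentiles are given on a 0 to 100 scale and measured from the start of the dataset.

```python
def percentile_slice(lines: list, top_percentile: float, bottom_percentile: float) -> list:
    """Return the lines between the two percentiles.

    Assumptions: percentiles are in the range 0-100, the start index is
    included and the end index is excluded.
    """
    n = len(lines)
    start = int(n * top_percentile / 100)
    end = int(n * bottom_percentile / 100)
    return lines[start:end]


subset = percentile_slice(list(range(10)), top_percentile=20, bottom_percentile=50)
```

With ten lines, percentiles 20 and 50 select the lines at indices 2, 3 and 4.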
Removes lines when the first character is a letter but the case is mismatched, or the first character in source is not the same as the first character in target.
Removes lines when the count of non-alphanumeric characters (excluding spaces) differs between source and target
Removes lines when the ratio of non-alphanumeric characters to the total length of the line is greater than max.
- max (float) : Maximum ratio (0.4)
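A possible shape for this ratio check, with hypothetical names. Two points are assumptions, because the description leaves them open: spaces are not counted as non-alphanumeric, and the pair is removed when either side exceeds the threshold.

```python
def nonalnum_ratio(line: str) -> float:
    # Share of characters that are neither alphanumeric nor whitespace.
    if not line:
        return 0.0
    symbols = sum(1 for c in line if not c.isalnum() and not c.isspace())
    return symbols / len(line)


def too_symbolic(src: str, tgt: str, max_ratio: float = 0.4) -> bool:
    """Return True when the pair should be removed (assumed: either side)."""
    return any(nonalnum_ratio(s) > max_ratio for s in (src, tgt))
```

Ordinary punctuation stays well below the threshold, while lines made up mostly of symbols are dropped.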
Removes lines when the ratio (len(source) / len(target)) is outside of bounds
- min (float) : Lower bound (inclusive)
- max (float) : Upper bound (inclusive)
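The bound check follows directly from the formula len(source) / len(target). In the sketch below the function name is hypothetical, and treating an empty target as out of bounds is an assumption made to avoid a division by zero.

```python
def length_ratio_ok(src: str, tgt: str, min_ratio: float, max_ratio: float) -> bool:
    """Return True when the pair should be kept (bounds inclusive)."""
    if not tgt:
        # Assumption: an empty target has no defined ratio, so reject it.
        return False
    return min_ratio <= len(src) / len(tgt) <= max_ratio
```

For example, with bounds 0.5 and 2.0, ("hello", "hola") has a ratio of 1.25 and is kept, while a 27-character source paired with a 2-character target is removed.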
Adds only the top X% of lines from the dataset
- percent (float) : Percentage of dataset to include
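As a rough illustration, taking the top X% amounts to keeping a prefix of the dataset. The function name is hypothetical, and the sketch assumes percent is on a 0 to 100 scale and that "top" means the first lines in dataset order.

```python
def top_percent(lines: list, percent: float) -> list:
    """Return the first `percent` percent of lines (rounded down)."""
    keep = int(len(lines) * percent / 100)
    return lines[:keep]


head = top_percent(list(range(10)), percent=30)
```

With ten lines and percent=30, the first three lines are kept.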
Removes lines when source and target have a different number of uppercase letters