Add files via upload #1

Open · wants to merge 1 commit into master
1,529 changes: 1,529 additions & 0 deletions 02_PYTHON/01_Naive_Bayes_Classifier/Naive_Bayes_Classifier.ipynb


54 changes: 54 additions & 0 deletions 02_PYTHON/01_Naive_Bayes_Classifier/README.md
@@ -0,0 +1,54 @@
### Task
In this assignment, you will implement a naive Bayes classifier based on histograms.

### Command-line Arguments

You must implement a program that learns a naive Bayes classifier for a classification problem, given some training data and some additional options. In particular, your program will be invoked as follows:
- naive_bayes <training_file> <test_file> histograms <number>

### Training: Histograms

If the third command-line argument is histograms, then you should model P(x | class) as a histogram separately for each dimension of the data.
The number of bins for each histogram is specified by the fourth command-line argument.
Suppose that you are building a histogram of N bins for the j-th dimension of the data and for the c-th class.
Let S be the smallest and L be the largest value in the j-th dimension among all training data belonging to the c-th class.
Let G = (L-S)/(N-3). G will be the width of all bins, except for bin 0 and bin N-1, whose width is infinite.
If you get a value of G that is less than 0.0001, then set G to 0.0001. Your bins should have the following ranges:

- Bin 0 covers the interval (-infinity, S-G/2).
- Bin 1 covers the interval [S-G/2, S+G/2).
- Bin 2 covers the interval [S+G/2, S+G+G/2).
- Bin 3 covers the interval [S+G+G/2, S+2G+G/2).
- ...
- Bin N-2 covers the interval [S+(N-4)G+G/2, S+(N-3)G+G/2), which is the same as [L-G/2, L+G/2).
- Bin N-1 covers the interval [S+(N-3)G+G/2, +infinity), which is the same as [L+G/2, +infinity).
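
A minimal Python sketch of this binning scheme (the repository's implementations are Python notebooks); the function name `histogram_bins` and the use of NumPy are illustrative choices, not part of the assignment:

```python
import numpy as np

def histogram_bins(values, n_bins):
    """Return the bin width G and a function mapping a value to its bin index,
    following the ranges listed above (bins 0 and N-1 are open-ended)."""
    S, L = float(np.min(values)), float(np.max(values))
    G = (L - S) / (n_bins - 3)
    if G < 0.0001:                      # enforce the minimum bin width
        G = 0.0001

    def bin_index(x):
        if x < S - G / 2:               # bin 0: (-infinity, S - G/2)
            return 0
        # bins 1 .. N-2 each have width G, starting at S - G/2
        idx = 1 + int((x - (S - G / 2)) // G)
        return min(idx, n_bins - 1)     # bin N-1: [L + G/2, +infinity)

    return G, bin_index
```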

The output of the training phase should be a sequence of lines like this:
`Class %d, attribute %d, bin %d, P(bin | class) = %.2f`
The output lines should be sorted by class number. Within the same class, lines should be sorted by attribute number.
Within the same attribute, lines should be sorted by bin number. Attributes and bins should be numbered starting from 0, not from 1.
In computing the value that you store at each bin of each histogram, you must use Equation 2.241 on page 120 of the textbook.
Notice that the width of the bin appears in the denominator of that equation. As mentioned above, the minimum bin width is 0.0001: if your value of G is smaller than that, set G to 0.0001.
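
The textbook is not included here, so the sketch below assumes the referenced equation is the standard histogram density estimate, P(bin i | class) = n_i / (N * width), with the count of training values in the bin in the numerator and the bin width in the denominator; it also assumes the width G is used for every bin, including the two open-ended ones. It reuses `histogram_bins` from the sketch above.

```python
def bin_probabilities(values, n_bins):
    """Estimate P(bin | class) for one attribute of one class,
    using the histogram estimate n_i / (N * G) described above."""
    G, bin_index = histogram_bins(values, n_bins)
    counts = [0] * n_bins
    for x in values:
        counts[bin_index(x)] += 1
    N = len(values)
    return G, bin_index, [c / (N * G) for c in counts]

# Training output, sorted by class, then attribute, then bin:
# print(f"Class {c}, attribute {j}, bin {b}, P(bin | class) = {p:.2f}")
```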

In your answers.pdf document, provide the output produced by the training stage of your program when given yeast_training.txt as the input file, using seven bins for each histogram.

### Classification

For each test object, you should print a line containing the following information:
- object ID: the line number where that object occurs in the test file. Start numbering objects at 0, not at 1.
- predicted class: the result of the classification. If your classification result is a tie among two or more classes, choose one of them randomly.
- probability of the predicted class given the data.
- true class: taken from the last column of the test file.
- accuracy, defined as follows:
  - If there were no ties in your classification result and the predicted class is correct, the accuracy is 1.
  - If there were no ties in your classification result and the predicted class is incorrect, the accuracy is 0.
  - If there were ties in your classification result and the correct class was one of the classes that tied for best, the accuracy is 1 divided by the number of classes that tied for best.
  - If there were ties in your classification result and the correct class was NOT one of the classes that tied for best, the accuracy is 0.
To produce this output in a uniform manner, use these printing statements:
For C or C++, use:
printf("ID=%5d, predicted=%3d, probability = %.4lf, true=%3d, accuracy=%4.2lf\n",
       object_id, predicted_class, probability, true_class, accuracy);
For Java, use:
System.out.printf("ID=%5d, predicted=%3d, probability = %.4f, true=%3d, accuracy=%4.2f\n",
                  object_id, predicted_class, probability, true_class, accuracy);
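
Since the repository code is Python rather than C or Java, here is a hedged Python equivalent of the classification step and the required output line; the data-structure names (`priors`, `bin_probs`, `bin_index_fns`) are illustrative, not prescribed by the assignment:

```python
import random

def classify(x, priors, bin_probs, bin_index_fns):
    """Classify one test object x (a list of attribute values).
    priors[c] = P(class c); bin_probs[c][j][b] = P(bin b | class c) for
    attribute j; bin_index_fns[c][j] maps a value to its bin index."""
    scores = {}
    for c in priors:
        p = priors[c]                     # raw products; logs would be more stable
        for j, xj in enumerate(x):
            p *= bin_probs[c][j][bin_index_fns[c][j](xj)]
        scores[c] = p
    best = max(scores.values())
    tied = [c for c, s in scores.items() if s == best]
    predicted = random.choice(tied)       # break ties randomly
    total = sum(scores.values())
    probability = scores[predicted] / total if total > 0 else 1.0 / len(scores)
    return predicted, probability, tied

# For each test object (IDs start at 0):
# accuracy = (1.0 / len(tied)) if true_class in tied else 0.0
# print(f"ID={object_id:5d}, predicted={predicted:3d}, probability = {probability:.4f}, "
#       f"true={true_class:3d}, accuracy={accuracy:4.2f}")
```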
3,498 changes: 3,498 additions & 0 deletions 02_PYTHON/01_Naive_Bayes_Classifier/data/pendigits_test.txt


7,494 changes: 7,494 additions & 0 deletions 02_PYTHON/01_Naive_Bayes_Classifier/data/pendigits_training.txt


2,000 changes: 2,000 additions & 0 deletions 02_PYTHON/01_Naive_Bayes_Classifier/data/satellite_test.txt


4,435 changes: 4,435 additions & 0 deletions 02_PYTHON/01_Naive_Bayes_Classifier/data/satellite_training.txt


484 changes: 484 additions & 0 deletions 02_PYTHON/01_Naive_Bayes_Classifier/data/yeast_test.txt


1,000 changes: 1,000 additions & 0 deletions 02_PYTHON/01_Naive_Bayes_Classifier/data/yeast_training.txt


976 changes: 976 additions & 0 deletions 02_PYTHON/02_Gaussian_Naive_Bayes/Gaussian_Naive_Bayes.ipynb


20 changes: 20 additions & 0 deletions 02_PYTHON/02_Gaussian_Naive_Bayes/README.md
@@ -0,0 +1,20 @@
### Task

You must implement a program that learns a naive Bayes classifier for a classification problem, given some training data and some additional options.
In particular, your program will be invoked as follows:
- naive_bayes <training_file> <test_file> gaussians

### Training: Gaussians

If the third command-line argument is gaussians, then you should model P(x | class) as a Gaussian separately for each dimension of the data.
The output of the training phase should be a sequence of lines like this:
`Class %d, attribute %d, mean = %.2f, std = %.2f`
The output lines should be sorted by class number.
Within the same class, lines should be sorted by attribute number.
Attributes should be numbered starting from 0, not from 1.
In certain cases, the value computed for the standard deviation may be equal to zero.
Your code should make sure that the variance of the Gaussian is NEVER smaller than 0.0001.
Since the variance is the square of the standard deviation, this means that the standard deviation should never be smaller than sqrt(0.0001) = 0.01.
Any time the value for the standard deviation is computed to be smaller than 0.01, your code should replace that value with 0.01.
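
A minimal Python sketch of this training step, under stated assumptions: `X` holds only the attribute columns of the training file, `y` holds the class labels, and the population standard deviation (ddof=0) is used, since the assignment does not say which estimator to apply:

```python
import numpy as np

def train_gaussians(X, y):
    """Per-class, per-attribute mean and standard deviation, printed in the
    required order (sorted by class, then attribute, both numbered from 0)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    params = {}
    for c in sorted(set(y.tolist())):
        Xc = X[y == c]
        mean = Xc.mean(axis=0)
        std = np.maximum(Xc.std(axis=0), 0.01)   # keep variance >= 0.0001
        params[c] = (mean, std)
        for j in range(X.shape[1]):
            print(f"Class {c}, attribute {j}, mean = {mean[j]:.2f}, std = {std[j]:.2f}")
    return params
```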

In your answers.pdf document, provide the output produced by the training stage of your program when given yeast_training.txt as the input file.