
In non-test learning datasets, missing data are common: not all students finish all the items. When the numbers of students and items are large, the data can be extremely sparse.

The package deals with the sparse structure in two ways:

  • Efficient memory storage.

The data are indexed with collapsed lists. Memory usage is roughly 3 times the size of the text data file, so a workstation with 6 GB of free memory can handle a 2 GB data file.
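
A minimal sketch of what such a collapsed-list index might look like, assuming the raw data arrive as (user_id, item_id, response) tuples; the function and variable names here are illustrative, not the package's actual API:

```python
from collections import defaultdict

def build_collapsed_index(records):
    """Collapse (user_id, item_id, response) records into per-item lists.

    Instead of a dense user-by-item matrix, each item keeps two parallel
    lists: the integer indices of the users who answered it and their
    responses. Memory grows with the number of observed responses, not
    with users x items.
    """
    user_idx = {}                               # user_id -> integer index
    item2data = defaultdict(lambda: ([], []))   # item_id -> (user idxs, responses)
    for user_id, item_id, response in records:
        uidx = user_idx.setdefault(user_id, len(user_idx))
        uids, resps = item2data[item_id]
        uids.append(uidx)
        resps.append(response)
    return user_idx, dict(item2data)

records = [("u1", "i1", 1), ("u1", "i2", 0), ("u2", "i1", 1)]
user_idx, item2data = build_collapsed_index(records)
# item2data["i1"] == ([0, 1], [1, 1]); "u2" never saw "i2", so no slot is wasted
```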

If the data are truly big, say billions of records, a MongoDB DAO is implemented to go beyond the memory limit of a single workstation by using cloud/server storage.
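
For illustration, a minimal pymongo-based DAO might stream one item's responses at a time rather than loading the whole dataset into memory; the connection string, collection, and field names below are assumptions, not the package's actual schema:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local server
coll = client["irt"]["responses"]                  # hypothetical db/collection
coll.create_index("item_id")                       # make per-item lookups fast

def iter_item_responses(item_id):
    """Yield (user_id, response) pairs for one item, streamed from the server."""
    cursor = coll.find({"item_id": item_id},
                       {"_id": 0, "user_id": 1, "response": 1})
    for doc in cursor:
        yield doc["user_id"], doc["response"]
```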

  • No joint estimation.

Under the conditional independence assumption, estimating each item's parameters SEPARATELY is consistent but inefficient.

scipy's minimize matches cvxopt.cp and MATLAB's fmincon on item parameter estimation to the 6th decimal place, which can be considered identical for all practical purposes.

However, convergence is fairly slow: it takes roughly 10k observations per item to recover the parameters to within 0.01 precision. A sketch of the per-item approach follows.
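
To make the SEPARATELY point concrete, here is a sketch of estimating one 2PL item's slope and intercept with scipy.optimize.minimize, assuming ability values are already available for each respondent (in the package's actual EM routine these would come from the posterior over abilities); the parameter names and bounds are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(params, thetas, responses):
    """Negative log-likelihood of a single 2PL item given fixed abilities."""
    alpha, beta = params
    p = 1.0 / (1.0 + np.exp(-(alpha * thetas + beta)))
    p = np.clip(p, 1e-10, 1.0 - 1e-10)  # guard against log(0)
    return -np.sum(responses * np.log(p) + (1.0 - responses) * np.log(1.0 - p))

# Simulate ~10k observations for one item, matching the sample size the
# text suggests is needed for ~0.01 precision.
rng = np.random.default_rng(0)
thetas = rng.normal(size=10_000)
true_alpha, true_beta = 1.5, -0.5
responses = rng.binomial(1, 1.0 / (1.0 + np.exp(-(true_alpha * thetas + true_beta))))

# Each item is optimized on its own; other items never enter this objective.
result = minimize(neg_log_lik, x0=[1.0, 0.0], args=(thetas, responses),
                  method="L-BFGS-B", bounds=[(0.25, 4.0), (-4.0, 4.0)])
print(result.x)  # should land near (1.5, -0.5)
```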
