Movie Data Set

This is a dataset of movies. The original data from the UCI database have been cleaned up and converted from HTML to CSV.

- Syntax errors in the original HTML files have been cleaned up.
- A script for converting the HTML files into CSV was added (see the file data/make).
- Some additional columns (e.g. Ref:) were omitted from the CSV output because they cannot be parsed easily by a computer. The HTML files were left intact, so you can easily continue working on them.
The following files should be “ready”:
- actors.csv
- casts.csv
- remakes.csv
- studios.csv
- synonyms.csv
- main.html
- people.html
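
As a quick sanity check on any of the CSV files, a minimal Python sketch like the one below reads a file and reports rows whose column count differs from the header's. It assumes the files are comma-separated and start with a header row, neither of which is stated here. The helper names (load_rows, ragged_rows) and the sample data are made up for illustration.

```python
import csv
import io


def load_rows(source):
    """Read CSV rows from an open text file and return (header, rows).

    Assumption: the first row is a header row.
    """
    reader = csv.reader(source)
    header = next(reader, [])
    rows = list(reader)
    return header, rows


def ragged_rows(header, rows):
    """Return (row_number, row) pairs whose column count differs from the header's.

    Row numbers count the header as row 1, so the first data row is row 2.
    """
    return [
        (number, row)
        for number, row in enumerate(rows, start=2)
        if len(row) != len(header)
    ]


# Made-up sample standing in for a real file; to check a real one, use e.g.
#   with open("actors.csv", newline="") as f: header, rows = load_rows(f)
sample = io.StringIO("id,name\n1,Alice\n2,Bob,extra\n")
header, rows = load_rows(sample)
bad = ragged_rows(header, rows)
```

An empty result from ragged_rows only shows that the row lengths are consistent. It says nothing about whether the values themselves are correct.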

How to contribute

Despite the effort, there are many ways this dataset can be enhanced:

- Characterize what type of films the database contains. Currently, the description only says “over 10000 films including many older, odd, and cult films”, which is not a very precise definition.
- Clean up one of these files: awtypes.html, locales.html, quotes.html or sayings.html.
- Devise a way of making the non-machine-readable columns (usually the last one) machine-readable.
- Make the date columns machine-readable (they currently contain a lot of noise).
- Check for errors in the data.
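
For the date columns, one possible starting point is to pull the first plausible four-digit year out of each noisy value. The sketch below is only an illustration. The kind of noise in the real columns is not described here, so the example strings are invented, and the accepted range (1800 to 2099) and the function name extract_year are assumptions.

```python
import re

# A four-digit year from 1800 to 2099 that is not part of a longer digit run.
YEAR_RE = re.compile(r"(?<!\d)(1[89]\d\d|20\d\d)(?!\d)")


def extract_year(raw):
    """Return the first plausible year found in raw as an int, or None."""
    match = YEAR_RE.search(raw or "")
    return int(match.group(1)) if match else None


# Invented examples of noisy values
examples = ["c. 1934?", "1951-52", "unknown", ""]
years = [extract_year(value) for value in examples]
```

Values for which extract_year returns None can be collected and reviewed by hand instead of being guessed at.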