phraug

A set of simple Python scripts for pre-processing large files, things like splitting and format conversion. The names phraug comes from a great book, Made to Stick, by Chip and Dan Heath.

See http://fastml.com/processing-large-files-line-by-line/ for the basic idea.

There's always at least one input file and usually one or more output files. An input file always stays unchanged.

Format conversion

csv2libsvm.py <input file> <output file> [<label index = 0>] [<skip headers = 0>]

Convert CSV to LIBSVM format. If there are no labels in the input file, specify label index = -1. If there are headers in the input file, specify skip headers = 1.

csv2vw.py <input file> <output file> [<label index = 0>] [<skip headers = 0>]

Convert CSV to VW format. Arguments as above.

libsvm2csv.py <input file> <output file> <input file dimensionality>

Convert LIBSVM to CSV. You need to specify dimensionality, that is a number of columns (not counting a label).

libsvm2vw.py <input file> <output file>

Convert LIBSVM to VW.

tsv2csv.py <input file> <output file>

Convert tab-separated file to comma-separated file.

Other operations

chunk.py <input file> <number of output files> [<random seed>]

Split a file randomly line by line into a number of smaller files. Might be useful for preparing cross-validation. Output files will have the base nume suffixed with a chunk number, for example data.csv will be chunked into data_0.csv, data_1.csv etc.

count.py <input file>

Count lines in a file. On Unix you can do it with wc -l

sample.py <input file> <output file> [<P = 0.5>]

Sample lines from an input file with probability P. Similiar to split.py, but there's only one output file. Useful for sampling large datasets.

split.py <input file> <output file 1> <output file 2> [<P = 0.9>] [<random seed>]

Split a file into two randomly. Default P (probability of writing to the first file) is 0.9. You can specify any string as a seed for random number generator.

subset.py <input file> <output file> [<offset = 0>] [<lines = 100>]

Save a subset of lines from an input file to an output file. Start at offset (default 0), save lines (default 100).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

phraug

Format conversion

Other operations

About

Releases

Packages

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
.gitattributes		.gitattributes
.gitignore		.gitignore
README.md		README.md
chunk.py		chunk.py
count.py		count.py
csv2libsvm.py		csv2libsvm.py
csv2vw.py		csv2vw.py
libsvm2csv.py		libsvm2csv.py
libsvm2vw.py		libsvm2vw.py
sample.py		sample.py
split.py		split.py
subset.py		subset.py
tsv2csv.py		tsv2csv.py

EVaisman/phraug

Folders and files

Latest commit

History

Repository files navigation

phraug

Format conversion

Other operations

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages