A Spark Streaming ML Cuisine Classifier pipeline for a recipe data-set provided by Yummly.com

BigBossAnwer/Cuisine-Classifier

Cuisine Classification Pipeline

An accurate, low-latency Spark Streaming Dataset ML cuisine classifier pipeline for a recipe dataset provided by Yummly.com

For a full discussion and analysis, see the online report. Alternatively, download the report and view it in your browser

Prerequisites

  • If using prebuilt .jar:
    • Java
  • If building from source:
    • Scala 2.11.8
    • Spark 2.4.2
    • Hadoop 2.7.3
    • sbt (see build.sbt for dependencies)

Usage

The prebuilt .jar targets Scala 2.11.8, Spark 2.4.2, and Hadoop 2.7.3, and assumes it is running on a distributed cloud instance. For local usage, uncomment line 20 in CuisinePipeline.scala and rebuild.

See Yummly Dataset Schema for the input.json data schema the pipeline assumes (alternatively, see the included train.json sample).

Sample .jar usage:

java -jar /pathToJar/cuisineclassifier_2.11-1.0-Prod Input.json /pathToOutputDir

Sample AWS EMR .jar usage:

spark-submit --deploy-mode cluster --class CuisinePipeline s3://pathToJar/cuisineclassifier_2.11-1.0-Prod s3://pathToInput/input.json s3://pathToOutputDir

Helper usage:

scala SquashParts /pathToOutputDir

Here, /pathToOutputDir is the directory containing the output directories that hold the part files produced by CuisinePipeline:

  • /pathToOutputDir/optParams
  • /pathToOutputDir/confusionMatrix
  • etc.
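
The squashing step described above can be sketched roughly as follows. This is not the repository's SquashParts.scala, only a minimal illustration of the idea: it assumes each output directory holds Spark-style part-* files, and the object name SquashPartsSketch and the <dirName>.txt output naming are hypothetical choices made for this example.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, Paths}
import scala.collection.JavaConverters._

// Hypothetical sketch, not the repository's SquashParts implementation.
object SquashPartsSketch {

  /** Concatenates every part-* file in `dir`, in file-name order, into `target`. */
  def squash(dir: Path, target: Path): Unit = {
    val stream = Files.list(dir)
    val parts =
      try {
        stream.iterator().asScala
          // Skips _SUCCESS markers and hidden .crc checksum files.
          .filter(p => p.getFileName.toString.startsWith("part-"))
          .toList
          .sortBy(_.getFileName.toString)
      } finally stream.close()

    val lines = parts.flatMap(p => Files.readAllLines(p, StandardCharsets.UTF_8).asScala)
    Files.write(target, lines.asJava, StandardCharsets.UTF_8)
  }

  def main(args: Array[String]): Unit = {
    val root = Paths.get(args(0))
    val stream = Files.list(root)
    val subDirs =
      try stream.iterator().asScala.filter(p => Files.isDirectory(p)).toList
      finally stream.close()

    // Assumed naming: one consolidated <dirName>.txt per output directory.
    subDirs.foreach { d =>
      squash(d, root.resolve(d.getFileName.toString + ".txt"))
    }
  }
}
```

Plain concatenation like this would repeat a header row if each part file was written with one, so a real helper producing CSV would likely need to handle that case.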

File Listing

CuisineClassifier/...
	cuisineclassifier_2.11-1.0-Prod - Prebuilt project jar
	build.sbt - Project dependencies sbt file
	src/...
		CuisinePipeline.scala - Core classifier pipeline source code
		SquashParts.scala - Helper object to squash CuisinePipeline output part files into consolidated CSV and text files
		tester.scala - Debugging and alternative solutions testbed class
	resources/
		train.json - Sample input file used for the reported results # Source: https://www.kaggle.com/c/whats-cooking/data
	out/
		* - Sample result files
