Change the architecture of salento to one that is amenable to streaming #3

Open
cogumbreiro opened this issue Nov 14, 2017 · 0 comments

Salento expects a sequence of packages as input.
The problem is that the file format holding this sequence is a single JSON object, which means that all packages must fit into memory at once to be read. We currently have use cases where the datasets do not fit in memory, so this architecture is a bottleneck for scalability.

We need to:

  1. change the file format to one that is amenable to streaming packages
  2. change the internals (say, train.py) so that data is loaded lazily, using generators as much as possible (versus creating lists upfront)
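
To make the two points above concrete, here is a minimal sketch of one possible direction, not a settled design. It assumes a JSON Lines layout (one JSON-encoded package per line), which is only one candidate format. The names `write_packages`, `read_packages`, and `batches` are hypothetical and do not exist in Salento today, and the shape of a package is left opaque (any JSON-serializable dict).

```python
import itertools
import json
from typing import Any, Dict, Iterable, Iterator, List, TextIO


def write_packages(packages: Iterable[Dict[str, Any]], out: TextIO) -> None:
    """Write one JSON-encoded package per line (assumed JSON Lines layout)."""
    for package in packages:
        # json.dumps without `indent` never emits a raw newline,
        # so each package occupies exactly one line.
        out.write(json.dumps(package) + "\n")


def read_packages(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Lazily yield one package per non-blank line.

    Only the line currently being parsed is held in memory.
    """
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def batches(items: Iterable[Any], size: int) -> Iterator[List[Any]]:
    """Group a lazy stream into lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(itertools.islice(iterator, size))
        if not batch:
            return
        yield batch
```

Since Python file objects iterate line by line, an open file can be passed straight to `read_packages`, and a consumer such as train.py could loop over `batches(read_packages(f), size)` so that only one batch of packages is in memory at a time.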