Bad_alloc with even the smallest datasets #1
Comments
I'm mostly offline until next week but in the spirit of providing a quick
response: have you tried integers for the preference observations (i.e., in
train/validation/test.tsv)? If that's not it, I'll have a better look once
I'm back.
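For example (just a guess at the relevant lines in the script quoted below), casting the counts before the TSVs are written:
df_train['Count'] = df_train['Count'].astype('int32')
and likewise for df_test and df_val.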
…On Sun, Jul 29, 2018 at 2:23 PM david-cortes ***@***.***> wrote:
I've been trying to run this software on an artificially generated dataset,
and I am constantly running out of memory (bad_alloc) even with small
datasets.
As an example, I generated the following random data in a Python script:
import numpy as np, pandas as pd
from scipy.sparse import coo_matrix
from sklearn.model_selection import train_test_split
nusers = 200
nitems = 300
ntopics = 30
nwords = 250
np.random.seed(1)
a=.3 + np.random.gamma(.1, .05)
b=.3 + np.random.gamma(.1, .05)
c=.3 + np.random.gamma(.1, .05)
d=.3 + np.random.gamma(.1, .05)
e=.3 + np.random.gamma(.1, .05)
f=.5 + np.random.gamma(.1, .05)
g=.3 + np.random.gamma(.1, .05)
h=.5 + np.random.gamma(.1, .05)
np.random.seed(1)
Beta = np.random.gamma(a, b, size=(nwords, ntopics))
Theta = np.random.gamma(c, d, size=(nitems, ntopics))
W = np.random.poisson(Theta.dot(Beta.T) + np.random.gamma(1, 1, size=(nitems, nwords)), size=(nitems, nwords))
Eta = np.random.gamma(e, f, size=(nusers, ntopics))
Epsilon = np.random.gamma(g, h, size=(nitems, ntopics))
R = np.random.poisson(Eta.dot(Theta.T+Epsilon.T) + np.random.gamma(1, 1, size=(nusers, nitems)), size=(nusers, nitems))
Rcoo=coo_matrix(R)
df = pd.DataFrame({
'UserId':Rcoo.row,
'ItemId':Rcoo.col,
'Count':Rcoo.data
})
df_train, df_test = train_test_split(df, test_size=0.3, random_state=1)
df_test, df_val = train_test_split(df_test, test_size=0.33, random_state=2)
df_train.sort_values(['UserId', 'ItemId'], inplace=True)
df_test.sort_values(['UserId', 'ItemId'], inplace=True)
df_val.sort_values(['UserId', 'ItemId'], inplace=True)
df_train['Count'] = df_train.Count.values.astype('float32')
df_test['Count'] = df_test.Count.values.astype('float32')
df_val['Count'] = df_val.Count.values.astype('float32')
df_train.to_csv("<dir>/train.tsv", sep='\t', index=False, header=False)
df_test.to_csv("<dir>/test.tsv", sep='\t', index=False, header=False)
df_val.to_csv("<dir>/validation.tsv", sep='\t', index=False, header=False)
pd.DataFrame({"UserId":list(set(list(df_test.UserId.values)))})\
.to_csv("<dir>/test_users.tsv", index=False, header=False)
Wcoo = coo_matrix(W)
Wdf = pd.DataFrame({
'ItemId':Wcoo.row,
'WordId':Wcoo.col,
'Count':Wcoo.data
})

def mix(a, b):
    # Build one mult.dat line: '<number of words> <word_id>:<count> ...'
    nx = len(a)
    out = str(nx) + " "
    for i in range(nx):
        out += str(a[i]) + ":" + str(float(b[i])) + " "
    return out
Wdf.groupby('ItemId').agg(lambda x: tuple(x)).apply(lambda x: mix(x['WordId'], x['Count']), axis=1)\
.to_frame().to_csv("<dir>/mult.dat", index=False, header=False)
pd.DataFrame({'col1':np.arange(nwords)}).to_csv("<dir>/vocab.dat", index=False, header=False)
This generates files that look as follows:
- train.tsv:
0 0 4.0
0 1 6.0
0 5 5.0
0 7 5.0
0 9 2.0
0 10 5.0
- test.tsv:
0 2 1.0
0 4 4.0
0 12 4.0
0 14 3.0
0 16 4.0
- validation.tsv
0 23 5.0
0 30 3.0
0 32 1.0
0 33 2.0
0 46 3.0
- test_users.tsv:
0
1
2
3
4
- vocab.dat:
0
1
2
3
4
5
- mult.dat:
141 0:2.0 1:4.0 2:1.0 3:2.0 5:1.0 6:2.0 9:2.0 11:1.0 15:2.0 16:3.0 17:4.0 19:3.0 21:1.0 22:4.0 23:1.0 24:3.0 26:1.0 27:1.0 29:1.0 32:3.0 33:2.0 34:1.0 35:2.0 36:1.0 39:2.0 41:1.0 42:6.0 44:1.0 45:2.0 47:1.0 48:1.0 53:5.0 54:2.0 57:1.0 63:6.0 65:1.0 66:2.0 67:1.0 68:1.0 69:1.0 72:1.0 73:1.0 76:5.0 78:1.0 79:5.0 80:1.0 83:2.0 84:3.0 86:1.0 88:5.0 89:1.0 90:4.0 92:1.0 93:2.0 94:1.0 96:2.0 98:1.0 100:4.0 107:2.0 108:1.0 109:2.0 112:2.0 113:4.0 116:1.0 119:1.0 120:2.0 124:3.0 125:7.0 129:2.0 130:1.0 132:3.0 136:1.0 137:1.0 138:3.0 139:2.0 140:1.0 143:4.0 144:2.0 145:2.0 146:10.0 148:2.0 149:2.0 150:1.0 152:4.0 155:6.0 156:2.0 157:3.0 159:2.0 161:4.0 162:1.0 163:2.0 170:1.0 171:1.0 173:3.0 174:4.0 175:3.0 176:1.0 177:1.0 180:2.0 183:1.0 185:1.0 186:2.0 187:4.0 189:1.0 190:2.0 194:1.0 196:2.0 197:2.0 198:2.0 199:4.0 200:3.0 202:2.0 204:1.0 205:1.0 206:1.0 208:1.0 209:1.0 210:3.0 212:2.0 214:1.0 217:1.0 218:1.0 219:2.0 220:1.0 221:1.0 223:2.0 226:1.0 227:1.0 228:1.0 231:1.0 232:4.0 233:4.0 235:1.0 236:2.0 238:3.0 239:1.0 242:1.0 243:1.0 246:4.0 248:2.0 249:2.0
156 1:1.0 2:1.0 3:3.0 5:2.0 7:1.0 8:1.0 9:1.0 10:1.0 13:1.0 15:2.0 17:1.0 19:2.0 21:3.0 22:3.0 23:2.0 24:1.0 26:1.0 27:1.0 28:1.0 31:1.0 33:1.0 34:5.0 36:2.0 38:1.0 39:4.0 40:1.0 41:1.0 42:1.0 43:4.0 44:2.0 46:2.0 47:3.0 50:1.0 52:1.0 53:3.0 54:2.0 56:2.0 57:1.0 58:4.0 59:2.0 60:3.0 63:1.0 66:1.0 67:2.0 69:2.0 74:2.0 75:2.0 77:1.0 78:3.0 79:1.0 81:3.0 82:2.0 83:1.0 84:3.0 85:2.0 86:3.0 88:2.0 89:3.0 92:1.0 94:1.0 96:1.0 97:2.0 98:1.0 99:3.0 100:1.0 101:2.0 103:1.0 104:1.0 106:3.0 110:1.0 113:1.0 115:1.0 118:2.0 120:4.0 121:3.0 122:1.0 123:3.0 128:1.0 133:3.0 135:1.0 137:1.0 138:2.0 139:2.0 141:1.0 143:2.0 147:1.0 148:2.0 149:1.0 151:1.0 154:1.0 155:4.0 157:1.0 158:1.0 160:4.0 161:2.0 162:5.0 163:1.0 164:5.0 165:1.0 166:1.0 167:4.0 168:3.0 170:1.0 172:1.0 175:1.0 177:1.0 180:4.0 181:1.0 183:1.0 184:1.0 186:1.0 187:1.0 189:1.0 190:5.0 193:2.0 194:3.0 195:7.0 197:2.0 198:2.0 200:1.0 201:1.0 202:2.0 207:2.0 208:2.0 209:1.0 210:3.0 212:8.0 213:2.0 214:2.0 216:1.0 217:1.0 218:1.0 220:4.0 222:1.0 223:1.0 224:2.0 225:4.0 226:1.0 227:1.0 228:6.0 229:3.0 230:1.0 231:1.0 232:1.0 236:2.0 237:1.0 238:2.0 240:2.0 242:1.0 243:2.0 244:2.0 245:2.0 246:3.0 247:6.0 248:2.0 249:2.0
(I tried varying between integers and decimals for the values in this last
one, but it didn't make a difference.)
These seem to me to fit the description of the files on the main page.
However, when I try to run the program on this data (with and without the
last two arguments):
collabtm -dir ~/<dir> -nusers 200 -ndocs 300 -nvocab 250 -k 20 -fixeda -lda-init
it starts allocating a lot of memory, reaching around 8 GB, after which it
throws bad_alloc and terminates.
Am I missing something?
Yes, I tried it that way too, but then I get the following error message:
I took a better look.
+ The dataset you were generating was too small (it created problems
when splitting cold start documents). While this is a limitation of
the code, I'm not sure it's worth fixing since this is a toy case at
best.
+ The other thing is that I output integers instead of decimal values,
i.e., in your script I changed
df_train['Count'] = df_train.Count.values.astype('float32')
df_test['Count'] = df_test.Count.values.astype('float32')
df_val['Count'] = df_val.Count.values.astype('float32')
to
df_train['Count'] = df_train.Count.values.astype('int32')
df_test['Count'] = df_test.Count.values.astype('int32')
df_val['Count'] = df_val.Count.values.astype('int32')
+ Finally, you should remove the -lda-init switch unless you actually
have an LDA fit (see the adjusted command below).
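For reference, that would be the invocation from your message with only
-lda-init dropped (keeping -fixeda as in your example):
collabtm -dir ~/<dir> -nusers 200 -ndocs 300 -nvocab 250 -k 20 -fixeda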
I hope that helps.
Best,
Laurent
…On Mon, Jul 30, 2018 at 9:04 AM david-cortes ***@***.***> wrote:
Yes, I tried it that way too, but then I get the following error message:
collabtm: matrix.hh:1166: void D2Array<T>::load(std::__cxx11::string, uint32_t, bool) const [with T = double; std::__cxx11::string = std::__cxx11::basic_string<char>; uint32_t = unsigned int]: Assertion `f' failed.
Aborted
After trying with larger datasets, it seems to run the inference procedure, and what I guess is computing precision metrics, but it still seems to fail due to the cold-start part at the end:
By default the code needs access to the lda fits for coldstart
inference. The log file infer.log (in the experiment's folder) has
details about what it's looking for exactly.
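(If useful, one way to check what it was looking for after a failed run --
the folder name here is a placeholder, since it depends on your setup:
tail -n 20 <experiment-folder>/infer.log)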
Alternatively, you could try commenting out line 2227 in collabtm.cc.
Then it should simply re-use the topics learned during the run using
the other documents.
Hope that helps!
…On Wed, Aug 8, 2018 at 1:36 AM david-cortes ***@***.***> wrote:
After trying with larger datasets, it seems to run the inference procedure, and what I guess is computing precision metrics, but it still seems to fail due to the cold-start part at the end:
coldstart local inference and HOL
collabtm: matrix.hh:1166: void D2Array<T>::load(std::__cxx11::string, uint32_t, bool) const [with T = double; std::__cxx11::string = std::__cxx11::basic_string<char>; uint32_t = unsigned int]: Assertion `f' failed.
Aborted
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub, or mute the thread.