Bad_alloc with even the smallest datasets #1

david-cortes · 2018-07-29T18:23:04Z

I've been trying to run this software on an artifically-generated dataset, and I am constantly running out of memory (bad_alloc) even in small datasets.

As an example, I generated the following random data in a Python script:

import numpy as np, pandas as pd
from scipy.sparse import coo_matrix
from sklearn.model_selection import train_test_split

nusers = 200
nitems = 300
ntopics = 30
nwords = 250

np.random.seed(1)
a=.3 + np.random.gamma(.1, .05)
b=.3 + np.random.gamma(.1, .05)
c=.3 + np.random.gamma(.1, .05)
d=.3 + np.random.gamma(.1, .05)
e=.3 + np.random.gamma(.1, .05)
f=.5 + np.random.gamma(.1, .05)
g=.3 + np.random.gamma(.1, .05)
h=.5 + np.random.gamma(.1, .05)

np.random.seed(1)
Beta = np.random.gamma(a, b, size=(nwords, ntopics))
Theta = np.random.gamma(c, d, size=(nitems, ntopics))
W = np.random.poisson(Theta.dot(Beta.T) + np.random.gamma(1, 1, size=(nitems, nwords)), size=(nitems, nwords))

Eta = np.random.gamma(e, f, size=(nusers, ntopics))
Epsilon = np.random.gamma(g, h, size=(nitems, ntopics))
R = np.random.poisson(Eta.dot(Theta.T+Epsilon.T) + np.random.gamma(1, 1, size=(nusers, nitems)), size=(nusers, nitems))

Rcoo=coo_matrix(R)
df = pd.DataFrame({
    'UserId':Rcoo.row,
    'ItemId':Rcoo.col,
    'Count':Rcoo.data
})

df_train, df_test = train_test_split(df, test_size=0.3, random_state=1)
df_test, df_val = train_test_split(df_test, test_size=0.33, random_state=2)

df_train.sort_values(['UserId', 'ItemId'], inplace=True)
df_test.sort_values(['UserId', 'ItemId'], inplace=True)
df_val.sort_values(['UserId', 'ItemId'], inplace=True)

df_train['Count'] = df_train.Count.values.astype('float32')
df_test['Count'] = df_test.Count.values.astype('float32')
df_val['Count'] = df_val.Count.values.astype('float32')

df_train.to_csv("<dir>/train.tsv", sep='\t', index=False, header=False)
df_test.to_csv("<dir>/test.tsv", sep='\t', index=False, header=False)
df_val.to_csv("<dir>/validation.tsv", sep='\t', index=False, header=False)
pd.DataFrame({"UserId":list(set(list(df_test.UserId.values)))})\
.to_csv("<dir>/test_users.tsv", index=False, header=False)


Wcoo = coo_matrix(W)
Wdf = pd.DataFrame({
    'ItemId':Wcoo.row,
    'WordId':Wcoo.col,
    'Count':Wcoo.data
})
def mix(a, b):
    nx = len(a)
    out=str(nx) + " "
    for i in range(nx):
        out += str(a[i]) + ":" + str(float(b[i])) + " "
    return out
Wdf.groupby('ItemId').agg(lambda x: tuple(x)).apply(lambda x: mix(x['WordId'], x['Count']), axis=1)\
.to_frame().to_csv("<dir>/mult.dat", index=False, header=False)

pd.DataFrame({'col1':np.arange(nwords)}).to_csv("<dir>/vocab.dat", index=False, header=False)

Generating files that look as follows:

train.tsv:

test.tsv:

validation.tsv

test_users.tsv:

vocab.dat:

mult.dat:

141 0:2.0 1:4.0 2:1.0 3:2.0 5:1.0 6:2.0 9:2.0 11:1.0 15:2.0 16:3.0 17:4.0 19:3.0 21:1.0 22:4.0 23:1.0 24:3.0 26:1.0 27:1.0 29:1.0 32:3.0 33:2.0 34:1.0 35:2.0 36:1.0 39:2.0 41:1.0 42:6.0 44:1.0 45:2.0 47:1.0 48:1.0 53:5.0 54:2.0 57:1.0 63:6.0 65:1.0 66:2.0 67:1.0 68:1.0 69:1.0 72:1.0 73:1.0 76:5.0 78:1.0 79:5.0 80:1.0 83:2.0 84:3.0 86:1.0 88:5.0 89:1.0 90:4.0 92:1.0 93:2.0 94:1.0 96:2.0 98:1.0 100:4.0 107:2.0 108:1.0 109:2.0 112:2.0 113:4.0 116:1.0 119:1.0 120:2.0 124:3.0 125:7.0 129:2.0 130:1.0 132:3.0 136:1.0 137:1.0 138:3.0 139:2.0 140:1.0 143:4.0 144:2.0 145:2.0 146:10.0 148:2.0 149:2.0 150:1.0 152:4.0 155:6.0 156:2.0 157:3.0 159:2.0 161:4.0 162:1.0 163:2.0 170:1.0 171:1.0 173:3.0 174:4.0 175:3.0 176:1.0 177:1.0 180:2.0 183:1.0 185:1.0 186:2.0 187:4.0 189:1.0 190:2.0 194:1.0 196:2.0 197:2.0 198:2.0 199:4.0 200:3.0 202:2.0 204:1.0 205:1.0 206:1.0 208:1.0 209:1.0 210:3.0 212:2.0 214:1.0 217:1.0 218:1.0 219:2.0 220:1.0 221:1.0 223:2.0 226:1.0 227:1.0 228:1.0 231:1.0 232:4.0 233:4.0 235:1.0 236:2.0 238:3.0 239:1.0 242:1.0 243:1.0 246:4.0 248:2.0 249:2.0 
156 1:1.0 2:1.0 3:3.0 5:2.0 7:1.0 8:1.0 9:1.0 10:1.0 13:1.0 15:2.0 17:1.0 19:2.0 21:3.0 22:3.0 23:2.0 24:1.0 26:1.0 27:1.0 28:1.0 31:1.0 33:1.0 34:5.0 36:2.0 38:1.0 39:4.0 40:1.0 41:1.0 42:1.0 43:4.0 44:2.0 46:2.0 47:3.0 50:1.0 52:1.0 53:3.0 54:2.0 56:2.0 57:1.0 58:4.0 59:2.0 60:3.0 63:1.0 66:1.0 67:2.0 69:2.0 74:2.0 75:2.0 77:1.0 78:3.0 79:1.0 81:3.0 82:2.0 83:1.0 84:3.0 85:2.0 86:3.0 88:2.0 89:3.0 92:1.0 94:1.0 96:1.0 97:2.0 98:1.0 99:3.0 100:1.0 101:2.0 103:1.0 104:1.0 106:3.0 110:1.0 113:1.0 115:1.0 118:2.0 120:4.0 121:3.0 122:1.0 123:3.0 128:1.0 133:3.0 135:1.0 137:1.0 138:2.0 139:2.0 141:1.0 143:2.0 147:1.0 148:2.0 149:1.0 151:1.0 154:1.0 155:4.0 157:1.0 158:1.0 160:4.0 161:2.0 162:5.0 163:1.0 164:5.0 165:1.0 166:1.0 167:4.0 168:3.0 170:1.0 172:1.0 175:1.0 177:1.0 180:4.0 181:1.0 183:1.0 184:1.0 186:1.0 187:1.0 189:1.0 190:5.0 193:2.0 194:3.0 195:7.0 197:2.0 198:2.0 200:1.0 201:1.0 202:2.0 207:2.0 208:2.0 209:1.0 210:3.0 212:8.0 213:2.0 214:2.0 216:1.0 217:1.0 218:1.0 220:4.0 222:1.0 223:1.0 224:2.0 225:4.0 226:1.0 227:1.0 228:6.0 229:3.0 230:1.0 231:1.0 232:1.0 236:2.0 237:1.0 238:2.0 240:2.0 242:1.0 243:2.0 244:2.0 245:2.0 246:3.0 247:6.0 248:2.0 249:2.0

(tried varying between integers and decimals for the values in this last one, but it didn't make a difference)

Which I think seem to fit the description of the files in the main page.

However, after trying to run this program on this data (with and without the last two argments):

collabtm -dir ~/<dir> -nusers 200 -ndocs 300 -nvocab 250 -k 20 -fixeda -lda-init

It starts allocating a lot of memory, until allocating around 8GB, after which it throws bad_alloc and terminates.

Am I missing something?

The text was updated successfully, but these errors were encountered:

lcharlin · 2018-07-30T11:48:23Z

I'm mostly offline until next week but in the spirit of providing a quick response: have you tried integers for the preference observations (i.e., in train/validation/test.tsv)? If that's not it, I'll have a better look once I'm back.

…

On Sun, Jul 29, 2018 at 2:23 PM david-cortes ***@***.***> wrote: I've been trying to run this software on an artifically-generated dataset, and I am constantly running out of memory (bad_alloc) even in small datasets. As an example, I generated the following random data in a Python script: import numpy as np, pandas as pdfrom scipy.sparse import coo_matrixfrom sklearn.model_selection import train_test_split nusers = 200 nitems = 300 ntopics = 30 nwords = 250 np.random.seed(1) a=.3 + np.random.gamma(.1, .05) b=.3 + np.random.gamma(.1, .05) c=.3 + np.random.gamma(.1, .05) d=.3 + np.random.gamma(.1, .05) e=.3 + np.random.gamma(.1, .05) f=.5 + np.random.gamma(.1, .05) g=.3 + np.random.gamma(.1, .05) h=.5 + np.random.gamma(.1, .05) np.random.seed(1) Beta = np.random.gamma(a, b, size=(nwords, ntopics)) Theta = np.random.gamma(c, d, size=(nitems, ntopics)) W = np.random.poisson(Theta.dot(Beta.T) + np.random.gamma(1, 1, size=(nitems, nwords)), size=(nitems, nwords)) Eta = np.random.gamma(e, f, size=(nusers, ntopics)) Epsilon = np.random.gamma(g, h, size=(nitems, ntopics)) R = np.random.poisson(Eta.dot(Theta.T+Epsilon.T) + np.random.gamma(1, 1, size=(nusers, nitems)), size=(nusers, nitems)) Rcoo=coo_matrix(R) df = pd.DataFrame({ 'UserId':Rcoo.row, 'ItemId':Rcoo.col, 'Count':Rcoo.data }) df_train, df_test = train_test_split(df, test_size=0.3, random_state=1) df_test, df_val = train_test_split(df_test, test_size=0.33, random_state=2) df_train.sort_values(['UserId', 'ItemId'], inplace=True) df_test.sort_values(['UserId', 'ItemId'], inplace=True) df_val.sort_values(['UserId', 'ItemId'], inplace=True) df_train['Count'] = df_train.Count.values.astype('float32') df_test['Count'] = df_test.Count.values.astype('float32') df_val['Count'] = df_val.Count.values.astype('float32') df_train.to_csv("<dir>/train.tsv", sep='\t', index=False, header=False) df_test.to_csv("<dir>/test.tsv", sep='\t', index=False, header=False) df_val.to_csv("<dir>/validation.tsv", sep='\t', index=False, header=False) pd.DataFrame({"UserId":list(set(list(df_test.UserId.values)))})\ .to_csv("<dir>/test_users.tsv", index=False, header=False) Wcoo = coo_matrix(W) Wdf = pd.DataFrame({ 'ItemId':Wcoo.row, 'WordId':Wcoo.col, 'Count':Wcoo.data })def mix(a, b): nx = len(a) out=str(nx) + " " for i in range(nx): out += str(a[i]) + ":" + str(float(b[i])) + " " return out Wdf.groupby('ItemId').agg(lambda x: tuple(x)).apply(lambda x: mix(x['WordId'], x['Count']), axis=1)\ .to_frame().to_csv("<dir>/mult.dat", index=False, header=False) pd.DataFrame({'col1':np.arange(nwords)}).to_csv("<dir>/vocab.dat", index=False, header=False) Generating files that look as follows: - train.tsv: 0 0 4.0 0 1 6.0 0 5 5.0 0 7 5.0 0 9 2.0 0 10 5.0 - test.tsv: 0 2 1.0 0 4 4.0 0 12 4.0 0 14 3.0 0 16 4.0 - validation.tsv 0 23 5.0 0 30 3.0 0 32 1.0 0 33 2.0 0 46 3.0 - test_users.tsv: 0 1 2 3 4 - vocab.dat: 0 1 2 3 4 5 - mult.dat: 141 0:2.0 1:4.0 2:1.0 3:2.0 5:1.0 6:2.0 9:2.0 11:1.0 15:2.0 16:3.0 17:4.0 19:3.0 21:1.0 22:4.0 23:1.0 24:3.0 26:1.0 27:1.0 29:1.0 32:3.0 33:2.0 34:1.0 35:2.0 36:1.0 39:2.0 41:1.0 42:6.0 44:1.0 45:2.0 47:1.0 48:1.0 53:5.0 54:2.0 57:1.0 63:6.0 65:1.0 66:2.0 67:1.0 68:1.0 69:1.0 72:1.0 73:1.0 76:5.0 78:1.0 79:5.0 80:1.0 83:2.0 84:3.0 86:1.0 88:5.0 89:1.0 90:4.0 92:1.0 93:2.0 94:1.0 96:2.0 98:1.0 100:4.0 107:2.0 108:1.0 109:2.0 112:2.0 113:4.0 116:1.0 119:1.0 120:2.0 124:3.0 125:7.0 129:2.0 130:1.0 132:3.0 136:1.0 137:1.0 138:3.0 139:2.0 140:1.0 143:4.0 144:2.0 145:2.0 146:10.0 148:2.0 149:2.0 150:1.0 152:4.0 155:6.0 156:2.0 157:3.0 159:2.0 161:4.0 162:1.0 163:2.0 170:1.0 171:1.0 173:3.0 174:4.0 175:3.0 176:1.0 177:1.0 180:2.0 183:1.0 185:1.0 186:2.0 187:4.0 189:1.0 190:2.0 194:1.0 196:2.0 197:2.0 198:2.0 199:4.0 200:3.0 202:2.0 204:1.0 205:1.0 206:1.0 208:1.0 209:1.0 210:3.0 212:2.0 214:1.0 217:1.0 218:1.0 219:2.0 220:1.0 221:1.0 223:2.0 226:1.0 227:1.0 228:1.0 231:1.0 232:4.0 233:4.0 235:1.0 236:2.0 238:3.0 239:1.0 242:1.0 243:1.0 246:4.0 248:2.0 249:2.0 156 1:1.0 2:1.0 3:3.0 5:2.0 7:1.0 8:1.0 9:1.0 10:1.0 13:1.0 15:2.0 17:1.0 19:2.0 21:3.0 22:3.0 23:2.0 24:1.0 26:1.0 27:1.0 28:1.0 31:1.0 33:1.0 34:5.0 36:2.0 38:1.0 39:4.0 40:1.0 41:1.0 42:1.0 43:4.0 44:2.0 46:2.0 47:3.0 50:1.0 52:1.0 53:3.0 54:2.0 56:2.0 57:1.0 58:4.0 59:2.0 60:3.0 63:1.0 66:1.0 67:2.0 69:2.0 74:2.0 75:2.0 77:1.0 78:3.0 79:1.0 81:3.0 82:2.0 83:1.0 84:3.0 85:2.0 86:3.0 88:2.0 89:3.0 92:1.0 94:1.0 96:1.0 97:2.0 98:1.0 99:3.0 100:1.0 101:2.0 103:1.0 104:1.0 106:3.0 110:1.0 113:1.0 115:1.0 118:2.0 120:4.0 121:3.0 122:1.0 123:3.0 128:1.0 133:3.0 135:1.0 137:1.0 138:2.0 139:2.0 141:1.0 143:2.0 147:1.0 148:2.0 149:1.0 151:1.0 154:1.0 155:4.0 157:1.0 158:1.0 160:4.0 161:2.0 162:5.0 163:1.0 164:5.0 165:1.0 166:1.0 167:4.0 168:3.0 170:1.0 172:1.0 175:1.0 177:1.0 180:4.0 181:1.0 183:1.0 184:1.0 186:1.0 187:1.0 189:1.0 190:5.0 193:2.0 194:3.0 195:7.0 197:2.0 198:2.0 200:1.0 201:1.0 202:2.0 207:2.0 208:2.0 209:1.0 210:3.0 212:8.0 213:2.0 214:2.0 216:1.0 217:1.0 218:1.0 220:4.0 222:1.0 223:1.0 224:2.0 225:4.0 226:1.0 227:1.0 228:6.0 229:3.0 230:1.0 231:1.0 232:1.0 236:2.0 237:1.0 238:2.0 240:2.0 242:1.0 243:2.0 244:2.0 245:2.0 246:3.0 247:6.0 248:2.0 249:2.0 (tried varying between integers and decimals for the values in this last one, but it didn't make a difference) Which I think seem to fit the description of the files in the main page. However, after trying to run this program on this data (with and without the last two argments): collabtm -dir ~/<dir> -nusers 200 -ndocs 300 -nvocab 250 -k 20 -fixeda -lda-init It starts allocating a lot of memory, until allocating around 8GB, after which it throws bad_alloc and terminates. Am I missing something? — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#1>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABaWSHto9HnQ1oDvWoQZs6n3-P6R2DJMks5uLf2IgaJpZM4VlhDR> .

david-cortes · 2018-07-30T13:04:05Z

Yes, I tried it that way too, but then I get the following error message:

collabtm: matrix.hh:1166: void D2Array<T>::load(std::__cxx11::string, uint32_t, bool) const [with T = double; std::__cxx11::string = std::__cxx11::basic_string<char>; uint32_t = unsigned int]: Assertion `f' failed.
Aborted

lcharlin · 2018-08-07T02:43:54Z

I took a better look. + The dataset you were generating was too small (it created problems when splitting cold start documents). While this is a limitation of the code, I'm not sure it's worth fixing since this is a toy case at best. + The other thing is that I output integers instead of decimal values. df_train['Count'] = df_train.Count.values.astype('float32') df_test['Count'] = df_test.Count.values.astype('float32') df_val['Count'] = df_val.Count.values.astype('float32') df_train['Count'] = df_train.Count.values.astype('int32') df_test['Count'] = df_test.Count.values.astype('int32') df_val['Count'] = df_val.Count.values.astype('int32') + Finally you should remove the -lda-init switch unless you have actually an LDA fit. I hope that helps. Best, Laurent

…

On Mon, Jul 30, 2018 at 9:04 AM david-cortes ***@***.***> wrote: Yes, I tried it that way too, but then I get the following error message: collabtm: matrix.hh:1166: void D2Array<T>::load(std::__cxx11::string, uint32_t, bool) const [with T = double; std::__cxx11::string = std::__cxx11::basic_string<char>; uint32_t = unsigned int]: Assertion `f' failed. Aborted — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

david-cortes · 2018-08-08T05:36:15Z

After trying with larger datasets, it seems to run the inference procedure, and what I guess is computing precision metrics, but it still seems to fail due to the cold-start part at the end:

coldstart local inference and HOL
collabtm: matrix.hh:1166: void D2Array<T>::load(std::__cxx11::string, uint32_t, bool) const [with T = double; std::__cxx11::string = std::__cxx11::basic_string<char>; uint32_t = unsigned int]: Assertion `f' failed.
Aborted

lcharlin · 2018-08-09T20:44:14Z

By default the code needs access to the lda fits for coldstart inference. The log file infer.log (in the experiment's folder) has details about what it's looking for exactly. Alternatively, you could try commenting out line 2227 in collabtm.cc. Then it should simply re-use the topics learned during the run using the other documents. Hope that helps!

…

On Wed, Aug 8, 2018 at 1:36 AM david-cortes ***@***.***> wrote: After trying with larger datasets, it seems to run the inference procedure, and what I guess is computing precision metrics, but it still seems to fail due to the cold-start part at the end: coldstart local inference and HOL collabtm: matrix.hh:1166: void D2Array<T>::load(std::__cxx11::string, uint32_t, bool) const [with T = double; std::__cxx11::string = std::__cxx11::basic_string<char>; uint32_t = unsigned int]: Assertion `f' failed. Aborted — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bad_alloc with even the smallest datasets #1

Bad_alloc with even the smallest datasets #1

david-cortes commented Jul 29, 2018

lcharlin commented Jul 30, 2018 via email

david-cortes commented Jul 30, 2018

lcharlin commented Aug 7, 2018 via email

david-cortes commented Aug 8, 2018

lcharlin commented Aug 9, 2018 via email

Bad_alloc with even the smallest datasets #1

Bad_alloc with even the smallest datasets #1

Comments

david-cortes commented Jul 29, 2018

lcharlin commented Jul 30, 2018 via email

david-cortes commented Jul 30, 2018

lcharlin commented Aug 7, 2018 via email

david-cortes commented Aug 8, 2018

lcharlin commented Aug 9, 2018 via email