Failed to reproduce the results with sample data 'input.txt' provided #17

Open
AnthonyruihChen opened this issue Aug 18, 2021 · 0 comments


Hi xychang,

Thank you for sharing your great work! I would greatly appreciate it if you could help me resolve the issue below.

I first tried the CLI and was able to generate 'results.json' and 'vis.json'. However, I could not open http://localhost:8000/multi_color.html?json=vis.json, so I decided to give the Python interface a try.

I am using the code and parameter configuration below to reproduce 'results.json' and 'vis.json'.

import recursiveHierarchicalClustering as rhc
import recursiveHierarchicalClusteringFast as rhcFast
data = rhc.getSidNgramMap(inputPath)
treeData = rhcFast.run(inputPath, data, outPath)

environment: Jupyter Notebook

inputPath: I placed your input.txt file in a directory and set inputPath = '/home/chenruihao/test_clustering/input.txt'

outPath: I did not find a description of outPath, but I found outputPath, which is described as "The directory to place all temporary files as well as the final result." I assume outPath and outputPath both refer to the directory that stores the output files, so I set outPath = '/home/chenruihao/test_clustering/output/'

I got the error below when I ran the code above:

/home/chenruihao/test_clustering/recursiveHierarchicalClustering.py:247: FutureWarning: `rcond` parameter will change to the default of machine precision times ``max(M, N)`` where M and N are the input matrix dimensions.
To use the future default and silence this warning we advise to pass `rcond=None`, to keep using the old, explicitly pass `rcond=-1`.
  result = np.linalg.lstsq(A, y)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-57-3760f95de317> in <module>
----> 1 treeData = rhcFast.run(inputPath, data, outPath)

~/test_clustering/recursiveHierarchicalClusteringFast.py in run(ngramPath, sid_seq, outPath)
    416 
    417     hc = HCClustering(
--> 418         matrix, sid_seq, outPath, [], idxToSid,
    419         sizeThreshold=0.05 * len(sid_seq), idfMap=idfMap)
    420     result = hc.runDiana()

~/test_clustering/recursiveHierarchicalClusteringFast.py in runDiana(self)
    337                     matrix = calculateDistance.partialMatrix(
    338                         sids,
--> 339                         rhc.excludeFeatures(rhc.getIdf(self.sid_seq, sids),
    340                                             newExclusions),
    341                         ngramPath,

NameError: name 'ngramPath' is not defined

Q1: How can I fix this error?
Attempted fix: 'ngramPath' is referenced in 'recursiveHierarchicalClusteringFast.py', so I hard-coded it as follows:

  1. The ngramPath parameter of the run function appears to be the same as sys.argv[1], and by definition ngramPath is the path to the computed pattern dataset. So I hard-coded ngramPath = '/home/chenruihao/test_clustering/input.txt', the same as inputPath, but I still got the error above. I would love to hear your thoughts.
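
One possible reading of the traceback, offered as an assumption rather than a confirmed diagnosis: ngramPath is a parameter of run(), so it is local to run(), while runDiana() refers to the bare name ngramPath and Python therefore looks it up in the module's globals. The sketch below reproduces that scoping behavior with a toy module built in memory. The module name toy_rhc_fast and its stripped-down contents are hypothetical stand-ins, not the real recursiveHierarchicalClusteringFast.py.

```python
import types

# Hypothetical stand-in for the real module; only the scoping pattern
# suggested by the traceback is reproduced here.
source = '''
class HCClustering:
    def runDiana(self):
        # ngramPath is not a local name or an attribute here, so Python
        # looks it up in this module's globals at call time.
        return ngramPath

def run(ngramPath, sid_seq, outPath):
    # The parameter ngramPath is local to run(); runDiana() cannot see it.
    return HCClustering().runDiana()
'''

toy = types.ModuleType("toy_rhc_fast")
exec(source, toy.__dict__)


def call_run(path):
    """Call toy.run() and report a NameError as text instead of raising."""
    try:
        return toy.run(path, {}, "out/")
    except NameError as err:
        return "NameError: " + str(err)


before = call_run("input.txt")   # the module has no global ngramPath yet
toy.ngramPath = "input.txt"      # define the module-level global from outside
after = call_run("input.txt")    # the global lookup now succeeds

print(before)
print(after)
```

If the real module behaves the same way, assigning the attribute on the imported module (the analogue of toy.ngramPath = ... above) before calling run() might avoid the NameError. Separately, when a .py file is edited after it has been imported in a Jupyter kernel, the kernel keeps using the old version until importlib.reload() is called on the module or the kernel is restarted, which could explain why a hard-coded value appeared to have no effect.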

Q2: I also want to understand which user_ids were clustered, their cluster membership, and their corresponding action-gap-action tokens, similar to the issue discussed in another thread. Would it be possible to answer my question, as well as the question in that thread, using only the result.json file rather than modifying the code?

My understanding is that, in result.json, for each level of cluster:

  • key = 1 stores the user_ids that were clustered at that level;
  • key = 2, exclusions, stores the action-gap-action tokens that are members of the cluster.
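
One way to check this understanding without modifying the clustering code might be to print the shape of the JSON file level by level. The sketch below is a generic walker over any nested JSON value. The toy value it runs on is hypothetical and is not claimed to match the real layout of result.json.

```python
import json


def summarize(node, depth=0, max_depth=3):
    """Return indented lines describing the shape of a nested JSON value."""
    pad = "  " * depth
    if isinstance(node, dict):
        lines = [f"{pad}dict with keys {sorted(node)}"]
        children = list(node.values())
    elif isinstance(node, list):
        lines = [f"{pad}list of {len(node)} items"]
        children = node
    else:
        # Scalars (str, int, float, bool, None) are reported directly.
        return [f"{pad}{type(node).__name__}: {node!r}"]
    if depth < max_depth:
        for child in children:
            lines.extend(summarize(child, depth + 1, max_depth))
    return lines


# Hypothetical toy value for illustration only, NOT the real result.json.
toy = json.loads('["t", ["u1", "u2"], {"exclusions": ["A(1)B"]}]')
report = summarize(toy)
print("\n".join(report))
```

To inspect the real file, the toy value would be replaced by the object returned by json.load() on the opened result.json, with max_depth kept small at first so the output stays readable.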

Thanks!
Anthony
