Fitting issues with standard configuration #61

Open
polakowo opened this issue Aug 13, 2018 · 25 comments

@polakowo

Hello,

first of all, thank you for providing us with such a comprehensive package. I'm a first-time user and cannot figure out how to produce meaningful results despite reading your paper.

I have the following distribution: Counter({1: 233, 2: 33, 3: 13, 4: 10, 5: 2, 6: 3, 7: 2, 8: 2, 9: 1, 11: 1, 15: 1, 18: 1})
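
For reference, a count table like this one can be expanded into one value per observation, which is the form a fitting routine such as powerlaw.Fit is given. This is only a minimal sketch; the variable names `counts` and `data` are illustrative.

```python
from collections import Counter

# Histogram from the post: value -> number of occurrences
counts = Counter({1: 233, 2: 33, 3: 13, 4: 10, 5: 2, 6: 3,
                  7: 2, 8: 2, 9: 1, 11: 1, 15: 1, 18: 1})

# Counter.elements() repeats each value as many times as its count,
# giving a flat list with one entry per observation.
data = sorted(counts.elements())

print(len(data), data[0], data[-1])
```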

When I pass my dataset to powerlaw.Fit, I get the warning "RuntimeWarning: invalid value encountered in true_divide (Theoretical_CDF * (1 - Theoretical_CDF))", which can be ignored, as I discovered in older threads.

The issue is that the fitted scaling parameters are poor: alpha = 3.81, xmin = 6. When plotting the hypothesized power-law distribution, you can clearly see the poor fit.
[image]

On the other hand, when using the code provided by Clauset (http://www.santafe.edu/~aaronc/powerlaws/), a better fit is produced with alpha = 2.59, xmin = 1.
[image]

I tried a couple of different configurations, but the output stays the same. What could be going wrong?

Best regards

@jeffalstott
Owner

jeffalstott commented Aug 14, 2018 via email

@polakowo
Author

The Counter is just for demonstration; I usually pass a flat array of integers from 1 to 18. I also tried your suggestion with xmin, but I'm getting alpha=4.87.
[image]

Could it be because I'm using Python 3.6? I had to update the code provided by Clauset to work in my environment.

@jeffalstott
Owner

jeffalstott commented Aug 14, 2018 via email

@polakowo
Author

I downloaded the Manuscript_code.ipynb file and reran all cells; the output is the same as yours.

Is the algorithm in powerlaw different from http://tuvalu.santafe.edu/~aaronc/powerlaws/plfit.py?

@jeffalstott
Owner

jeffalstott commented Aug 14, 2018 via email

@polakowo
Author

Hmm, Jupyter displays 1.4.3 while pip says that version 1.4.4 is installed. Am I using the older version? It seems the problem is with conda...

@jeffalstott
Owner

jeffalstott commented Aug 15, 2018 via email

@polakowo
Author

After some refactoring: the version string in powerlaw.py was not updated to 1.4.4, which is why I'm seeing 1.4.3 :p Now I'm sure it's the latest version, and using floats makes no difference.
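
A mismatch like this can be checked from inside the running kernel by comparing the version string the module reports with the version recorded in the installed package metadata, along with the file that was actually imported. The sketch below assumes Python 3.8 or newer (for `importlib.metadata`); `describe_install` is a hypothetical helper name, not part of any package.

```python
import importlib
from importlib import metadata


def describe_install(module_name, dist_name=None):
    """Report what a module says about itself versus what the installer recorded."""
    module = importlib.import_module(module_name)
    try:
        # Version recorded by pip/conda in the package metadata
        installed_version = metadata.version(dist_name or module_name)
    except metadata.PackageNotFoundError:
        installed_version = None
    return {
        # Version string hard-coded in the module, if it defines one
        "module_version": getattr(module, "__version__", None),
        "installed_version": installed_version,
        # File that was actually imported by this interpreter
        "path": getattr(module, "__file__", None),
    }
```

Calling `describe_install("powerlaw")` in the same Jupyter kernel would show whether the two version strings disagree and which copy of the file is being loaded.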

@jeffalstott
Owner

jeffalstott commented Aug 15, 2018 via email

@polakowo
Author

I've created a pdf with some simple tests.

tests.pdf

@jeffalstott
Owner

jeffalstott commented Aug 15, 2018 via email

@polakowo
Author

polakowo commented Aug 15, 2018

Tried it; the output is the same.
Edit: not the same, but similar.

@polakowo
Author

I'll go grab some sleep, it's 4 am here :) Please keep me updated on your progress.

@jeffalstott
Owner

jeffalstott commented Aug 15, 2018 via email

@polakowo
Author

powerlaw.Fit([500,150,90,81,75,75,70,65,60,58,49,47,40]).alpha
3.1151851915645903

powerlaw.Fit([500,150,90,81,75,75,70,65,60,58,49,47,40], discrete=True).alpha
3.0771455810069

plfit.plfit([500,150,90,81,75,75,70,65,60,58,49,47,40])[0]
2.71

@polakowo
Author

xmin = 58.0, xmin = 58.0, and xmin = 47, respectively

@polakowo
Author

[image]
[image]
[image]

@jeffalstott
Owner

jeffalstott commented Aug 15, 2018 via email

@polakowo
Author

My data has a length of ~300.

plfit.py gives xmin=1, alpha=2.59
[image]

powerlaw.py with discrete=True gives xmin=2, alpha=2.38
[image]

@polakowo
Author

The same applies to data which is 10x larger

@jeffalstott
Owner

jeffalstott commented Aug 15, 2018 via email

@polakowo
Author

For floats, the results are the same. With discrete=True and integer data, powerlaw.py produces a slightly lower alpha but the same xmin.

If I multiply each element in the dataset by 100, the results are the same.
If I add 100 to each element, the results vary greatly.

There seems to be some issue with the scale.
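
One possible reading of this observation is that it reflects a general property of power laws, not a defect in either tool: a power law is scale-free, so multiplying the data by a constant leaves the exponent unchanged, whereas adding a constant produces data that is no longer a pure power law. The sketch below illustrates this with the standard continuous maximum-likelihood estimator, alpha = 1 + n / sum(ln(x / xmin)), applied to the small sample posted earlier in the thread. `mle_alpha` is a hypothetical helper, and fixing xmin at the sample minimum is a simplifying assumption.

```python
import math


def mle_alpha(data, xmin):
    """Continuous power-law MLE for alpha, using only values >= xmin."""
    tail = [x for x in data if x >= xmin]
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)


data = [500, 150, 90, 81, 75, 75, 70, 65, 60, 58, 49, 47, 40]

# xmin is scaled or shifted together with the data so the same points are used
base = mle_alpha(data, min(data))
scaled = mle_alpha([100 * x for x in data], 100 * min(data))
shifted = mle_alpha([x + 100 for x in data], min(data) + 100)

print(f"original: {base:.3f}  times 100: {scaled:.3f}  plus 100: {shifted:.3f}")
```

The estimate depends only on the ratios x / xmin, which is why rescaling has no effect while shifting changes every ratio.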

@jeffalstott
Owner

jeffalstott commented Aug 16, 2018 via email

@rhjohnstone

rhjohnstone commented Sep 21, 2018

I guess I'm having similar problems to @polakowo. I get a very poor fit just using the most basic example in the readme (the result is very similar for both discrete=True and discrete=False, and whether the data is stored as int or float):

import matplotlib.pyplot as plt
import numpy as np
import powerlaw

# Number of occurrences of each value, starting from 1
counts = np.array("534602 192300 61903 25675 12380 6472 3727 2082 1441 908 650"
                " 488 341 239 182 123 87 76 46 48 43 27 24 16 11 13 14 7 8 9"
                " 7 7 4 3 3 1 0 2 0 1 1 0 0 0 0 1".split()).astype(int)
# Expand the counts into one entry per observation (value i+1 appears counts[i] times)
data = np.repeat(np.arange(1, len(counts) + 1), counts)

# Empirical PDF of the raw data
powerlaw.plot_pdf(data, label="Data as PDF")

fit = powerlaw.Fit(data, discrete=True)
fit.plot_pdf(label="Fitted PDF")
plt.legend(loc=3, fontsize=14);

which outputs
[image]

Am I just plotting the results incorrectly, or is something deeper going on?

Edit: I see now that the "fitted" plot is just using data points >= fit.power_law.xmin, so it's just plotting the data but re-normalized? In which case, is there a simple function to just plot the fitted power curve?

Edit 2: I should be using fit.power_law.plot_pdf(label="Fitted PDF"), and not fit.plot_pdf(label="Fitted PDF"). This is the updated version:

fit = powerlaw.Fit(data, discrete=True)
powerlaw.plot_pdf(data[data>=fit.power_law.xmin], label="Data as PDF")
print(fit.power_law.alpha)
print(fit.power_law.xmin)
R, p = fit.distribution_compare('power_law', 'lognormal')
fit.power_law.plot_pdf(label="Fitted PDF", ls=":")
plt.legend(loc=3, fontsize=14);

which outputs
[image]
which I guess is OK, but I'm surprised at how large fit.power_law.xmin is. Just inspecting the original data plot, the line appears straight enough down to 2 or 3. Is there an obvious reason (that I can fix) why it's choosing such a high value?

@jeffalstott
Owner

@rhjohnstone Thanks for the edits; it saved me typing! :)

To your question, xmin is determined by finding the value that minimizes the KS distance between the empirical distribution and a perfect power law. Take a look at the papers Clauset et al. or Alstott et al. for more info.
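
As a rough illustration of that selection rule, and not the powerlaw package's actual implementation, the sketch below tries each observed value as a candidate xmin, fits alpha to the tail by maximum likelihood, and keeps the candidate with the smallest KS distance. It assumes the continuous case only; `fit_alpha`, `ks_distance`, and `choose_xmin` are hypothetical helper names.

```python
import math


def fit_alpha(tail, xmin):
    """Continuous MLE: alpha = 1 + n / sum(ln(x / xmin))."""
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)


def ks_distance(tail, xmin, alpha):
    """Largest gap between the empirical CDF of the tail and the power-law CDF."""
    tail = sorted(tail)
    n = len(tail)
    distance = 0.0
    for i, x in enumerate(tail):
        model_cdf = 1.0 - (x / xmin) ** (1.0 - alpha)
        # Compare against the empirical step just below and just above x
        distance = max(distance,
                       abs(model_cdf - i / n),
                       abs(model_cdf - (i + 1) / n))
    return distance


def choose_xmin(data):
    """Return (KS distance, xmin, alpha) for the best candidate xmin."""
    best = None
    # Skip the largest value: a tail with no point above xmin cannot be fitted
    for xmin in sorted(set(data))[:-1]:
        tail = [x for x in data if x >= xmin]
        alpha = fit_alpha(tail, xmin)
        distance = ks_distance(tail, xmin, alpha)
        if best is None or distance < best[0]:
            best = (distance, xmin, alpha)
    return best


# Sample posted earlier in this thread
print(choose_xmin([500, 150, 90, 81, 75, 75, 70, 65, 60, 58, 49, 47, 40]))
```

Because each candidate discards everything below it, a high xmin simply means the tail above it looked most like a power law by this criterion.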

The larger issue is that the data is almost certainly not a power law, and if it is, it doesn't have the interesting properties we look for in a power law. That data shows p(x) dropping ~6 orders of magnitude (OOM) over ~1.5 OOM of x. That would be an alpha of ~4, which is very steep. The interesting properties of power laws happen at smaller/shallower alphas (the mean is undefined for alpha <= 2, and the standard deviation is undefined for alpha <= 3). Also, the data only spans 1.5 OOM, so there's likely not enough data to differentiate it from an exponential. If you use 'distribution_compare' you'll probably find the power law fit is less likely than the exponential.
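
The arithmetic behind the steepness estimate and the moment thresholds can be written out as follows, assuming a continuous power law $p(x) = C x^{-\alpha}$ for $x \ge x_{\min}$ (the normalization constant $C$ is notation introduced here):

```latex
% Slope on log-log axes: p(x) falls ~6 orders of magnitude over ~1.5 orders in x
\alpha \approx -\frac{\Delta \log_{10} p(x)}{\Delta \log_{10} x}
       \approx -\frac{-6}{1.5} = 4

% The m-th moment converges only if the integrand decays faster than 1/x
\langle x^m \rangle = \int_{x_{\min}}^{\infty} x^m \, C x^{-\alpha} \, dx < \infty
\iff \alpha > m + 1
```

Setting $m = 1$ gives a finite mean only for $\alpha > 2$, and $m = 2$ gives a finite variance only for $\alpha > 3$, so at $\alpha \approx 4$ both exist.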
