Fitting issues with standard configuration #61

polakowo · 2018-08-13T23:47:53Z

Hello,

first of all thank you for providing us with such a comprehensive package. I'm a first time user and cannot figure out how to produce meaningful results despite reading your paper.

I have the following distribution: Counter({1: 233, 2: 33, 3: 13, 4: 10, 5: 2, 6: 3, 7: 2, 8: 2, 9: 1, 11: 1, 15: 1, 18: 1})

When trying to feed the powerlaw.Fit method with my dataset I get warning "RuntimeWarning: invalid value encountered in true_divide (Theoretical_CDF * (1 - Theoretical_CDF))", which can be ignored as I discovered in older threads.

The issue is poor scaling parameters as an output: alpha = 3.81, xmin = 6. When plotting the hypothesized power-law distribution, you can clearly see the poor fit.

On the other hand when using the code provided by Clauset (http://www.santafe.edu/~aaronc/powerlaws/), a better fit is produced with alpha = 2.59, xmin = 1.

I tried a couple of different configurations, but the output stays the same, wondering what's going wrong?

Best regards

jeffalstott · 2018-08-14T00:38:30Z

Thanks Oleg, I'm unfamiliar with Counters. What happens if you pass the data as an array? Also, what happens if you mandate xmin=1?

…

On Mon, Aug 13, 2018 at 7:47 PM Oleg Polakow ***@***.***> wrote: Hello, first of all thank you for providing us with such a comprehensive package. I'm a first time user and cannot figure out how to produce meaningful results despite reading your paper. I have the following distribution: *Counter({1: 233, 2: 33, 3: 13, 4: 10, 5: 2, 6: 3, 7: 2, 8: 2, 9: 1, 11: 1, 15: 1, 18: 1})* When trying to feed the powerlaw.Fit method with my dataset I get warning "*RuntimeWarning: invalid value encountered in true_divide (Theoretical_CDF * (1 - Theoretical_CDF))*", which can be ignored as I discovered in older threads. The issue is poor scaling parameters as an output: *alpha = 3.81, xmin = 6*. When plotting the hypothesized power-law distribution, you can clearly see the poor fit. [image: unknown-1] <https://user-images.githubusercontent.com/2664108/44064066-92927b2c-9f63-11e8-8651-9063dabe7560.png> On the other hand when using the code provided by Clauset ( http://www.santafe.edu/~aaronc/powerlaws/), a better fit is produced with *alpha = 2.59, xmin = 1*. [image: unknown] <https://user-images.githubusercontent.com/2664108/44064073-9b1d7864-9f63-11e8-82f9-df286ce29022.png> I tried a couple of different configurations, but the output stays the same, wondering what's going wrong? Best regards — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#61>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA6_r1Fy5Feuqpwt2q-xjDbOYoZciUEpks5uQhApgaJpZM4V7k3r> .

polakowo · 2018-08-14T08:13:52Z

Counter is for demo, usually I pass a flat array of integers from 1 to 18. I also tried your suggestion with xmin but getting alpha=4.87

Could it be that I'm using Python 3.6? I had to upgrade the code provided by Clauset to work with my environment

jeffalstott · 2018-08-14T23:10:41Z

Interesting. What happens if you use any of the example datasets from the paper?

…

On Tue, Aug 14, 2018 at 4:13 AM Oleg Polakow ***@***.***> wrote: Counter is for demo, usually I pass a flat array of integers from 1 to 18. I also tried your suggestion with xmin but getting alpha=4.87 [image: unknown-2] <https://user-images.githubusercontent.com/2664108/44079698-0ae587cc-9faa-11e8-9169-22f126c8b192.png> Could it be that I'm using Python 3.6? I had to upgrade the code provided by Clauset to work with my environment — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#61 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA6_r5P6NhxLnBuRetNBI7IZzszYv3PHks5uQobBgaJpZM4V7k3r> .

polakowo · 2018-08-14T23:41:28Z

I downloaded the Manufscript_code.ipynb file and rerun all cells - output the same as yours.

Is the algorithm in powerlab different from http://tuvalu.santafe.edu/~aaronc/powerlaws/plfit.py?

jeffalstott · 2018-08-14T23:47:35Z

powerlaw and plfit and (hopefully) other implementations all implement the same algorithm has in the Clauset et al. paper. The present behavior is certainly some non-core algorithm related bug, possibly due to something breaking with Python 3.6. The fact that the demos from manuscript_code.ipynb still work is heartening. What happens when you pass floats instead of ints? And just for my situational awareness, are you running the latest powerlaw version from PyPI?

…

On Tue, Aug 14, 2018 at 7:41 PM Oleg Polakow ***@***.***> wrote: I downloaded the Manufscript_code.ipynb file and rerun all cells - output the same as yours. Is the algorithm in powerlab different from http://tuvalu.santafe.edu/~aaronc/powerlaws/plfit.py? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#61 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA6_rwKXN6DTF5uRwNCzZbK2Qm9bEUmAks5uQ2AogaJpZM4V7k3r> .

polakowo · 2018-08-14T23:58:51Z

Hmm in Jupyter the 1.4.3 is displayed while pip tells that the version 1.4.4 is installed, am I using the older version? Seems the problem is with conda...

jeffalstott · 2018-08-15T00:03:50Z

Conda installs alongside pip installs always seems to cause heartache. Try the latest version, then see what happens if you pass floats in an array.

…

On Tue, Aug 14, 2018 at 7:58 PM Oleg Polakow ***@***.***> wrote: Hmm in Jupyter the 1.4.3 is displayed while pip tells that the version 1.4.4 is installed, am I using the older version? Seems the problem is with conda... — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#61 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA6_rxZlYLlMkz-DwzoSIX3ObDqY1q2-ks5uQ2Q8gaJpZM4V7k3r> .

polakowo · 2018-08-15T01:05:19Z

After some refactoring: the version in powerlaw.py was not updated to 1.4.4 that's why I'm seeing 1.4.3 :p Now I'm sure its the latest version, floats make no difference.

jeffalstott · 2018-08-15T01:11:42Z

Oops! I better fix that. I'll hold off on pushing an update until we've figured out this issue, though. Can you post minimum working examples of getting accurate results on one of the demo datasets and inaccurate results on your dataset?

…

On Tue, Aug 14, 2018 at 9:05 PM Oleg Polakow ***@***.***> wrote: After some refactoring: the *version* in powerlaw.py was not updated to 1.4.4 that's why I'm seeing 1.4.3 :p Now I'm sure its the latest version, floats make no difference. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#61 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA6_r7eZ0zpfTYru9EwjT5th2dNwcXa9ks5uQ3PPgaJpZM4V7k3r> .

polakowo · 2018-08-15T01:59:00Z

I've created a pdf with some simple tests.

tests.pdf

jeffalstott · 2018-08-15T02:03:34Z

Are you not using the discrete=True flag?

…

On Tue, Aug 14, 2018 at 9:59 PM Oleg Polakow ***@***.***> wrote: I've created a pdf with some simple tests. tests.pdf <https://github.com/jeffalstott/powerlaw/files/2288965/tests.pdf> — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#61 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA6_r5lacX3dBrNYzKthilySfoISmeaQks5uQ4BkgaJpZM4V7k3r> .

polakowo · 2018-08-15T02:06:21Z

Tried, output the same
Edit: not the same but similar

polakowo · 2018-08-15T02:19:50Z

I'll go grab some sleep, we have 4 am :) Please keep me updated on your progress.

jeffalstott · 2018-08-15T12:13:17Z

It cannot be the same output; the discrete and continuous versions are different algorithms. This is described in the powerlaw and Clauset et al. papers. plfit.py assumes inputs that look like integers are discrete, whereas powerlaw does not. Can you show output with using the discrete=True option for fitting?

…

On Tue, Aug 14, 2018 at 10:19 PM Oleg Polakow ***@***.***> wrote: I'll go grab some sleep, we have 4 am :) Please keep me updated on your progress. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#61 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA6_r62c6SOg8ASxa-NM7jDAIu9nIl-Jks5uQ4VHgaJpZM4V7k3r> .

polakowo · 2018-08-15T12:16:11Z

powerlaw.Fit([500,150,90,81,75,75,70,65,60,58,49,47,40]).alpha
3.1151851915645903

powerlaw.Fit([500,150,90,81,75,75,70,65,60,58,49,47,40], discrete=True).alpha
3.0771455810069

plfit.plfit([500,150,90,81,75,75,70,65,60,58,49,47,40])[0]
2.71

polakowo · 2018-08-15T12:19:43Z

xmin = 58.0, xmin = 58.0, and xmin = 47 accordingly

polakowo · 2018-08-15T12:22:27Z

jeffalstott · 2018-08-15T12:23:56Z

For some realistic data (e.g. your original data, which I hope is large), what does the visualized fit look like with discrete=True? What happens if the length of the data is 100 or greater, so that plfit doesn't do its finite-size adjustment? Very small datasets, such as the one just given, are problematic for a bunch of reasons.

…

On Wed, Aug 15, 2018 at 8:19 AM Oleg Polakow ***@***.***> wrote: xmin = 58.0, xmin = 58.0, and xmin = 47 accordingly — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#61 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA6_r6Kth1JzuIBqEnRFGdSGRhTeYrEHks5uRBHfgaJpZM4V7k3r> .

polakowo · 2018-08-15T12:27:47Z

My data has length of ~300

plfit.py gives xmin=1, alpha=2.59

powerlaw.py with discrete=True gives xmin=2, alpha=2.38

polakowo · 2018-08-15T12:30:26Z

The same applies to data which is 10x larger

jeffalstott · 2018-08-15T12:53:12Z

Using data on the scale of 3,000 data points, what is the result if you dictate to both that they use the same xmin? And that's using the discrete=True flag. I would start with a dataset that you know is giving the correct answer (one of the discrete demos) and slowly modify to look like a problematic dataset, then observe what the difference is. At one point many years ago I knew all the internals of plfit.py, and while I do not remember all of it now, I recall it making some decisions that powerlaw purposefully does not replicate (e.g. assuming data that "looks" discrete is discrete, applying a finite-size "correction", etc.). Now that the discrete=True flag is in use for this example, we likely have moved very far.

…

On Wed, Aug 15, 2018 at 8:30 AM Oleg Polakow ***@***.***> wrote: The same applies to data which is 10x larger — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#61 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA6_rwrrPmROWMwJf-oenDOEuBeezHfVks5uRBRigaJpZM4V7k3r> .

polakowo · 2018-08-15T14:20:01Z

For floats are results are the same. With discrete=True and integer data powerlaw.py produces slightly lower alpha but the same xmin.

If I multiply each element in dataset with 100, results are the same.
If I add to each element 100, results vary greatly.

There seem to be some issue with the scale.

jeffalstott · 2018-08-16T11:45:21Z

I can't speak to what plfit.py is doing, but adding a translation by simply adding a factor k does/should change the probabilities. Look at Clauset et al. eq. 2.2. Take alpha = 2, xmin=1 and x = 2. (1/1.5)(2.5/1.5)^-2 = .24 Add a translation of 1, thereby increasing xmin and x by 1 (1/2.5)(3.5/2.5)^-2 = .204... It sounds like powerlaw is giving expected results on demo data, on synthetically generated float data, and synthetically generated discrete data (as long as the discrete flag is turned on). So it sounds like there's as yet no bug here. Again, we can't speak to what plfit.py is doing, though it does appear to implement a "finite-size correction" for small data. powerlaw doesn't implement this, because its author thought it very philosophically dubious; if you don't have lots of data, you cannot assess if the structure of that data does/does not follow a power law over several orders of magnitude (and if you don't have several orders of magnitude, you shouldn't be trying to justify it's a power law). With that said, if someone wanted to implement optional finite-size corrections in powerlaw and submit it as a pull request, they'd be taken on board.

…

On Wed, Aug 15, 2018 at 10:20 AM Oleg Polakow ***@***.***> wrote: For floats are results are the same. With discrete=True and integer data powerlaw.py produces slightly lower alpha but the same xmin. If I multiply each element in dataset with 100, results are the same. If I add to each element 100, results vary greatly. There seem to be some issue with the scale. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#61 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA6_r-zAELs33vfXCJ7peqMqeS7GjTdEks5uRC42gaJpZM4V7k3r> .

rhjohnstone · 2018-09-21T08:02:47Z

I guess I'm having similar problems to @polakowo. I get a very poor fit just using the most basic example in the readme (result is very similar for both discrete=True and discrete=False, as well as recording the data as int or float:

# Number of occurrences, starting from 1
counts = np.array("534602 192300 61903 25675 12380 6472 3727 2082 1441 908 650"
                " 488 341 239 182 123 87 76 46 48 43 27 24 16 11 13 14 7 8 9"
                " 7 7 4 3 3 1 0 2 0 1 1 0 0 0 0 1".split()).astype(int)
data = []
for i, count in enumerate(counts):
    data += [i+1]*count
data = np.array(data)

powerlaw.plot_pdf(data, label="Data as PDF")

fit = powerlaw.Fit(data, discrete=True)
fit.plot_pdf(label="Fitted PDF")
plt.legend(loc=3, fontsize=14);

which outputs

Am I just plotting the results incorrectly, or is something deeper going on?

Edit: I see now that the "fitted" plot is just using data points >= fit.power_law.xmin, so it's just plotting the data but re-normalized? In which case, is there a simple function to just plot the fitted power curve?

Edit 2: I should be using fit.power_law.plot_pdf(label="Fitted PDF"), and not fit.plot_pdf(label="Fitted PDF"). This is the updated version:

fit = powerlaw.Fit(data, discrete=True)
powerlaw.plot_pdf(data[data>=fit.power_law.xmin], label="Data as PDF")
print(fit.power_law.alpha)
print(fit.power_law.xmin)
R, p = fit.distribution_compare('power_law', 'lognormal')
fit.power_law.plot_pdf(label="Fitted PDF", ls=":")
plt.legend(loc=3, fontsize=14);

which outputs

which I guess is OK, but I'm surprised at how large fit.power_law.xmin is. Just inspecting the original data plot, the line appears straight enough down to 2 or 3. Is there an obvious reason (that I can fix) why it's choosing such a high value?

jeffalstott · 2018-09-22T11:37:04Z

@rhjohnstone Thanks for the edits; it saved me typing! :)

To your question, xmin is determined by finding the value that minimizes the KS distance between the empirical distribution and a perfect power law. Take a look at the papers Clauset et al. or Alstott et al. for more info.

The larger issue is that the data is almost certainly not a power law, and if it is it doesn't have the interesting properties we look for in a power law. That data shows p(x) dropping ~6 orders of magnitude over ~1.5 OOM of x. That would be an alpha of ~4, which is very steep. The interesting properties of power laws happen at smaller/shallower alphas (mean undefined: 2 ; std undefined: 3). Also, the data is only over 1.5 OOM, so there's likely not enough data to differentiate it from an exponential. If you use 'distribution_compare' you'll probably find the power law fit is less likely than the exponential.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fitting issues with standard configuration #61

Fitting issues with standard configuration #61

polakowo commented Aug 13, 2018

jeffalstott commented Aug 14, 2018 via email

polakowo commented Aug 14, 2018

jeffalstott commented Aug 14, 2018 via email

polakowo commented Aug 14, 2018

jeffalstott commented Aug 14, 2018 via email

polakowo commented Aug 14, 2018

jeffalstott commented Aug 15, 2018 via email

polakowo commented Aug 15, 2018

jeffalstott commented Aug 15, 2018 via email

polakowo commented Aug 15, 2018

jeffalstott commented Aug 15, 2018 via email

polakowo commented Aug 15, 2018 •

edited

Loading

polakowo commented Aug 15, 2018

jeffalstott commented Aug 15, 2018 via email

polakowo commented Aug 15, 2018

polakowo commented Aug 15, 2018

polakowo commented Aug 15, 2018

jeffalstott commented Aug 15, 2018 via email

polakowo commented Aug 15, 2018

polakowo commented Aug 15, 2018

jeffalstott commented Aug 15, 2018 via email

polakowo commented Aug 15, 2018

jeffalstott commented Aug 16, 2018 via email

rhjohnstone commented Sep 21, 2018 •

edited

Loading

jeffalstott commented Sep 22, 2018

Fitting issues with standard configuration #61

Fitting issues with standard configuration #61

Comments

polakowo commented Aug 13, 2018

jeffalstott commented Aug 14, 2018 via email

polakowo commented Aug 14, 2018

jeffalstott commented Aug 14, 2018 via email

polakowo commented Aug 14, 2018

jeffalstott commented Aug 14, 2018 via email

polakowo commented Aug 14, 2018

jeffalstott commented Aug 15, 2018 via email

polakowo commented Aug 15, 2018

jeffalstott commented Aug 15, 2018 via email

polakowo commented Aug 15, 2018

jeffalstott commented Aug 15, 2018 via email

polakowo commented Aug 15, 2018 • edited Loading

polakowo commented Aug 15, 2018

jeffalstott commented Aug 15, 2018 via email

polakowo commented Aug 15, 2018

polakowo commented Aug 15, 2018

polakowo commented Aug 15, 2018

jeffalstott commented Aug 15, 2018 via email

polakowo commented Aug 15, 2018

polakowo commented Aug 15, 2018

jeffalstott commented Aug 15, 2018 via email

polakowo commented Aug 15, 2018

jeffalstott commented Aug 16, 2018 via email

rhjohnstone commented Sep 21, 2018 • edited Loading

jeffalstott commented Sep 22, 2018

polakowo commented Aug 15, 2018 •

edited

Loading

rhjohnstone commented Sep 21, 2018 •

edited

Loading