Fitting issues with standard configuration #61
Thanks Oleg,
I'm unfamiliar with Counters. What happens if you pass the data as an array? Also, what happens if you mandate xmin=1?
…On Mon, Aug 13, 2018 at 7:47 PM Oleg Polakow wrote:
Hello,
first of all, thank you for providing us with such a comprehensive package. I'm a first-time user and cannot figure out how to produce meaningful results despite reading your paper.
I have the following distribution: Counter({1: 233, 2: 33, 3: 13, 4: 10, 5: 2, 6: 3, 7: 2, 8: 2, 9: 1, 11: 1, 15: 1, 18: 1})
When feeding my dataset to the powerlaw.Fit method I get the warning "RuntimeWarning: invalid value encountered in true_divide (Theoretical_CDF * (1 - Theoretical_CDF))", which can be ignored, as I discovered in older threads.
The issue is the poor scaling parameters in the output: alpha = 3.81, xmin = 6. When plotting the hypothesized power-law distribution, you can clearly see the poor fit.
[image: unknown-1]
<https://user-images.githubusercontent.com/2664108/44064066-92927b2c-9f63-11e8-8651-9063dabe7560.png>
On the other hand, when using the code provided by Clauset (http://www.santafe.edu/~aaronc/powerlaws/), a better fit is produced, with alpha = 2.59, xmin = 1.
[image: unknown]
<https://user-images.githubusercontent.com/2664108/44064073-9b1d7864-9f63-11e8-82f9-df286ce29022.png>
I tried a couple of different configurations, but the output stays the same. What's going wrong?
Best regards
---
Interesting. What happens if you use any of the example datasets from the paper?
…On Tue, Aug 14, 2018 at 4:13 AM Oleg Polakow wrote:
Counter is for demo; usually I pass a flat array of integers from 1 to 18.
I also tried your suggestion with xmin, but I get alpha = 4.87.
[image: unknown-2]
<https://user-images.githubusercontent.com/2664108/44079698-0ae587cc-9faa-11e8-9169-22f126c8b192.png>
Could it be that I'm using Python 3.6? I had to upgrade the code provided by Clauset to work with my environment.
---
I downloaded the Manuscript_code.ipynb file and reran all cells; the output is the same as yours. Is the algorithm in powerlaw different from http://tuvalu.santafe.edu/~aaronc/powerlaws/plfit.py?
---
powerlaw, plfit, and (hopefully) other implementations all implement the same algorithm as in the Clauset et al. paper. The present behavior is almost certainly a bug outside the core algorithm, possibly due to something breaking with Python 3.6. The fact that the demos from manuscript_code.ipynb still work is heartening.
What happens when you pass floats instead of ints?
And just for my situational awareness, are you running the latest powerlaw version from PyPI?
---
Hmm, in Jupyter 1.4.3 is displayed, while pip says version 1.4.4 is installed. Am I using the older version? Seems the problem is with conda...
---
Conda installs alongside pip installs always seem to cause heartache. Try the latest version, then see what happens if you pass floats in an array.
---
After some refactoring: the version in powerlaw.py was not updated to 1.4.4, which is why I'm seeing 1.4.3 :p Now I'm sure it's the latest version; floats make no difference.
---
Oops! I better fix that. I'll hold off on pushing an update until we've figured out this issue, though.
Can you post minimal working examples of getting accurate results on one of the demo datasets and inaccurate results on your dataset?
---
I've created a pdf with some simple tests.
tests.pdf
<https://github.com/jeffalstott/powerlaw/files/2288965/tests.pdf>
---
Are you not using the discrete=True flag?
---
Tried; the output is the same.
---
I'll go grab some sleep, it's 4 am here :) Please keep me updated on your progress.
---
It cannot be the same output; the discrete and continuous versions are different algorithms. This is described in the powerlaw and Clauset et al. papers. plfit.py assumes inputs that look like integers are discrete, whereas powerlaw does not. Can you show the output when using the discrete=True option for fitting?
---
xmin = 58.0, xmin = 58.0, and xmin = 47, respectively.
---
For some realistic data (e.g. your original data, which I hope is large),
what does the visualized fit look like with discrete=True?
What happens if the length of the data is 100 or greater, so that
plfit doesn't do its finite-size adjustment?
Very small datasets, such as the one just given, are problematic for a
bunch of reasons.
---
The same applies to data that is 10x larger.
---
Using data on the scale of 3,000 points, what is the result if you force both implementations to use the same xmin (still with the discrete=True flag)?
I would start with a dataset that you know gives the correct answer (one of the discrete demos) and slowly modify it to look like the problematic dataset, then observe where the difference appears.
At one point many years ago I knew all the internals of plfit.py, and while I do not remember all of it now, I recall it making some decisions that powerlaw purposefully does not replicate (e.g. assuming data that "looks" discrete is discrete, applying a finite-size "correction", etc.). Now that the discrete=True flag is in use for this example, we have likely moved past most of those differences.
---
For floats, the results are the same. With discrete=True and integer data, powerlaw.py produces a slightly lower alpha but the same xmin. If I multiply each element in the dataset by 100, the results are the same. If I add 100 to each element, the results vary greatly. There seems to be some issue with the scale.
---
I can't speak to what plfit.py is doing, but a translation (simply adding a constant k to every observation) does, and should, change the probabilities. Look at Clauset et al. eq. 2.2. Take alpha = 2, xmin = 1, and x = 2:
(1/1.5) * (2.5/1.5)^-2 = 0.24
Add a translation of 1, thereby increasing xmin and x by 1:
(1/2.5) * (3.5/2.5)^-2 = 0.204...
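These two numbers are straightforward to check mechanically (plain Python, using the continuous form of eq. 2.2):

```python
# p(x) = ((alpha - 1) / xmin) * (x / xmin) ** -alpha   (Clauset et al. eq. 2.2)
def p(x, xmin, alpha):
    return ((alpha - 1) / xmin) * (x / xmin) ** -alpha

print(round(p(2.5, 1.5, 2), 3))  # 0.24  -- before the translation
print(round(p(3.5, 2.5, 2), 3))  # 0.204 -- after adding 1 to xmin and x
```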
It sounds like powerlaw is giving the expected results on demo data, on synthetically generated float data, and on synthetically generated discrete data (as long as the discrete flag is turned on). So it sounds like there's as yet no bug here.
Again, I can't speak to what plfit.py is doing, though it does appear to implement a "finite-size correction" for small data. powerlaw doesn't implement this, because its author considers it philosophically dubious: if you don't have lots of data, you cannot assess whether the structure of that data does or does not follow a power law over several orders of magnitude (and if you don't have several orders of magnitude, you shouldn't be trying to justify calling it a power law). With that said, if someone wanted to implement optional finite-size corrections in powerlaw and submit them as a pull request, they'd be taken on board.
---
I guess I'm having similar problems to @polakowo. I get a very poor fit just using the most basic example in the readme (result is very similar for both
Am I just plotting the results incorrectly, or is something deeper going on?
Edit: I see now that the "fitted" plot is just using data points >=
Edit 2: I should be using
which outputs
---
@rhjohnstone Thanks for the edits; it saved me typing! :)
To your question: xmin is determined by finding the value that minimizes the KS distance between the empirical distribution and a perfect power law. Take a look at the Clauset et al. or Alstott et al. papers for more info.
The larger issue is that the data is almost certainly not a power law, and if it is, it doesn't have the interesting properties we look for in a power law. That data shows p(x) dropping ~6 orders of magnitude over ~1.5 orders of magnitude of x. That would be an alpha of ~4, which is very steep. The interesting properties of power laws happen at smaller/shallower alphas (the mean is undefined for alpha <= 2; the standard deviation for alpha <= 3). Also, since the data spans only ~1.5 orders of magnitude, there's likely not enough of it to differentiate a power law from an exponential. If you use distribution_compare you'll probably find the power-law fit is less likely than the exponential.