Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

avoid memory overflow on corrupted file #20

Merged
merged 1 commit into from
Nov 4, 2023

Conversation

gmossessian
Copy link
Contributor

This is a copy of my PR against the original abandoned DAWG project: pytries/DAWG#44

This fixes an issue that can cause the dawg to allocate a lot of memory on load in the case of a corrupted file.

Unfortunately, the original dawgdic library looks like it has long since ceased to be maintained, so I am opening a PR against the python wrapper.

To replicate the issue, use the following script:

from dawg import BytesDAWG as dawg

with open('corrupt.dawg', 'w') as f:
    f.write('corrupt!')

try:
    d = dawg().load('corrupt.dawg')
except:
    print('failed, as expected')

If you run this under gtime on OSX, you will see somewhere between 4 and 6GB of RAM being used.

$ gtime -v python load_corrupted_dawg
failed, as expected
	Command being timed: "python load_corrupted_dawg"
	User time (seconds): 2.94
	System time (seconds): 3.00
	Percent of CPU this job got: 84%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.04
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 6256108    <-------------- SIX GIGABYTES!!
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 258
	Minor (reclaiming a frame) page faults: 1890722
	Voluntary context switches: 338
	Involuntary context switches: 15360
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

After this PR, this number is greatly reduced

	Maximum resident set size (kbytes): 35152

@insolor
Copy link
Member

insolor commented Nov 4, 2023

I think I'll merge it, but actually this is a part of dawgdic code. I think we can adopt it too, but I'm not sure yet how to add it as a dependency without copy-pasting of its code.

@insolor insolor merged commit 475edde into pymorphy2-fork:master Nov 4, 2023
15 checks passed
@insolor
Copy link
Member

insolor commented Nov 10, 2023

@gmossessian can you please add the corresponding change here? https://github.com/pymorphy2-fork/dawgdic

@insolor
Copy link
Member

insolor commented Nov 24, 2023

@gmossessian if you don't mind, I'll import these changes to dawgdic

@gmossessian
Copy link
Contributor Author

gmossessian commented Nov 24, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants