
Optimize crawling performance #151

Open
marco-c opened this issue Sep 18, 2018 · 3 comments

Comments

@marco-c (Collaborator) commented Sep 18, 2018

Right now the crawler is quite slow; I think the slowest part is finding all the elements. Perhaps we should take a greedy approach instead and just click on the first available element.
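As a rough illustration of the greedy idea (plain Python, no Selenium; first_clickable and is_clickable are hypothetical stand-ins, not the crawler's API): instead of enumerating every element up front and then choosing one, stop scanning at the first usable candidate.

```python
# Sketch of greedy selection: lazily scan the candidates and return as soon
# as the first clickable one is found, rather than materializing all of them.

def first_clickable(candidates, is_clickable):
    # next() on a generator expression stops at the first match, so in the
    # common case only a prefix of the candidates is ever examined.
    return next((c for c in candidates if is_clickable(c)), None)

elements = ['hidden-div', 'disabled-btn', 'link', 'button']
print(first_clickable(elements, lambda e: e in ('link', 'button')))  # prints "link"
```

With a real driver the same shape would apply: try selectors one at a time and act on the first hit instead of collecting every element on the page first.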

@MadinaB (Contributor) commented Oct 29, 2018

Are you talking about the run_in_driver() method in crawler.py? It generates a sequence via sequence = run_in_driver(website, driver), and then for each element in the sequence the following is done:

for element in sequence:
    f.write(json.dumps(element) + '\n')

I think this part could be converted to some kind of execution pool, since the iterations do not appear to depend on each other's outcomes: each one simply runs a method and writes its output to a separate file.

for website in websites:
    data_folder = str(uuid.uuid4())
    os.makedirs(data_folder, exist_ok=True)
    try:
        sequence = run_in_driver(website, driver)
        with open('{}/steps.txt'.format(data_folder), 'w') as f:
            f.write('Website name: ' + website + '\n')
            for element in sequence:
                f.write(json.dumps(element) + '\n')
    except:  # noqa: E722
        traceback.print_exc(file=sys.stderr)
        close_all_windows_except_first(driver)

I think an execution pool with async behavior would be a good fit for this issue. I will run the crawler with cProfile before and after the change to see whether performance actually improves, and report back.
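A minimal sketch of what that pool could look like, assuming run_in_driver is safe to call concurrently (in practice each worker would need its own WebDriver instance); the run_in_driver stub below is hypothetical and only stands in for the real call:

```python
import json
import os
import uuid
from concurrent.futures import ThreadPoolExecutor

def run_in_driver(website):
    # Hypothetical stand-in for the real run_in_driver(website, driver);
    # a real version would acquire a per-worker driver instead.
    return [{'website': website, 'action': 'click'}]

def crawl_one(website):
    # Same per-website work as the loop above, but self-contained so it can
    # run in a worker: make a folder, crawl, and write the steps file.
    data_folder = str(uuid.uuid4())
    os.makedirs(data_folder, exist_ok=True)
    with open('{}/steps.txt'.format(data_folder), 'w') as f:
        f.write('Website name: ' + website + '\n')
        for element in run_in_driver(website):
            f.write(json.dumps(element) + '\n')
    return data_folder

websites = ['example.com', 'example.org']
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() preserves input order, so folders[i] corresponds to websites[i].
    folders = list(pool.map(crawl_one, websites))
```

Error handling (the try/except around run_in_driver) would move inside crawl_one so one failing website does not take down the whole pool.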

@rhcu (Collaborator) commented Oct 29, 2018

@MadinaB Just a suggestion: one of the slowest parts of the crawler, after downloading artifacts and interacting with elements, is diffing between two reports. This could be made faster by fixing mozilla/grcov#77, since I think doing the diff in Rust will be faster. Another slow part is converting output.json to HTML: currently this is done by converting the coveralls format to lcov, and then converting lcov to HTML with genhtml. This could also be sped up by resolving another grcov issue: mozilla/grcov#94.

@marco-c (Collaborator, Author) commented Oct 30, 2018

@MadinaB no, run_in_driver should be fine in terms of performance. I was talking about the way we select the next element to interact with:

children = find_children(driver)

Finding all the elements is slow.
