
Optimize crawling performance #151

Open
marco-c opened this issue Sep 18, 2018 · 3 comments

Comments

@marco-c (Collaborator) commented Sep 18, 2018

Right now the crawler is quite slow; I think the slowest part is finding all the elements. Perhaps we should take a greedy approach instead and just click on the first available element.
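As a rough illustration of the greedy idea (plain Python, no Selenium; first_clickable and is_clickable are hypothetical stand-ins, not the crawler's API): instead of enumerating every element up front and then choosing one, stop scanning at the first usable candidate.

```python
# Sketch of greedy selection: lazily scan the candidates and return as soon
# as the first clickable one is found, rather than materializing all of them.

def first_clickable(candidates, is_clickable):
    # next() on a generator expression stops at the first match, so in the
    # common case only a prefix of the candidates is ever examined.
    return next((c for c in candidates if is_clickable(c)), None)

elements = ['hidden-div', 'disabled-btn', 'link', 'button']
print(first_clickable(elements, lambda e: e in ('link', 'button')))  # prints "link"
```

With a real driver the same shape would apply: try selectors one at a time and act on the first hit instead of collecting every element on the page first.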

@MadinaB (Contributor) commented Oct 29, 2018

Are you talking about the run_in_driver() method in crawler.py? It generates a sequence via sequence = run_in_driver(website, driver), and then for each element in the sequence the following is done:

for element in sequence:
    f.write(json.dumps(element) + '\n')

I think this part could be converted to some kind of execution pool, since the iterations do not appear to depend on each other's outcomes: each one simply runs a method and writes its output to a separate file.

for website in websites:
    data_folder = str(uuid.uuid4())
    os.makedirs(data_folder, exist_ok=True)
    try:
        sequence = run_in_driver(website, driver)
        with open('{}/steps.txt'.format(data_folder), 'w') as f:
            f.write('Website name: ' + website + '\n')
            for element in sequence:
                f.write(json.dumps(element) + '\n')
    except:  # noqa: E722
        traceback.print_exc(file=sys.stderr)
        close_all_windows_except_first(driver)

I think an execution pool with async behavior would be a good fit for this issue. I will run the crawler with cProfile before and after the change to see whether performance actually improves, and report back.
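A minimal sketch of what that pool could look like, assuming run_in_driver is safe to call concurrently (in practice each worker would need its own WebDriver instance); the run_in_driver stub below is hypothetical and only stands in for the real call:

```python
import json
import os
import uuid
from concurrent.futures import ThreadPoolExecutor

def run_in_driver(website):
    # Hypothetical stand-in for the real run_in_driver(website, driver);
    # a real version would acquire a per-worker driver instead.
    return [{'website': website, 'action': 'click'}]

def crawl_one(website):
    # Same per-website work as the loop above, but self-contained so it can
    # run in a worker: make a folder, crawl, and write the steps file.
    data_folder = str(uuid.uuid4())
    os.makedirs(data_folder, exist_ok=True)
    with open('{}/steps.txt'.format(data_folder), 'w') as f:
        f.write('Website name: ' + website + '\n')
        for element in run_in_driver(website):
            f.write(json.dumps(element) + '\n')
    return data_folder

websites = ['example.com', 'example.org']
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() preserves input order, so folders[i] corresponds to websites[i].
    folders = list(pool.map(crawl_one, websites))
```

Error handling (the try/except around run_in_driver) would move inside crawl_one so one failing website does not take down the whole pool.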

@rhcu (Collaborator) commented Oct 29, 2018

@MadinaB Just a suggestion: one of the slowest parts of the crawler, after downloading artifacts and interacting with elements, is diffing between two reports. This could be made faster by fixing mozilla/grcov#77, since I think doing the diff in Rust will be faster. Another slow part is converting output.json to HTML: currently this is done by converting the coveralls format to lcov, and then converting lcov to HTML with genhtml. This could also be sped up by resolving another grcov issue: mozilla/grcov#94.

@marco-c (Collaborator, Author) commented Oct 30, 2018

@MadinaB no, run_in_driver should be fine in terms of performance. I was talking about the way we select the next element to interact with:

children = find_children(driver)

Finding all the elements is slow.
