Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract peak cleaning from pepatac.py into a tools/ script #295

Open
syntonym opened this issue Nov 9, 2024 · 2 comments
Open

Extract peak cleaning from pepatac.py into a tools/ script #295

syntonym opened this issue Nov 9, 2024 · 2 comments

Comments

@syntonym
Copy link
Contributor

syntonym commented Nov 9, 2024

PEPATAC produces the PEPATAC_commands.sh and PEPATAC_log.md files, which makes it very easy to follow which commands were executed with which parameters and how each output file was exactly produced. Given the PEPATAC_command.sh file, it seems possible to execute that file and get the exact same results back. However, one step is directly implement in the pepatac.py file and is thus missing from PEPATAC_commands.sh and PEPATAC_log.md. In commit 461ae32c8ddf06bad5362aea1430b5dd714a3f3f this is the lines 2362-2384, containing the following code:

                df = pd.read_csv(peak_output_file, sep='\t', header=None,
                                 names=("V1","V2","V3","V4","V5","V6",
                                        "V7","V8","V9","V10"))
                nineNine = df['V5'].quantile(q=0.99)
                df.loc[df['V5'] > nineNine, 'V5'] = nineNine

                # rescale score to be between 0 and 1000
                df['V5'] = rescale(np.log(df['V5']), [0, 1000])

                cs = pd.read_csv(res.chrom_sizes, sep='\t', header=None,
                                 names=("V1","V2"))
                df = df.merge(cs, on="V1")
                df.columns = ["V1","V2","V3","V4","V5","V6",
                              "V7","V8","V9","V10","V11"]
                # make sure 'chromEnd' positions are not greater than the 
                # max chrom_size
                n = np.array(df['V3'].values.tolist())
                df['V3'] = np.where(n > df['V11'], df['V11'], n).tolist()

                df = df.drop(columns=["V11"])
                # ensure score is a whole integer value
                df['V5'] = pd.to_numeric(df['V5'].round(), downcast='integer')
                df.to_csv(temp.name, sep='\t', header=False, index=False)

Refactoring these lines out into a tools/ script and calling it through the normal mechanism would include the call into the PEPATAC_commands.sh and PEPATAC_log.md files, making it possible to track how each result file was produced exactly.

@syntonym
Copy link
Contributor Author

syntonym commented Nov 9, 2024

There is also a postprocess step for the fseq and fseq2 peak caller in lines 2107-2121, which does the same. That step overrides the peak file instead of creating a temporary file.

@donaldcampbelljr
Copy link
Member

Yes. This sounds good. We will attempt to add this with our next release.
Also, please feel free to submit a PR if you would like to take a shot at it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants