Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

uniwig output inconsistency: npy and wig #64

Closed
ClaudeHu opened this issue Dec 21, 2024 · 8 comments · Fixed by #66
Closed

uniwig output inconsistency: npy and wig #64

ClaudeHu opened this issue Dec 21, 2024 · 8 comments · Fixed by #66

Comments

@ClaudeHu
Copy link
Member

Problem description

When run gtars.uniwig, different output format sometimes give different values. Here is how to reproduce the issue:

Reproduce the issue

Input

combined_chrsorted.bed

chr1	9010079	9010350	.	1000	.	101.05233	-1.00000	4.65032	114
chr1	22654616	22655040	.	811	.	21.95423	-1.00000	4.09357	212
chr1	28893770	28894040	.	1000	.	106.77643	-1.00000	4.65032	115
chr1	51648385	51648701	.	915	.	30.27649	-1.00000	3.39557	158
chr1	99943523	99943839	.	1000	.	72.39423	-1.00000	3.66991	158
chr1	101506654	101506843	.	1000	.	70.09492	-1.00000	4.65032	89
chr1	115052528	115052820	.	1000	.	154.56021	-1.00000	4.65032	160
chr1	117929585	117929901	.	888	.	36.53481	-1.00000	3.68167	158
chr1	161440342	161440624	.	1000	.	104.87638	-1.00000	4.65032	144
chr1	174204713	174204966	EH38E1398749	0	.	174204713	174204966	255,205,0	dELS	All-data/Full-classification
chr2	46656910	46657334	.	1000	.	49.64269	-1.00000	4.25428	212
chr2	70686919	70687199	.	1000	.	251.58387	-1.00000	4.11962	129
chr2	108636387	108636737	EH38E2023468	0	.	108636387	108636737	255,205,0	dELS,CTCF-bound	All-data/Full-classification
chr2	170820946	170821262	.	895	.	39.84464	-1.00000	3.78148	158
chr2	225463039	225463355	.	943	.	47.65063	-1.00000	3.90689	158
chr3	14234807	14235012	EH38E2180627	0	.	14234807	14235012	255,205,0	dELS	All-data/Full-classification
chr3	45545530	45545878	EH38E2197342	0	.	45545530	45545878	255,205,0	dELS	All-data/Full-classification
chr3	53219902	53220260	.	1000	.	188.62341	-1.00000	4.65032	171
chr3	65617840	65618031	EH38E2210302	0	.	65617840	65618031	255,205,0	dELS	All-data/Full-classification
chr3	122355628	122355944	.	1000	.	74.56709	-1.00000	3.65234	158
chr4	83455863	83456179	.	955	.	44.78550	-1.00000	3.94133	158
chr4	86916396	86916712	.	1000	.	79.59141	-1.00000	4.11962	158
chr4	114807841	114808124	.	1000	.	109.11340	-1.00000	4.65032	136
chr4	146095137	146095426	.	1000	.	120.70640	-1.00000	4.65032	147
chr5	40688183	40688493	.	1000	.	164.09236	-1.00000	4.65032	145
chr5	173885123	173885342	EH38E2433420	0	.	173885123	173885342	255,205,0	dELS	All-data/Full-classification
chr5	177601577	177602001	.	577	.	17.63237	-1.00000	4.14660	212
chr5	181107731	181108047	.	889	.	58.97310	-1.00000	3.79251	158
chr6	31652815	31653131	.	1000	.	32.90333	-1.00000	3.54086	158
chr6	37061234	37061582	EH38E2463885	0	.	37061234	37061582	255,205,0	dELS	All-data/Full-classification
chr6	90403685	90403992	EH38E2485592	0	.	90403685	90403992	255,205,0	dELS,CTCF-bound	All-data/Full-classification
chr6	131910002	131910300	.	1000	.	127.21766	-1.00000	4.65032	147
chr6	137588484	137588908	.	702	.	18.16283	-1.00000	4.13918	212
chr6	138371763	138372079	.	1000	.	55.22526	-1.00000	3.82898	158
chr6	142030173	142030324	EH38E2509972	0	.	142030173	142030324	255,205,0	dELS	All-data/Full-classification
chr6	161137318	161137742	.	1000	.	56.70154	-1.00000	4.21832	212
chr7	26689149	26689465	.	1000	.	73.45975	-1.00000	3.66033	158
chr7	45260021	45260437	.	1000	.	286.04104	-1.00000	4.65032	193
chr7	91880570	91881059	.	1000	.	342.70797	-1.00000	4.65032	244
chr7	104096892	104097316	.	549	.	14.16353	-1.00000	4.00339	212
chr7	105765202	105765481	.	1000	.	104.97511	-1.00000	4.65032	113
chr7	123538527	123538760	EH38E2587041	0	.	123538527	123538760	255,205,0	dELS	All-data/Full-classification
chr8	47203145	47203557	.	1000	.	192.69082	-1.00000	4.65032	210
chr8	61810033	61810304	.	1000	.	104.56865	-1.00000	4.65032	143
chr8	67064082	67064479	.	1000	.	212.79825	-1.00000	4.65032	170
chr8	97412036	97412460	.	711	.	22.09344	-1.00000	4.09207	212
chr8	105718148	105718309	.	1000	.	58.06510	-1.00000	4.21059	95
chr8	109316140	109316564	.	1000	.	37.25075	-1.00000	4.33114	212
chr9	100601193	100601467	.	1000	.	98.47223	-1.00000	4.65032	140
chr9	127629708	127630043	EH38E2728640	0	.	127629708	127630043	255,205,0	dELS,CTCF-bound	All-data/Full-classification
chr9	129023903	129024111	EH38E2730038	0	.	129023903	129024111	255,205,0	dELS	All-data/Full-classification
chr9	136787604	136788010	.	1000	.	182.21205	-1.00000	4.65032	207
chr9	137672812	137673132	.	1000	.	157.60890	-1.00000	4.65032	168
chr10	30542466	30542816	EH38E1459133	0	.	30542466	30542816	255,205,0	dELS	All-data/Full-classification
chr10	58065342	58065555	.	1000	.	68.11183	-1.00000	4.65032	102
chr10	70618163	70618479	.	789	.	35.42008	-1.00000	3.69799	158
chr10	70619318	70619606	.	1000	.	129.45292	-1.00000	4.65032	141
chr10	75230554	75230930	.	1000	.	288.90741	-1.00000	4.65032	208
chr10	75395292	75395689	.	1000	.	34.07007	-1.00000	4.35422	393
chr10	75926992	75927416	.	1000	.	7.46931	-1.00000	3.45607	212
chr11	326966	327282	.	631	.	26.35802	-1.00000	3.08680	158
chr11	3824655	3825149	.	1000	.	338.75753	-1.00000	4.65032	256
chr11	17208370	17208656	EH38E1524547	0	.	17208370	17208656	255,205,0	dELS	All-data/Full-classification
chr11	34653928	34654352	.	944	.	30.42535	-1.00000	4.38188	212
chr11	44538452	44538828	.	1000	.	237.41600	-1.00000	4.65032	210
chr11	44951088	44951512	.	590	.	6.45509	-1.00000	3.17923	212
chr11	108225680	108225830	EH38E1567625	0	.	108225680	108225830	255,205,0	dELS	All-data/Full-classification
chr12	911449	911799	EH38E1585794	0	.	911449	911799	255,205,0	dELS	All-data/Full-classification
chr12	53211840	53212123	.	1000	.	131.35950	-1.00000	4.65032	134
chr12	56286976	56287109	.	973	.	33.34393	-1.00000	4.35951	57
chr12	58176260	58176684	.	1000	.	37.88792	-1.00000	4.32637	212
chr13	40974666	40975013	EH38E1670499	0	.	40974666	40975013	255,205,0	dELS,CTCF-bound	All-data/Full-classification
chr13	93249544	93249860	.	671	.	33.51003	-1.00000	3.53200	158
chr14	30656356	30656780	.	583	.	7.01736	-1.00000	3.34142	212
chr14	34331730	34331969	EH38E1707911	0	.	34331730	34331969	255,205,0	dELS,CTCF-bound	All-data/Full-classification
chr14	65412222	65412646	.	1000	.	46.73653	-1.00000	4.27034	212
chr14	94307388	94307812	.	827	.	32.24248	-1.00000	4.36778	212
chr15	26838932	26839356	.	595	.	5.33301	-1.00000	2.70236	212
chr15	50260487	50260848	.	1000	.	154.85421	-1.00000	4.11962	184
chr15	81265862	81266166	EH38E1782384	0	.	81265862	81266166	255,205,0	dELS,CTCF-bound	All-data/Full-classification
chr15	83092216	83092387	EH38E1783292	0	.	83092216	83092387	255,205,0	dELS	All-data/Full-classification
chr15	84951974	84952140	EH38E1784147	0	.	84951974	84952140	255,205,0	dELS	All-data/Full-classification
chr16	11615692	11616008	.	1000	.	55.63653	-1.00000	3.82514	158
chr16	53892166	53892373	.	1000	.	70.46429	-1.00000	4.65032	70
chr17	64063745	64064097	.	1000	.	150.97659	-1.00000	4.65032	161
chr17	64491122	64491314	.	1000	.	60.69952	-1.00000	4.19729	107
chr17	68457855	68458171	.	619	.	23.64178	-1.00000	2.66143	158
chr17	75326210	75326517	EH38E1885800	0	.	75326210	75326517	255,205,0	dELS	All-data/Full-classification
chr17	78082429	78082779	EH38E1888686	0	.	78082429	78082779	255,205,0	dELS	All-data/Full-classification
chr18	46507152	46507576	.	867	.	29.06062	-1.00000	4.39358	212
chr19	23996546	23996970	.	1000	.	46.80163	-1.00000	4.26994	212
chr19	34841913	34842337	.	968	.	3.71930	-1.00000	1.71740	212
chr19	38675572	38675911	EH38E1952669	0	.	38675572	38675911	255,205,0	dELS	All-data/Full-classification
chr19	47790341	47790756	.	1000	.	181.45576	-1.00000	4.65032	186
chr19	49862752	49863176	.	984	.	38.71874	-1.00000	4.32089	212
chr20	2314455	2314805	EH38E2092757	0	.	2314455	2314805	255,205,0	dELS	All-data/Full-classification
chr20	32631914	32632225	EH38E2107184	0	.	32631914	32632225	255,205,0	dELS	All-data/Full-classification
chr20	45436864	45437282	.	1000	.	227.12597	-1.00000	4.65032	206
chr20	62386880	62387196	.	564	.	28.44988	-1.00000	3.26145	158
chr21	32569236	32569440	EH38E2137462	0	.	32569236	32569440	255,205,0	dELS,CTCF-bound	All-data/Full-classification

Chromosome size reference

Can be replaced with only main

chr1	248956422
chr2	242193529
chr3	198295559
chr4	190214555
chr5	181538259
chr6	170805979
chr7	159345973
chr8	145138636
chr9	138394717
chr10	133797422
chr11	135086622
chr12	133275309
chr13	114364328
chr14	107043718
chr15	101991189
chr16	90338345
chr17	83257441
chr18	80373285
chr19	58617616
chr20	64444167
chr21	46709983
chr22	50818468
chrX	156040895
chrY    57227415

Create npy

cargo run uniwig -f $INPUTDIR/combined_chrsorted.bed -c /home/claudehu/Desktop/bed/hg38.chrom.sizes -m 5 -s 50 -l $OUTPUTDIR/npy/ -y npy

Create wig

cargo run uniwig -f $INPUTDIR/combined_chrsorted.bed -c /home/claudehu/Desktop/bed/hg38.chrom.sizes -m 5 -s 50 -l $OUTPUTDIR/wig/all -y wig

Example error

On chr14

wig

$ sed -n '34369195,34369205p' all_start.wig
fixedStep chrom=chr14 start=30656351 step=50
2
2
2
2
2
2
2
2
2
2
$ sed -n '35896932,35896942p' all_start.wig
5
5
5
5
5
5
5
5
5
5
fixedStep chrom=chr15 start=26838927 step=50

npy

import os
import numpy as np

npy = np.load(os.path.expandvars("$OUTPUTDIR/npy/chr14_start.npy"))

len(npy) # 1527747 = 35896942 - 34369195

# expected to be [2, 2, 2, ..., 2]
npy[:10]  # array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint32)

# expected to be [5, 5, 5, ..., 5]
npy[-10:] # array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint32)
@donaldcampbelljr
Copy link
Member

I'm having a difficult time reproducing this with smaller test files.

Out of curiosity, did you clear out any output files before re-running uniwig?

BED file

chr1	2	6
chr1	4	7
chr1	5	9
chr1	7	12

chromsizes

chr1	30

used flags:

-m 2 -s 1

python code:

import numpy as np  
  
print("Starts")  
loaded_array = np.load('/home/drc/Downloads/unwig_testing_19dec2024/output/chr1_start.npy')  
print(loaded_array)  
print(loaded_array.shape)  
  
print("End")  
loaded_array = np.load('/home/drc/Downloads/unwig_testing_19dec2024/output/chr1_end.npy')  
print(loaded_array)  
print(loaded_array.shape)  
  
print("Core")  
loaded_array = np.load('/home/drc/Downloads/unwig_testing_19dec2024/output/chr1_core.npy')  
print(loaded_array)  
print(loaded_array.shape)

npy output:

Starts
[2 3 4 4 4 3 2 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
(29,)
End
[2 3 3 4 4 2 2 2 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
(26,)
Core
[2 2 3 4 2 2 2 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
(28,)

Wiggles:

start

fixedStep chrom=chr1 start=1 step=1
2
2
3
4
4
3
3
2
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0

end

fixedStep chrom=chr1 start=4 step=1
2
3
3
4
4
2
2
2
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0

core

fixedStep chrom=chr1 start=2 step=1
2
2
3
2
1
2
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0

@donaldcampbelljr
Copy link
Member

I also tried a subset of the above bed file (taking just chr 14 and chr 15) and was unable to reproduce the issue either.

For example, I see values at the beginning of chr14's npy files as expected:

Starts
[2 2 0 ... 0 0 0]
(25462455,)
End
[2 2 2 ... 0 0 0]
(25462314,)
Core
[2 2 2 ... 0 0 0]
(25462454,)

meta header

fixedStep chrom=chr15 start=26838928 step=3
fixedStep chrom=chr14 start=30656352 step=3

wiggle header

drc@databio:~/Downloads/unwig_testing_19dec2024/output$ head -10 _start.wig
fixedStep chrom=chr14 start=30656352 step=3
2
2
2
0
0
0

@ClaudeHu
Copy link
Member Author

I'm having a difficult time reproducing this with smaller test files.

Out of curiosity, did you clear out any output files before re-running uniwig?

BED file

chr1	2	6
chr1	4	7
chr1	5	9
chr1	7	12

chromsizes

chr1	30

used flags:

-m 2 -s 1

python code:

import numpy as np  
  
print("Starts")  
loaded_array = np.load('/home/drc/Downloads/unwig_testing_19dec2024/output/chr1_start.npy')  
print(loaded_array)  
print(loaded_array.shape)  
  
print("End")  
loaded_array = np.load('/home/drc/Downloads/unwig_testing_19dec2024/output/chr1_end.npy')  
print(loaded_array)  
print(loaded_array.shape)  
  
print("Core")  
loaded_array = np.load('/home/drc/Downloads/unwig_testing_19dec2024/output/chr1_core.npy')  
print(loaded_array)  
print(loaded_array.shape)

npy output:

Starts
[2 3 4 4 4 3 2 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
(29,)
End
[2 3 3 4 4 2 2 2 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
(26,)
Core
[2 2 3 4 2 2 2 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
(28,)

Wiggles:

start

fixedStep chrom=chr1 start=1 step=1
2
2
3
4
4
3
3
2
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0

end

fixedStep chrom=chr1 start=4 step=1
2
3
3
4
4
2
2
2
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0

core

fixedStep chrom=chr1 start=2 step=1
2
2
3
2
1
2
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0

I don't recall clearing out anythings before re-running.

Regarding the .npy and .wig files, are the values supposed to match exactly when other parameters remain constant?

For instance, looking at the start array from the .npy file:
[2, 3, 4, 4, 4, 3, 2, 1, 1, ...]

Shouldn't the start wiggle file begin like this:

fixedStep chrom=chr1 start=1 step=1  
2  
3  
4  
4  
4  
3  
2  
1  
1  
...  

Instead, it starts with:

fixedStep chrom=chr1 start=1 step=1  
2  
2  
3  
4  
4  
3  
3  
2  
1  
1  
...  

@donaldcampbelljr
Copy link
Member

Those differences are due to the shift from 1-based to 0-based for wig to npy (the shift factor is carried into the counting function and affect counting). We can revisit that logic and see if there is a way to get these to be exactly 1:1.

Above, I was focused on attempting to reproduce the items from your original post where it appeared that the npy arrays were empty or padded with zeroes (i.e. very different than the wiggle outputs).

Are you getting counts in a npy array that are very similar to the wiggles? Or are they completely different?

@donaldcampbelljr
Copy link
Member

TODO:

  • account for discrepancies between wig and npy outputs (see above for initial value truncation)

@ClaudeHu
Copy link
Member Author

Those differences are due to the shift from 1-based to 0-based for wig to npy (the shift factor is carried into the counting function and affect counting). We can revisit that logic and see if there is a way to get these to be exactly 1:1.

Above, I was focused on attempting to reproduce the items from your original post where it appeared that the npy arrays were empty or padded with zeroes (i.e. very different than the wiggle outputs).

Are you getting counts in a npy array that are very similar to the wiggles? Or are they completely different?

It would be ideal if the outputs can be exactly 1:1, so the inputs can be same from either .npy or .bw (converted from wiggle by wigToBigWig) for geniml.universe to make an hmm universe.

@ClaudeHu
Copy link
Member Author

Those differences are due to the shift from 1-based to 0-based for wig to npy (the shift factor is carried into the counting function and affect counting). We can revisit that logic and see if there is a way to get these to be exactly 1:1.

Above, I was focused on attempting to reproduce the items from your original post where it appeared that the npy arrays were empty or padded with zeroes (i.e. very different than the wiggle outputs).

Are you getting counts in a npy array that are very similar to the wiggles? Or are they completely different?

I rerun it but with chrom.size containing only chromosomes 14 & 15, plus -m 5 -s 3 . Then I compare arrays from .npy and arrays from .bw converted from .wig. This time they are very similar. However, I did double check that first problem with step size of 50 and chrom.sizes of all main chromosomes also existed...

@donaldcampbelljr
Copy link
Member

An interesting data point, I made a new test case:

chr1	10	15
chr1	12	18
chr1	16	22

Now the starts, ends are the same for npy and wig, but the cores are different.

From Core:

fixedStep chrom=chr1 start=10 step=1
2
2
3
3
1
1
2
1
1
Core
[2 
2 
3 
3 
3 
1
2 
2 
1 
1 
1 
1 

with settings:

        let smoothsize: i32 = 5;
        let output_type = "npy";
        //let output_type = "wig";
        let filetype = "bed";
        let num_threads = 6;
        let score = false;
        let stepsize = 1;

donaldcampbelljr added a commit that referenced this issue Jan 9, 2025
donaldcampbelljr added a commit that referenced this issue Jan 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants