Generated by ChatGPT and revised by Yibing Zeng (the EM-seq section is mainly from Jonathan I. Gent):
Questions about the scripts are welcome; please email Yibing Zeng ([email protected]).
This repository contains data, code, and resources supporting the paper "Reshaping the maize karyotype using synthetic centromeres" by Yibing Zeng et al. We demonstrate the feasibility of engineering functional synthetic centromeres in maize and their accurate segregation throughout plant development, including meiosis. By tethering the key centromere protein CENP-A/CENH3 to synthetic repeat sequences, we generated neochromosomes derived from chromosome 4 and characterized their structural and functional stability. Notably, we recovered and analyzed a truncated 4a chromosome paired with a complementing 4b neochromosome, which, when homozygous, supports normal plant growth, meiosis, and gene expression. This work establishes a foundation for centromere engineering to reshape plant karyotypes and accelerate artificial chromosome development. The repository includes sequencing data, methylation and gene expression analyses, and all custom scripts used in this study.
workflow:
step 1. The primary contigs of the ABS4 inbred assembly were generated by hifiasm with the -l 0 parameter (-l 0 turns off haplotig purging, which suits the homozygosity assumption). The -u option enables post-joining of overlapping regions but creates assembly errors elsewhere, so we disabled it here.
step 2. Misassembled contigs (contigs wrongly joined through (TAC)n repeats) and over-scaffolding (RagTag did not distinguish heterozygous regions) were resolved with custom scripts that remove contigs overlapping preceding contigs. ABS contigs were scaffolded by post-joining overlapping regions in head-to-tail orientation.
We developed an R function, plot_paf (in 06_VisualizeABSGenomeProblem.R under the assembly folder), to visualize the over-scaffolding problem created by hifiasm and compounded by RagTag.
Here is a dotplot comparing the A188 genome with the final ABS4 inbred assembly:
step 3. Gene annotation was performed with Liftoff, using A188 as the guide.
step 4. TE annotation was performed with EDTA; the parameter choices follow what Ou Shuju did for the maize pan-genome assembly published by Hufford et al. (2021).
step 5. ABS and pACH25 were annotated with blastn.
step 6. The visualization here uses the Karyotype package in R. The visualization in the main figure uses a custom script under the mapping folder. It is much easier than you might think: take the start and end positions of the blocks you want to show (genes, TEs or others) and draw each block with geom_rect. You can adjust the position of each rectangle by assigning different heights (denoted y in the script).
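The block-drawing idea in step 6 can be sketched as follows. This is a minimal illustration in Python rather than the repository's R script: it only prepares the xmin/xmax/ymin/ymax values that a rectangle layer such as geom_rect would consume. The names Feature, Rect, features_to_rects and track_y, the coordinates and the default height are all hypothetical and not taken from the repository.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    start: int  # block start (bp)
    end: int    # block end (bp)
    kind: str   # annotation type, e.g. "gene" or "TE"


class Rect(NamedTuple):
    xmin: int
    xmax: int
    ymin: float
    ymax: float
    kind: str


def features_to_rects(features, track_y, height=0.5):
    """Turn each feature into one rectangle.

    track_y maps an annotation type to the baseline y of its track, so
    different types are stacked at different heights (hypothetical layout).
    """
    rects = []
    for f in features:
        y = track_y[f.kind]
        rects.append(Rect(f.start, f.end, y, y + height, f.kind))
    return rects


# Made-up example blocks, not real annotation.
features = [
    Feature(1_000, 5_000, "gene"),
    Feature(7_000, 9_000, "TE"),
    Feature(12_000, 15_000, "gene"),
]
track_y = {"gene": 0.0, "TE": 1.0}
rects = features_to_rects(features, track_y)
for r in rects:
    print(r)
```

Each annotation type gets its own baseline, so moving a whole track only requires changing one value in track_y.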
Low-coverage Illumina sequencing is used to estimate the ploidy of chromosome 4 with the following workflow:
step1: bwa indexes the W22 genome and maps the Illumina reads to it;
step2: bedtools makewindows generates 100-kb windows;
step3: bedtools intersect -c calculates the read count for each bin;
step4: The ploidy estimation is done in R by: mutate(ploidy = 2 * (count / sum(count)) * length(count));
step5: The result is visualized in R using ggplot2, with each dot representing the ploidy estimate for one 100-kb window.
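The formula in step4 is equivalent to dividing each window's count by the mean count over all windows and multiplying by 2, so a window at average depth scores 2. A minimal sketch of that arithmetic, in Python rather than the repository's R, is given here. The function name and the example counts are hypothetical, and the sketch assumes sum and length are taken over all windows passed in, which depends on how the data frame is grouped in the actual script.

```python
def relative_copy_number(counts, baseline=2):
    """Compute baseline * (count / sum(count)) * length(count) per window.

    Algebraically this is baseline * count / mean(count).
    """
    total = sum(counts)
    n = len(counts)
    return [baseline * (c / total) * n for c in counts]


# Made-up read counts for four windows.
counts = [100, 100, 150, 50]
estimates = relative_copy_number(counts)
print(estimates)  # [2.0, 2.0, 3.0, 1.0]
```

Because the values are relative to the mean depth of the supplied windows, a large gain or loss shifts the baseline for every other window as well.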
Here, we tested three different references: 1. the ABS4 assembly, in which we first seeded centromeres; 2. the W22 genome, to which the neo4bs were repeatedly backcrossed; 3. Mo17, a distantly related but near-perfect assembly. W22 yields the most consistent results. The code for reference testing can be found under the mapping folder.
Although I call the object a ploidy estimate here, it is not strictly ploidy: one chromosome can only have one ploidy, but we have thousands of data points here, one per window.
step1: bwa indexes the ABS4 inbred assembly and maps the Illumina reads to it;
step2: bedtools makewindows generates 10-kb windows (smaller bins help show whether CENH3 extends over genes);
step3: bedtools intersect -c calculates the read count for each bin;
step4: The visualization is done with a custom R script that supports color-blind-friendly color selection.
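For readers unfamiliar with the windowing and counting steps above, here is a minimal Python sketch of the idea. It is not a replacement for bedtools: it assumes BED-style 0-based half-open intervals and a single chromosome, and it counts a read in every window it overlaps by at least 1 bp. The function names, chromosome length and read coordinates are hypothetical.

```python
def make_windows(chrom_len, size):
    """Tile a chromosome with half-open [start, end) windows.

    The last window is truncated at the chromosome end.
    """
    return [(s, min(s + size, chrom_len)) for s in range(0, chrom_len, size)]


def count_overlaps(windows, reads, size):
    """Count, for each window, the reads overlapping it by at least 1 bp."""
    counts = [0] * len(windows)
    for start, end in reads:  # half-open read intervals
        first = start // size
        last = min((end - 1) // size, len(windows) - 1)
        for i in range(first, last + 1):
            counts[i] += 1
    return counts


# Made-up 25-kb chromosome with 10-kb windows and four 150-nt reads.
windows = make_windows(25_000, 10_000)
reads = [(100, 250), (9_950, 10_100), (20_010, 20_160), (20_500, 20_650)]
counts = count_overlaps(windows, reads, 10_000)
print(windows)  # [(0, 10000), (10000, 20000), (20000, 25000)]
print(counts)   # [2, 1, 2]
```

The second read spans a window boundary and is therefore counted in both neighbouring windows, which is worth keeping in mind when bins are small relative to read length.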
step1: STAR indexes the ABS4 inbred assembly and maps the Illumina reads to it;
step2: featureCounts is used to calculate the read count for each exon;
step3: Gene read counts are summed in R, and DESeq2 is used for the fold-change, RLE and DEG analyses;
step4: The visualization is done with a custom R script. Putting two histograms in one plot only requires a scale factor to shrink or enlarge one distribution. Here is the trick: fix the distribution you like and scale the other one!
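The scaling trick in step4 can be sketched as follows, in Python rather than the repository's R. The sketch assumes the scale factor is chosen so that the tallest bins of the two histograms match; the actual script may pick the factor differently. The function name and bin counts are hypothetical.

```python
def scale_to_match(fixed_counts, other_counts):
    """Rescale other_counts so its tallest bin equals that of fixed_counts.

    Returns the factor and the rescaled counts. The same factor can be
    used to label a secondary axis in the units of the second histogram.
    """
    factor = max(fixed_counts) / max(other_counts)
    return factor, [c * factor for c in other_counts]


# Made-up bin counts for two histograms sharing the same bins.
fixed = [2, 10, 4]    # the distribution kept as is
other = [8, 40, 20]   # the distribution to shrink
factor, scaled = scale_to_match(fixed, other)
print(factor)  # 0.25
print(scaled)  # [2.0, 10.0, 5.0]
```

Only one distribution is transformed, so the primary axis still reads in the original units of the fixed one, and dividing an axis value by the factor recovers the units of the other.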
step1: EM-seq reads were trimmed of adapter sequence using cutadapt, with parameters -q 20 -a AGATCGGAAGAGC -A AGATCGGAAGAGC -O .
step2: Reads were aligned to each genome and methylation values were called using BS-Seeker2 v.2.1.5, with parameters -m 1 --aligner=bowtie2 -X 1000.
step3: The resulting files in CGmap format were processed using CGmapTools v.1.2. The replicates of the ABS4 homozygous line and of the 4a(3) 4b(3) homozygous line were merged pairwise using the merge2 tool, and methylation in 100-kb windows was calculated with the mbin tool, with parameters -c 1 -B 100000.
step4: Results were plotted with ggplot2 as loaded through tidyverse.
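To make the windowed methylation step more concrete, here is a minimal Python sketch of binning CGmap-style records. It is an illustration under stated assumptions, not a reimplementation of mbin: it assumes eight tab-separated columns (chromosome, nucleotide, 1-based position, context, dinucleotide context, methylation level, methylated count, total count), and it takes an unweighted mean of per-site methylation levels, which may differ from how mbin aggregates sites. The function name, the optional context filter and the example records are hypothetical.

```python
from collections import defaultdict


def window_methylation(cgmap_lines, bin_size=100_000, min_cov=1, context=None):
    """Average per-site methylation level in fixed-size windows.

    Sites with total coverage below min_cov are skipped. If context is
    given (e.g. "CG"), only sites in that context are used.
    """
    sums = defaultdict(float)
    sites = defaultdict(int)
    for line in cgmap_lines:
        chrom, _nuc, pos, ctx, _dinuc, level, _mc, cov = line.rstrip("\n").split("\t")
        if int(cov) < min_cov:
            continue
        if context is not None and ctx != context:
            continue
        key = (chrom, (int(pos) - 1) // bin_size)  # window index from a 1-based position
        sums[key] += float(level)
        sites[key] += 1
    return {key: sums[key] / sites[key] for key in sums}


# Made-up records, not real data.
records = [
    "chr4\tC\t100\tCG\tCG\t0.50\t5\t10",
    "chr4\tG\t101\tCG\tCG\t1.00\t8\t8",
    "chr4\tC\t150000\tCG\tCG\t0.25\t1\t4",
    "chr4\tC\t200\tCHH\tCA\t0.00\t0\t6",
]
result = window_methylation(records, context="CG")
print(result)  # {('chr4', 0): 0.75, ('chr4', 1): 0.25}
```

Averaging per-site levels gives every covered cytosine equal weight regardless of depth; weighting by read counts instead would let deeply covered sites dominate a window.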
Root-tip FISH verified the 11-chromosome line carrying ABS4 (the synthetic centromere), which is labeled with FITC (green).
NCBI BioProject: PRJNA874319
ABS4 Genome: Zenodo
Tools: Hifiasm, RagTag, BWA, BEDTools, STAR, DESeq2, ggplot2, BS-Seeker2, CGmapTools.
*Code: All scripts and analysis pipelines are available in this repository. Some additional code is included for pre-analysis but was not used in the analysis for the paper, e.g. (1. a single-end read simulation using a custom R script to check the mapping performance of Illumina reads (150 nt) over repetitive regions; 2. calculation of the proportion of ABS-containing reads for quality control; 3. Illumina analysis for 14 additional Neo4bs screens (previously named Neo4ls) containing SVs generated by breakage-fusion-bridge cycles; 4. the small TE annotation within the ABS insertion region, which was identified as a misannotation).
*The code is organized by analysis type; scripts with the same order number form a pair, one for the analysis (.sh) and one for the visualization (.R).