Disclosure of Invention
The invention aims to provide a method for searching for Alpha satellite DNA sequences of a human chromosome centromere region, so as to solve the current problem of low efficiency in searching for Alpha satellite DNA sequences specific to a human chromosome centromere region.
To solve the above technical problem, the invention provides a method for searching for Alpha satellite DNA sequences of a human chromosome centromere region, comprising the following steps:
1) Obtaining the DNA sequence of the target chromosome centromere region and the DNA sequences of the centromere regions of the 24 human chromosomes;
2) Extracting repeated sequences from the obtained DNA sequence of the target chromosome centromere region to obtain the corresponding repeated sequences;
3) Aligning the obtained repeated sequences against the DNA sequences of the 24 human chromosome centromere regions to obtain a matching file, namely a file recording how each repeated sequence matches in each of the 24 human chromosome centromere regions;
4) Designing a script, linking the script to the matching file, and counting the number of occurrences of each repeated sequence in the matching file in each of the 24 human chromosome centromere regions; when a repeated sequence is repeated far more often in the DNA sequence of the target chromosome centromere region than in the DNA sequences of the other chromosome centromere regions, the repeated sequence is identified as an Alpha satellite DNA repeated sequence specific to the target chromosome centromere.
The invention first uses existing tools to obtain a matching file recording how the repeated sequences match in each chromosome, then designs a script, links the script to the matching file, and runs the script in the background to automatically count how many times each repeated sequence in the matching file is repeated in each chromosome; whether a repeated sequence is specific to the target chromosome is judged from these counts. The matching file does not need to be opened for manual statistical analysis: running the designed script on the matching file is sufficient to find the specific repeated sequences automatically.
Further, the script is implemented in the Perl language.
Because Perl is used as the scripting language, no separate compiling and linking step is needed to run the code, and the code can be run directly on the matching file to count the repetitions of the repeated sequences, which further improves search efficiency.
Further, when the number of repetitions is counted in step 4), one repetition is recorded for each match whose degree of matching with the repeated sequence is greater than a set threshold.
By setting a threshold, only sequences whose degree of matching is greater than the threshold are counted, so that qualifying sequences can be quickly picked out of the matching file, which further improves counting efficiency.
Further, the set threshold is 85%.
The invention sets the set threshold value to 85%, so that the statistical result can meet the specific requirement.
Further, before the obtained repeated sequences are aligned against the DNA sequences of the 24 human chromosome centromere regions, the DNA sequences of the 24 human chromosome centromere regions are merged, and the obtained repeated sequences are aligned against the merged sequence.
Merging the chromosome sequences into one database allows uniform matching and avoids opening each chromosome sequence separately, which improves matching efficiency.
Further, in step 2), the DNA sequence of the target chromosome centromere region is divided into several short sequences before repeated sequence extraction is performed.
Because the online website used for repeated sequence extraction limits the length of the submitted data, the target chromosome sequence is divided into several short sequences to meet the requirement of that website.
Further, the method further comprises the step of converting the statistical result in the step 4) into a table so as to realize visual display of the statistical result.
To display the search result intuitively, the statistical result obtained by running the script is copied into a table, where staff can conveniently check it.
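As an illustration of the table conversion, the following Python sketch writes per-chromosome counts to a CSV file that a spreadsheet program can open. It is not the procedure of the embodiment, which copies the script output into a table; the function name write_count_table, the nested-dictionary input and the layout with one row per repeated sequence are assumptions made for this example.

```python
import csv

# The 24 human chromosomes: 1 to 22, X and Y.
CHROMOSOMES = [str(i) for i in range(1, 23)] + ["X", "Y"]


def write_count_table(counts, chromosomes, out_path):
    """Write counts as a table: one row per repeated sequence, one column per chromosome.

    counts maps a repeated sequence name to a dict {chromosome: count}
    (hypothetical structure). Missing chromosomes are written as 0.
    """
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["repeat_sequence"] + list(chromosomes))
        for name in sorted(counts):
            row = [counts[name].get(chrom, 0) for chrom in chromosomes]
            writer.writerow([name] + row)
```

A call such as write_count_table(counts, CHROMOSOMES, "table.csv") would produce a file with 25 columns; the file name is illustrative.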
Detailed Description
The following describes the embodiments of the present invention further with reference to the drawings.
Matching files produced by conventional software are large, generally several gigabytes, and contain a large amount of information. Opening such a file takes time, and after opening, the repetition of each repeated sequence in each chromosome is determined by manual counting, which is inefficient and error-prone. To address these problems, the invention provides a method for searching for Alpha satellite DNA sequences of a human chromosome centromere region, which belongs to the technical field of molecular bioinformatics. The invention first uses existing tools to obtain a matching file recording how the repeated sequences match in each chromosome, then designs a script, links the script to the matching file, and runs the script in the background to automatically count how many times each repeated sequence in the matching file is repeated in each chromosome; whether a repeated sequence is specific to the target chromosome is judged from these counts. The implementation flow of the method is shown in fig. 1, and the specific implementation process is as follows.
1. Obtaining the DNA sequence of the target chromosome centromere region and the DNA sequences of the 24 human chromosome centromere regions.
This embodiment takes chromosome 6 as the target chromosome. The obtained DNA sequence of the chromosome 6 centromere region totals 4.1 Mb and can be obtained from the UCSC website; the DNA sequences of the 24 chromosome centromere regions are downloaded from the same website. The usage interface of the UCSC website is shown in fig. 2.
2. Merging the obtained DNA sequences of the 24 human chromosome centromere regions.
To facilitate subsequent searching and alignment, the obtained DNA sequences of the 24 human chromosome centromere regions are merged. In this embodiment, the Tbtools tool is downloaded and installed (https://download.csdn.net/download/dingbp/10700820) and used to merge the 24 chromosome centromere DNA sequences into a new database, named "1-Y".
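The merging step can also be illustrated without Tbtools. The following Python sketch concatenates several FASTA files into one file; it is an assumption-laden stand-in for the merge performed with Tbtools in this embodiment, and the function name merge_fasta and the FASTA input format are hypothetical choices for the example.

```python
def merge_fasta(fasta_paths, merged_path):
    """Concatenate FASTA files into a single file and return the number of records.

    Each input is assumed to be a plain-text FASTA file whose records start
    with a '>' header line. Files are streamed line by line, so large
    centromere sequences are never held in memory as a whole.
    """
    n_records = 0
    with open(merged_path, "w") as out:
        for path in fasta_paths:
            with open(path) as src:
                for line in src:
                    if line.startswith(">"):
                        n_records += 1
                    # Guarantee a newline so the next file's header starts a new line.
                    out.write(line if line.endswith("\n") else line + "\n")
    return n_records
```

The header lines are kept unchanged so that each match can later be attributed to its chromosome.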
3. Obtaining the repeated sequences.
Repeated sequence analysis is performed on the obtained DNA sequence of the target chromosome centromere region to obtain the corresponding repeated sequences (Masked Regions). The analysis is carried out on an online website (https://www.girinst.org/censor/index.php), as shown in fig. 3-a. Before the target sequence is uploaded to the website, it is divided into several short sequences according to the website's limit on the maximum amount of transmitted data, and the short sequences are uploaded separately. The repeated sequences of the target chromosome (chromosome 6) obtained from the online website are shown in fig. 3-b.
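The segmentation before uploading can be sketched in Python as follows. The upload limit of the website is not stated here, so max_len is left as a parameter; the function name split_sequence and the optional overlap between neighbouring pieces are assumptions added for illustration and are not part of the described embodiment.

```python
def split_sequence(seq, max_len, overlap=0):
    """Split seq into pieces of at most max_len characters.

    overlap (hypothetical option) repeats the last `overlap` characters of one
    piece at the start of the next, so that a repeat unit lying across a cut
    point still appears whole in one piece.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    pieces = []
    for start in range(0, len(seq), step):
        pieces.append(seq[start:start + max_len])
        if start + max_len >= len(seq):
            break  # the end of the sequence has been reached
    return pieces


print(split_sequence("ACGTACGTAC", 4))     # ['ACGT', 'ACGT', 'AC']
print(split_sequence("ACGTACGTAC", 4, 1))  # ['ACGT', 'TACG', 'GTAC']
```

With overlap left at 0 the pieces simply tile the sequence, which corresponds most closely to the plain segmentation described above.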
4. Aligning the obtained repeated sequences against the DNA sequences of the 24 human chromosome centromere regions to obtain the corresponding matching file.
In this embodiment, the Tbtools tool is used to align the repeated sequences (Masked Regions) obtained in step 3 against the new database "1-Y" obtained in step 2; the Tbtools interface used is shown in fig. 4, and the output file is named "123". The file generated by the Tbtools alignment is generally large, up to several gigabytes, and records how the repeated sequences align within the DNA sequences of the 24 human chromosome centromere regions, as shown in fig. 5.
5. Sorting and screening the data in the file obtained in step 4 with a script, so as to find the Alpha satellite DNA sequences of the human chromosome centromere region.
Because the obtained matching file is large, opening it directly takes time. To improve efficiency, the invention designs a script that runs under Perl; the script contains the name of the matching file, and a script command causes the Perl program to analyse and count that file. Because the script is linked directly to the matching file, running the script processes the file in the background without opening it, so the matching file is counted automatically.
The script counts the number of occurrences of each repeated sequence in the obtained matching file in each of the 24 human chromosome centromere regions, in order to judge whether the repeated sequence is an Alpha satellite DNA repeated sequence specific to the target chromosome centromere. When the repetitions are counted, one repetition is recorded for each match whose degree of matching with the repeated sequence is greater than a set threshold; the threshold may be chosen according to the actual situation and is 85% in this embodiment.
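The counting rule just described can be illustrated with a short Python sketch. This is not the Perl script of the embodiment. It assumes, purely for illustration, that each line of the matching file is tab-separated with the repeated sequence name in the first column, the chromosome name in the second and the percent identity in the third (the layout of BLAST tabular output); the actual layout of the matching file is not specified here, and the function name count_matches is hypothetical.

```python
from collections import defaultdict


def count_matches(lines, threshold=85.0):
    """Count, per repeated sequence and chromosome, the matches above threshold.

    lines is any iterable of text lines, for example an open file handle, so
    a file of several gigabytes is processed one line at a time.
    Assumed columns (tab-separated): query name, subject name, percent identity.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # not a match record in the assumed layout
        try:
            identity = float(fields[2])
        except ValueError:
            continue  # header or malformed line
        if identity > threshold:  # strictly greater than the set threshold
            counts[fields[0]][fields[1]] += 1
    return {query: dict(by_chrom) for query, by_chrom in counts.items()}
```

Under these assumptions, `with open("123") as fh: counts = count_matches(fh)` would read the matching file as a stream, which is what avoids opening the whole file in an editor.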
The script constructed in this example is shown below, where the input file name is "123", the output file name is "123result.txt", and 85 denotes a sequence matching degree of 85% (experiments verify that sequences with a matching degree higher than 85% can hybridize effectively).
In order to avoid the compiling and linking process of the program, the invention adopts perl language as the script language, and the specific implementation codes are as follows:
The final statistical result is obtained by running the script. The output file is in txt format; to display the statistical result intuitively, the result is copied into an Excel table, as shown in Table 1.
TABLE 1
In Table 1, one axis lists the repeated sequence names and the other lists chromosomes 1 to Y; each value is the number of times the sequence matches the given chromosome with a matching degree greater than 85%.
The sequence alignment shows that the sequence hg38_ DNA FRAGMENT 428655- >428822 has 1578 repeats on chromosome 6 and no more than 3 repeats on each of chromosomes 1, 3, 12 and 20, so the sequence can in theory be determined to be a chromosome 6 centromere-specific Alpha satellite DNA repeated sequence.
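The judgment "far more repeats on the target chromosome than on the others" is not quantified in the text. The following Python sketch shows one possible way to express it, assuming a minimum ratio between the target count and the highest count on any other chromosome; the function name is_specific and the default ratio of 100 are assumptions for illustration only. In the example, 1578 is the count reported for chromosome 6, and 3 is used for chromosomes 1, 3, 12 and 20 as the stated upper bound, not as measured values.

```python
def is_specific(counts_by_chrom, target, min_ratio=100.0):
    """Return True if the target chromosome count dominates all other chromosomes.

    min_ratio is a hypothetical cut-off: the target count must be at least
    min_ratio times the highest count found on any other chromosome.
    """
    target_count = counts_by_chrom.get(target, 0)
    if target_count == 0:
        return False
    highest_other = max(
        (n for chrom, n in counts_by_chrom.items() if chrom != target),
        default=0,
    )
    if highest_other == 0:
        return True  # the sequence matches the target chromosome only
    return target_count / highest_other >= min_ratio


# 1578 is the reported chromosome 6 count; 3 is the stated upper bound elsewhere.
example = {"6": 1578, "1": 3, "3": 3, "12": 3, "20": 3}
print(is_specific(example, "6"))  # True, since 1578 / 3 = 526
```

A different cut-off can be passed as min_ratio if a stricter or looser notion of specificity is wanted.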
Through the process, the invention can respectively obtain the specific repeated sequences of the chromosome 6, the chromosome 4 and the chromosome 20, wherein the specific repeated sequences are respectively as follows:
Chromosome 6 Alpha satellite DNA sequence:
TGTAGTATTTCCAAGCGGATATTTGGAACGCCTTGAAGCGTATGGTAGAAAAGGAAATATCTTTCCATAAAACCTAGACAGAACCCATCTCAGAAACGACTTTGTGATGTCTGCATTGAACTCACAGAGTTGAACATTTCTCTTGATAGAGCAGTTTTGAAACCCTCT
Chromosome 4 Alpha satellite DNA sequence:
CTGCACTACCAGGAAGTGGACATTTCGAGCGCTTTGAGGCTTATGGTGAAAAAGGAAATATCTTCTCATAAAAACCAGAAAGAAGCGTTCTCAGAAACTTCTTTGTGTTGTGTGTACTCATGTAACAGTGTTGAACCATCCTTTTGACAGAGCAGTTTTGAAACAATCTTT
Chromosome 20 Alpha satellite DNA sequence:
GGATGTTTCGATTGAAGTCCCAGTGTTGAACATTCCCTTTTATAGAGCAGGTTGGAAACACTCTTTCTGCATTCCCTGGAAGTGGACATTTAGAGCGCTTTCAGGACGACGGTGAAAATGGAAATATCTTCCAAGAAAATCTAGATAGAA
To verify the effectiveness of the search method, the chromosome centromere-specific Alpha satellite DNA repeated sequences found were fluorescently labeled and used in in-situ hybridization experiments, taking the specific repeated sequences found for chromosomes 4, 6 and 20 as examples. The experimental procedure was as follows:
I. Fluorescent labeling:
① The Alpha satellite DNA sequence found is sent to a gene synthesis company to be cloned into the pUC57 vector;
② An upstream primer and a downstream primer are synthesized, wherein the sequence of the upstream primer is CCTTATCCGGTAACTATCGTCT and the sequence of the downstream primer is TGTTCTTCTAGTGTAGCCGTA. (Note: these primer sequences are pUC57 vector sequences, used to amplify the inserted Alpha satellite DNA repeated sequence.)
③ PCR labeling is carried out by taking the cloned plasmid as a template, and the labeling system is as follows:
After the PCR reaction is completed, the Alpha satellite DNA probe solution is obtained.
II. Preparation of probe working solution
1 μL of Alpha satellite DNA probe solution, 1 μL of gene probe solution, 1 μL of Cot-1 DNA and 7 μL of hybridization buffer are mixed evenly and centrifuged to prepare the probe working solution.
Note: the selected gene probe is located on the same chromosome as the corresponding Alpha satellite DNA probe and serves as the control probe; Cot-1 DNA is used to block nonspecific hybridization.
III. Fluorescence in situ hybridization (FISH)
1. Taking a clean glass slide;
2. Taking 5 μl of the suspension after re-suspending the cells, dripping it onto the glass slide, and airing at room temperature;
3. Observing the cell density under a phase contrast microscope with a 10× objective lens; the cells should not overlap, and 100-200 cells per field of view is preferred:
3.1 if the cell density and number are proper, continuing to step 4;
3.2 if there is overlap of cells, diluting the cell suspension by adding appropriate fresh fixative;
3.3 if the cell density is low, centrifuging at 2000 rpm for 5 minutes, carefully removing an appropriate amount of supernatant, mixing evenly, taking 3 μl of the suspension, preparing a slide, airing, and observing;
4. If too much cell debris is observed under a phase contrast microscope, pretreatment is required and the appropriate hybridization region is selected.
Pretreatment of glass slides:
The slide carrying the dropped cells is baked at 60 ℃ for 20-30 minutes.
① After baking, the slide is immersed in 2×SSC at room temperature for 10 minutes;
② The slide is then dehydrated by soaking in 70%, 85% and 100% ethanol in sequence for 2 minutes;
③ The slide is taken out and aired at room temperature.
Sample denaturation and hybridization with the probe (protect from light):
① Take out the probe, mix, centrifuge, add 10 μl of probe to the hybridization area, cover with a cover glass, and seal along the edge of the cover glass with sealing glue; the seal must be complete to prevent the hybridization solution from evaporating.
② Place the slide on a hybridization instrument for denaturation and hybridization according to the set program (recommended: denaturation at 82 ℃ for 6 minutes, hybridization at 40 ℃ overnight).
Post-hybridization wash (protect from light):
1. Washing solution I is placed in a water bath 30 minutes before washing and preheated to 72±1 ℃.
2. After the slide is removed from the hybridization instrument and the sealing glue is removed, the slide is immersed in washing solution II at room temperature for 10 minutes to remove the cover glass (do not remove the cover glass directly with tweezers).
3. The slide is placed in washing solution I at 72±1 ℃ for 2 minutes.
4. The slide is removed and placed in washing solution II at room temperature for 2 minutes.
5. The slide is taken out, soaked in 70% and 85% ethanol solution for 2 minutes, and then dried at room temperature.
6. DAPI staining and sealing: 10 μl of DAPI counterstain solution is dripped onto the sample, avoiding bubbles; a cover glass is applied, its edge is sealed with nail polish, and the slide is incubated at -20 ℃ for 10-20 minutes.
7. Microscopic examination: the slide is kept in the dark and observed and counted under a fluorescence microscope; for long-term storage it should be kept at -20 ℃.
The experimental results are shown in FIGS. 6-a, 6-b and 6-c, wherein FIG. 6-a shows the result obtained with the fluorescent probe labeled from the Alpha satellite DNA sequence of the chromosome 4 centromere region together with the gene probe, FIG. 6-b shows the corresponding result for the chromosome 6 centromere region, and FIG. 6-c shows the corresponding result for the chromosome 20 centromere region. This shows that the specific repeated sequences found by the method meet the biological requirement; the invention finds specific repeated sequences not only quickly but also accurately.