7
$\begingroup$

My Problem:

Skipping some specific background, what I want to do isjudging whether some soft-clipping sequences are the same, which may result by the same SV event. Colored bases in Fig.1 is an example of soft-clipping sequences.

Fig.1 an example of soft-clipping sequences

I useBWA -MEM as the aligner. And I am trying to develop an SV caller actually, what I want to do here is refining PE(Paired-Ends) signal. If it is SR(Split-Reads), I don't need to do this, because of SR and its supplemental alignments coming from the same read. But PE signal generated from some cluster procedures need some filters. And I think K-mer is not a good way to do this which adapted by some SV callers. For me, as a bioinformatic fool, imitation and implementation may be a good learning process. It seems still a long way to go.

My solution:

I suppose this is a sequence alignment problem. Then I use FM-indexing based on BWT to do this. My inexact/approximate matching implementation is referred fromBWA.

My workflow:

  1. Extract these soft-clipping sequences
  2. Choose the longest one as the reference
  3. Iterate others to match with the reference, if the number of unmatched is large than 1, it means these sequences are not the same.

  • Question 1:Is that sounds reasonable to you? Any suggestions?

  • Question 2:If it is OK, how should I implementthe penalty scoring scheme?
    Here, because these sequences should come from the same SV event, the main differences(Mismatch, InDel) should arise from sequencing errors, not small genomic variants. So I want to set a higher gap open penalty and even higher gap extend penalty, like

    mismatch_penalty = 4gapopen_penalty = 11gapextend_penalty = 20alignment penalty cutoff = 30

The later means that one alignment with two more gaps will be considered as spurious alignment.

How do I optimize these parameters?

gringer's user avatar
gringer
15.7k5 gold badges27 silver badges84 bronze badges
askedOct 25, 2018 at 7:20
CodeUnsolved's user avatar
$\endgroup$
0

1 Answer1

1
$\begingroup$

You just need to check the flag for suplementary mapping. If there is suplementary mapping then use the coordinates; if they are all the same the seq is the same and you just have one event, if different reads have different suplementary mapping locations then there's multiple events. The problem is more complex than this though so just use an existing tool.

answeredJan 28, 2021 at 3:47
Liam McIntyre's user avatar
$\endgroup$

Your Answer

Sign up orlog in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

By clicking “Post Your Answer”, you agree to ourterms of service and acknowledge you have read ourprivacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.