difflib

Andrew Dalke dalke at acm.org
Sun Apr 22 21:43:48 EDT 2001


Tim:
>That's because you're trying to outguess nature, but SequenceMatcher is
>trying to outguess people:  the latter doesn't care about the most efficient
>way to change X into Y, it wants to guess what a *person* probably did when
>changing X into Y by hand.

I think you might be surprised.  There are "optimal" sequence
alignment programs but people will modify the results by hand to
make them look like what they expect.  The problem is that there is
no true optimal solution, so we end up with sequence alignment programs
which have a lot of tunable parameters and with humans modifying the
results.

>    before:  private Thread currentThread;
>    after:   private volatile Thread currentThread;
>
>Run this thru your usual sequence comparator, and it's likely to say "Ah, I
>know!  They inserted 'e volatil' between the 't' and 'e' at the end of
>'private'".

If I did that, I would say that spaces define gaps so there is much
less cost to adding characters at those sites.  Then the alignment
code would say "add 'volatile ' after 'private '" which is as expected.

>    before:  ab
>    after:   acab
>
>"Ah, I know!  They stuffed 'ca' into the middle!" (something Unix diff and
>Windows Windiff both believe, albeit via entirely different algorithms).

Adding to the beginning or end of a sequence should have a lower
penalty than adding in the middle.  Most good bio alignments should
handle this case as expected.

>"more efficient" ways to get from before to after.  AFAIK, it's only people
>using vi who spend minutes plotting out the most efficient key sequence to
>get a job done <wink>.

Biological sequence alignment programs don't like breaking up
regions into lots of parts because that's more likely to destroy a
coding region, or break an important structure.  So they introduce
gaps and penalties associated with different types of gaps (creating
a gap, increasing the size of a gap, adding to the beginning or end).
These usually turn out to make the results more like what people expect.

And you haven't seen the *hours* people will spend to align sequences
by hand.  I know at least one case where someone became a coauthor
on a paper because the hand alignment he did proved very insightful.
(This was for a multiple sequence alignment, which is an NP-hard problem
and can't be solved optimally for more than about a dozen sequences.)

>I don't know, but I named it "difflib.py" instead of "sequencematcher.py"
>just in case one of you intrepid researchers did have something of general
>value.

Hey - working code is good.  We've got nothing that could be used
for this any time soon, and anything we're likely to do will be
optimized for aligning letters, not words or lines.  I was just
surprised not to have heard of the alignment method before.
I will keep difflib in mind as we work on this problem.

>> Hmm, and it seems difflib has support for masking what is sometimes
>> viewed as low information content regions.
>
>No, Wootton has support for what difflib views as clumps of junk.  I'm not a
>fan of overblown terminology unless *so* overblown it's funny <wink>.

Wink noted, but I feel like being elaborative today.  <wink>

"Junk" has another meaning in DNA sequences.  Junk DNA is noncoding
regions (they don't produce protein) but some junk DNA appears
conserved, perhaps because it was created after a recent duplication
event or because it has an impact on transcription elsewhere.
So we need to use another phrase besides "junk" for what you call
"junk" (people want to align junk DNA).  "Low information content"
works.  BTW, the phrase isn't even exact - purely random sequences
contain a lot of information in the entropy sense, but don't code
for anything real, so biologists ignore them because they contain
no information to them.

                    Andrew
                    dalke at acm.org
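
For anyone who wants to poke at the two before/after examples quoted
above, here is a minimal sketch using difflib.SequenceMatcher at the
character level.  The helper names `edits` and `space_is_junk` are made
up for this illustration and are not part of difflib; treating the
space character as junk is just one assumed way to approximate the
"spaces define gaps" idea, not a claim about how any alignment program
scores gaps.

```python
from difflib import SequenceMatcher


def space_is_junk(ch):
    # Hypothetical junk predicate: only the space character is junk.
    return ch == " "


def edits(before, after, isjunk=None):
    """Return the non-'equal' opcodes as (tag, removed_text, inserted_text)."""
    sm = SequenceMatcher(isjunk, before, after)
    return [
        (tag, before[i1:i2], after[j1:j2])
        for tag, i1, i2, j1, j2 in sm.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    before = "private Thread currentThread;"
    after = "private volatile Thread currentThread;"

    # No junk: the longest common block swallows the trailing 'e' of
    # 'private', so the reported insertion straddles the word boundary.
    print(edits(before, after))
    # [('insert', '', 'e volatil')]

    # Spaces as junk: matches are anchored on whole words first, so the
    # insertion lines up with what a person would say they typed.
    print(edits(before, after, isjunk=space_is_junk))
    # [('insert', '', 'volatile ')]

    # The second example: the insertion is reported at the beginning,
    # not in the middle.
    print(edits("ab", "acab"))
    # [('insert', '', 'ac')]
```

The opcodes come from get_opcodes(), which describes how to turn the
first sequence into the second; filtering out the "equal" entries
leaves only the changes.  A gap-penalty aligner would reach the same
answers by a different route (cheaper gaps at spaces and at the ends),
which this sketch does not attempt to model.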


More information about the Python-list mailing list
