Dear stringdist developers,
Hello and greetings. I am a basic bioinformatics analyst. I'm using stringdist on the RStudio platform to perform a large number of short amino acid sequence alignments, and I hope to obtain the distance relationships between these short sequences to create a network graph. My work environment is Ubuntu 24, with 4 physical CPUs and 1 TB of RAM.
My code is as follows, where sequences is a character vector storing amino acid sequences. These sequences vary in length but are all shorter than 20 characters:
levdist <- stringdistmatrix(sequences, method = "lv")
matrix <- as.matrix(levdist)
When the length of sequences is less than 10,000, the above code works well without any errors. However, when dealing with a large number of amino acid sequences (my goal is six million sequences), the stringdistmatrix function runs successfully, but working with the resulting levdist object produces the following error:
> matrix <- as.matrix(levdist)
Error in sequence.default(n..1, from = seq.int(s.1, length(df), s.1)) :
'from' contains NAs
In sequence.default(n..1, from = seq.int(s.1, length(df), s.1)) :
NAs introduced by coercion to integer range
Not only does the as.matrix function cause this error, but even simple operations like unique(levdist) or attempting to print levdist directly result in the same error.
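As a rough size check (a back-of-the-envelope sketch, not taken from the stringdist documentation), a dist object over n sequences holds n(n-1)/2 pairwise distances. The snippet below only does that arithmetic and compares it with the largest value an R integer can hold. The 8 bytes per value is an assumption that distances are stored as doubles, and whether this is what actually triggers the error above is not confirmed.

```r
# Number of pairwise distances in a 'dist' object over n items: n * (n - 1) / 2.
# Computed in double precision, so the arithmetic itself does not overflow.
n_pairs <- function(n) n * (n - 1) / 2

pairs_small  <- n_pairs(10000)  # the size that works without errors
pairs_target <- n_pairs(6e6)    # the target of six million sequences

# Largest value an R integer can hold (2^31 - 1)
int_max <- .Machine$integer.max

# Assumption: each distance is stored as an 8-byte double
bytes_target <- pairs_target * 8

fmt <- function(x) format(x, big.mark = ",", scientific = FALSE)
cat("pairs for n = 10,000:    ", fmt(pairs_small), "\n")
cat("pairs for n = 6,000,000: ", fmt(pairs_target), "\n")
cat("integer maximum:         ", fmt(int_max), "\n")
cat("exceeds integer range:   ", pairs_target > int_max, "\n")
cat("approx. size in TB:      ", fmt(round(bytes_target / 1e12)), "\n")
```

If this arithmetic is relevant, the number of pairs for the full data set is far beyond both the integer range mentioned in the warning and the 1 TB of RAM available.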
I have tried searching for relevant information online but have not found an answer. I am unsure where the error lies.
Furthermore, do you think there is a problem with my method of converting the dist format to a matrix? Is there a better way to create a network graph besides using the igraph package to interpret the matrix?
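To make the question concrete, one alternative that avoids the full matrix would be to compute distances block by block and keep only the pairs within a small threshold as an edge list. The following is a minimal sketch under assumptions, not a method confirmed by the stringdist developers: edges_within, max_dist and block_size are hypothetical names, and base R's utils::adist (Levenshtein distance) is used as a stand-in so the example runs without extra packages. stringdistmatrix(a, b, method = "lv") could presumably be substituted for the adist call.

```r
# Hypothetical helper: collect sequence pairs whose Levenshtein distance is
# at most max_dist, one block of rows at a time, so that the full n x n
# matrix is never held in memory.
edges_within <- function(seqs, max_dist = 1, block_size = 1000) {
  n <- length(seqs)
  starts <- seq(1, n, by = block_size)
  out <- vector("list", length(starts))
  for (k in seq_along(starts)) {
    i <- starts[k]:min(starts[k] + block_size - 1, n)  # rows of this block
    j <- i[1]:n                                        # only columns from the block start onward
    # Stand-in distance call; rows correspond to seqs[i], columns to seqs[j]
    d <- utils::adist(seqs[i], seqs[j])
    hit <- which(d <= max_dist, arr.ind = TRUE)
    from <- i[hit[, "row"]]
    to <- j[hit[, "col"]]
    keep <- from < to  # keep each unordered pair once, drop self-pairs
    out[[k]] <- data.frame(
      from = seqs[from[keep]],
      to = seqs[to[keep]],
      dist = d[hit][keep],
      stringsAsFactors = FALSE
    )
  }
  do.call(rbind, out)
}

# Toy example with made-up sequences
toy <- c("AAA", "AAB", "ABB", "CCC")
edges <- edges_within(toy, max_dist = 1, block_size = 2)
print(edges)
```

An edge list like this could then be passed to igraph::graph_from_data_frame(edges, directed = FALSE) instead of converting a dist object to a matrix. Note that this only reduces memory use; the number of distance computations for six million sequences stays the same.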
I sincerely look forward to your answers and appreciate your help!
Thank you very much!