Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
/bwaPublic
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Use SAM headers from index directory's<prefix>.hdr if it exists#348

Open
jmarshall wants to merge2 commits intolh3:master
base:master
Choose a base branch
Loading
fromjmarshall:headerfile

Conversation

jmarshall
Copy link
Contributor

Introduce a<prefix>.hdr file that the sysadmin or user should place alongside the other index files, useful for setting up detailed@SQ headers etc to be used by default when mapping.

The idea is to make it easy to have rich@SQ headers (and other headers if desired) in bwa mapped output. Rather than needing to use-H every time, you can set up a<prefix>.hdr file at the same time as you generate the bwa index. In particular, this can be used to add descriptive fields such as MD5, species, assembly, and alternative names (which enables querying for e.g.chr1 or1 interchangeably; seesamtools/hts-specs#100 andsamtools/htslib#931).

This PR adds to themem/samse/sampe commands so that they output SAM headers from:

  • -H options;
  • a.hdr file stored with the index files;
  • the basic hard-coded@HD/@SQ default headers.

An@HD header is always printed, with an@HD line from-H overriding one from<prefix>.hdr (if any), which overrides the default one added since#336. A set of@SQ headers is always printed, with-H options overriding<prefix>.hdr overriding the default, similarly.

Other headers specified in either-H or<prefix>.hdr are all printed.

If-H includes@SQ headers, we consider that the user has carefully set up all the headers to be output and ignore<prefix>.hdr entirely.

Also adds a brief description of<prefix>.alt and a full description of<prefix>.hdr to the manual page.

@nh13
Copy link
Contributor

What if it just uses a .dict file by default when present? while .hdr is more flexible, .dict files are already in use.

@jmarshall
Copy link
ContributorAuthor

jmarshall commentedMar 19, 2022
edited
Loading

The contents (a bunch of SAM headers) would be much the same, just the filename would be different. I considered using the existing.dict convention for this, but there are several reasons not to:

  1. For a FASTA reference fileGRCh38.fa, the usual name for a corresponding dict file would beGRCh38.dict but all the rest of bwa's index files are of the formGRCh38.fa.xyz. (This is the most compelling reason.)
  2. All (except one perhaps, which is shorter) of bwa's index files use three letter extensions. (I would not want to stop this working on DOS 😄)
  3. AIUI Sequence Dictionaries contain only@SQ and maybe@HD headers. The intention here is that it may contain any header — there would be use cases for adding constant@PG or@CO headers to your mapping output.
  4. The purpose of this file is just to have a bunch of headers to be copied into the output. Hence.hdr. If the filename were.dict, users might infer that it was used by bwa as a dictionary to do lookups of some kind.

(You can always usebwa mem -H foo.dict … to use a dict file with whatever filename it might have.)

@nh13
Copy link
Contributor

nh13 commentedMar 19, 2022
edited
Loading

Are there types of lines that should be disallowed? In (1), it looks like the intention is that the file should live statically alongside the other bwa files. So RG, and perhaps even PG, should not be static. RG is obvious as to why, but PG being in the file would be misleading if we use a newer version of bwa with older and compatible index folds.

Given .dict files are found a lot in the wild, and can be generated by multiple tools, how about they get used if the .hdr file does not exist? I find having yet another auxiliary file I have to maintain when .dict already exists confusing and cumbersome. If folks want a custom header, they can just specify it similar to your .dict example with the -H option. And is this file generated by bwa, then hand modified to add additional lines, or even add AH fields to SQ lines? If so, then should it really be codified as a bwa generated static file? I don’t think so.

@jmarshall
Copy link
ContributorAuthor

jmarshall commentedMar 19, 2022
edited
Loading

  1. As described, the intention is for this to be a static file alongside the other bwa index files. Like the.alt file, it is prepared separately rather than bybwa index. (It might be worth adding code so thatbwa index would generate a starting point for this file, but as tools likeCreateSequenceDictionary andsamtools dict already do this there's not much need.)
  2. BWA already adds a@PG ID:bwa … line containing its version number. If you added some static@PG lines to this file, it would be to represent other early-stage processing such as bcl2fastq. (This is indeed a bit esoteric;@SQ will be what is common in this file.)
  3. Users could indeed get the effects of this PR by always usingbwa mem -H /path/to/ref.fa.hdr … every time they run BWA to do mapping (mem users anyway;samse/sampe users do not have such an option). But the point of this PR is to make it easy: to make decorated@SQ lines something that can be set up once when you set up the index.
  4. The primary motivation for this is to addSP,AS,M5,TP, andAN fields to the@SQ lines. Especially the latter are not fully automatable, so will require hand-editing from some starting point.
  5. You are correct that manually addingAH fields corresponding to the.alt file would be painful. This is why I have a followup samtools PR coming, which will add an option tosamtools dict to read the.alt file and emitAH appropriately. (And this is the real reason why this starting point is best prepared by an external dict tool: the.alt file may not exist yet whenbwa index is run, so that's really too soon to create this starting point.)

This fileis a dict file, just with a slightly different purpose, so the different name adds some flexibility. But if@lh3 prefers, I would add the code to strip a suffix and add.dict instead of just adding.hdr to…ref.fa in the same way that other index file names are constructed.

For the time being, maybe let's stop quibbling about the filename so that we can hear the BWA maintainer's thoughts about the proposed feature 😄

@nh13
Copy link
Contributor

@jmarshall open to whatever works here

Introduce a <prefix>.hdr file alongside the other index files, useful forsetting up detailed `@SQ` headers etc to be used by default when mapping.Output SAM headers from: -H options; .hdr file stored with the index files;the basic hard-coded `@HD`/`@SQ` default headers. An `@HD` header is alwaysprinted, with -H overriding one from <prefix>.hdr (if any), which overridesthe default one. A set of `@SQ` headers is always printed, with -H optionsoverriding <prefix>.hdr overriding the default, similarly. Other headersspecified in either -H or <prefix>.hdr are all output.If -H includes `@SQ` headers, we consider that the user has carefullyset up all headers to be output and ignore <prefix>.hdr entirely.Add descriptions of <prefix>.alt (brief) and <prefix>.hdr (complete)to the manual page. This <prefix>.hdr file is implemented for mem, bwase,and bwape. It is not implemented for bwasw, which is deprecated.
Now outputs SAM headers from: -H options; .hdr file stored with theindex files if present, or otherwise a .dict file if that is present;the basic hard-coded `@HD`/`@SQ` default headers.
@jmarshall
Copy link
ContributorAuthor

@nh13: I've updated it to accept either:— hence it reads headers from<prefix>.hdr if present, otherwise reads headers from<baseprefix>.dict if that is present; (otherwise does the default thing).

The string munching required to get fromfoo.fa[sta][.gz] tofoo.dict was not much fun to write 😄 but you are right that accepting that filename convention as well is the best thing to do.

Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers
No reviews
Assignees
No one assigned
Labels
None yet
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

2 participants
@jmarshall@nh13

[8]ページ先頭

©2009-2025 Movatter.jp