@@ -13,7 +13,7 @@ programs, and can aid you in debugging.
1313How autograd encodes the history
1414--------------------------------
1515
16- Autograd is reverse automatic differentiation system. Conceptually,
16+ Autograd is a reverse automatic differentiation system. Conceptually,
1717autograd records a graph recording all of the operations that created
1818the data as you execute operations, giving you a directed acyclic graph
1919whose leaves are the input tensors and roots are the output tensors.
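
For illustration only, a rough sketch of this recording using standard PyTorch
calls might look like::

    import torch

    a = torch.randn(2, requires_grad=True)   # leaf (input) tensor
    b = torch.randn(2, requires_grad=True)   # leaf (input) tensor
    out = (a * b).sum()                      # root (output) of the recorded graph
    out.backward()                           # traverse the graph from roots to leaves
    print(a.grad)                            # d(out)/da, i.e. b
    print(b.grad)                            # d(out)/db, i.e. a
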
@@ -23,11 +23,11 @@ compute the gradients using the chain rule.
2323Internally, autograd represents this graph as a graph of
2424:class:`Function` objects (really expressions), which can be
2525:meth:`~torch.autograd.Function.apply` ed to compute the result of
26- evaluating the graph. When computing the forwards pass, autograd
26+ evaluating the graph. When computing the forward pass, autograd
2727simultaneously performs the requested computations and builds up a graph
2828representing the function that computes the gradient (the ``.grad_fn``
2929attribute of each :class:`torch.Tensor` is an entry point into this graph).
30- When the forwards pass is completed, we evaluate this graph in the
30+ When the forward pass is completed, we evaluate this graph in the
3131backwards pass to compute the gradients.
3232
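As a small illustrative sketch (standard attributes only), the entry points into
this graph can be inspected via ``grad_fn``::

    import torch

    x = torch.randn(3, requires_grad=True)
    y = x.exp().sum()
    print(y.grad_fn)                 # e.g. <SumBackward0 object at ...>
    print(y.grad_fn.next_functions)  # edges leading further into the recorded graph
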
3333An important thing to note is that the graph is recreated from scratch at every
@@ -119,7 +119,7 @@ For more fine-grained exclusion of subgraphs from gradient computation,
119119there is setting the ``requires_grad`` field of a tensor.
120120
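For example (a minimal sketch using a plain ``nn.Linear`` module), freezing
parameters this way keeps them out of the recorded graph::

    import torch
    from torch import nn

    model = nn.Linear(4, 2)
    for p in model.parameters():
        p.requires_grad_(False)   # exclude these parameters from gradient computation

    out = model(torch.randn(1, 4))
    print(out.requires_grad)      # False: no input to the computation required grad
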
121121Below, in addition to discussing the mechanisms above, we also describe
122- evaluation mode (:meth:`nn.Module.eval()`), a method that is not actually used
122+ evaluation mode (:meth:`nn.Module.eval()`), a method that is not used
123123to disable gradient computation but, because of its name, is often mixed up with the three.
124124
125125Setting ``requires_grad``
@@ -164,16 +164,16 @@ of the module's parameters (which have ``requires_grad=True`` by default).
164164Grad Modes
165165^^^^^^^^^^
166166
167- Apart from setting ``requires_grad`` there are also three possible modes
168- enableable from Python that can affect how computations in PyTorch are
167+ Apart from setting ``requires_grad`` there are also three grad modes that can
168+ be selected from Python, which affect how computations in PyTorch are
169169processed by autograd internally: default mode (grad mode), no-grad mode,
170170and inference mode, all of which can be toggled via context managers and
171171decorators.
172172
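Roughly, toggling the modes looks like the following minimal sketch (using the
standard ``torch.no_grad`` and ``torch.inference_mode`` helpers)::

    import torch

    x = torch.randn(3, requires_grad=True)

    with torch.no_grad():            # no-grad mode
        y = x * 2
    print(y.requires_grad)           # False

    with torch.inference_mode():     # inference mode
        z = x * 2
    print(z.requires_grad)           # False

    @torch.no_grad()                 # the modes can also be used as decorators
    def double(t):
        return t * 2
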
173173Default Mode (Grad Mode)
174174^^^^^^^^^^^^^^^^^^^^^^^^
175175
176- The "default mode" is actually the mode we are implicitly in when no other modes like
176+ The "default mode" is the mode we are implicitly in when no other modes like
177177no-grad and inference mode are enabled. To be contrasted with
178178"no-grad mode" the default mode is also sometimes called "grad mode".
179179
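As a small sketch of what "default mode" means in practice::

    import torch

    print(torch.is_grad_enabled())   # True: no other mode is active
    x = torch.randn(2, requires_grad=True)
    y = x * 2
    print(y.requires_grad)           # True: the operation was recorded
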
@@ -237,7 +237,7 @@ For implementation details of inference mode see
237237Evaluation Mode (``nn.Module.eval()``)
238238^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
239239
240- Evaluation mode is not actually a mechanism to locally disable gradient computation.
240+ Evaluation mode is not a mechanism to locally disable gradient computation.
241241It is included here anyway because it is sometimes mistaken for such a mechanism.
242242
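A minimal sketch of the distinction (using a ``Dropout`` layer as an example of
behaviour that ``eval()`` does change)::

    import torch
    from torch import nn

    model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
    model.eval()                     # only changes Dropout/BatchNorm behaviour
    x = torch.randn(1, 4, requires_grad=True)
    print(model(x).requires_grad)    # True: eval() does not disable grad tracking

    with torch.no_grad():            # gradients are disabled by a grad mode instead
        print(model(x).requires_grad)    # False
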
243243Functionally, ``module.eval()`` (or equivalently ``module.train(False)``) is completely
@@ -263,21 +263,21 @@ In-place operations with autograd
263263Supporting in-place operations in autograd is a hard matter, and we discourage
264264their use in most cases. Autograd's aggressive buffer freeing and reuse makes
265265it very efficient and there are very few occasions when in-place operations
266- actually lower memory usage by any significant amount. Unless you're operating
266+ lower memory usage by any significant amount. Unless you're operating
267267under heavy memory pressure, you might never need to use them.
268268
269269There are two main reasons that limit the applicability of in-place operations:
270270
2712711. In-place operations can potentially overwrite values required to compute
272272 gradients.
273273
274- 2. Every in-place operation actually requires the implementation to rewrite the
274+ 2. Every in-place operation requires the implementation to rewrite the
275275 computational graph. Out-of-place versions simply allocate new objects and
276276   keep references to the old graph, while in-place operations require
277277   changing the creator of all inputs to the :class:`Function` representing
278278 this operation. This can be tricky, especially if there are many Tensors
279279 that reference the same storage (e.g. created by indexing or transposing),
280-   and in-place functions will actually raise an error if the storage of
280+ and in-place functions will raise an error if the storage of
281281   modified inputs is referenced by any other :class:`Tensor`.
282282
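As a small sketch of the first point, modifying in place a value that an earlier
operation saved for backward makes the backward pass fail::

    import torch

    x = torch.randn(3, requires_grad=True)
    y = torch.sigmoid(x)   # sigmoid saves its output to compute its gradient
    y.mul_(2)              # in-place update of a value needed for backward
    # y.sum().backward()   # would raise a RuntimeError complaining that a variable
    #                      # needed for gradient computation was modified in place
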
283283In-place correctness checks
@@ -338,18 +338,18 @@ serializing all the backward calls in a specific order during execution
338338Non-determinism
339339^^^^^^^^^^^^^^^
340340
341- If you are calling ``backward()`` on multiple thread concurrently but with
342- shared inputs (i.e. Hogwild CPU training). Since parameters are automatically
343- shared across threads, gradient accumulation might become non-deterministic on
344- backward calls across threads, because two backward calls might access and try
345- to accumulate the same ``.grad`` attribute. This is technically not safe, and
346- it might result in racing condition and the result might be invalid to use.
341+ If you are calling ``backward()`` from multiple threads concurrently and have
342+ shared inputs (i.e. Hogwild CPU training), then non-determinism should be expected.
343+ This can occur because parameters are automatically shared across threads;
344+ as such, multiple threads may access and try to accumulate the same ``.grad``
345+ attribute during gradient accumulation. This is technically not safe, and
346+ it might result in a race condition and the result might be invalid to use.
347347
348- But this is expected pattern if you are using the multithreading approach to
349- drive the whole training process but using shared parameters, user who use
350- multithreading should have the threading model in mind and should expect this
351- to happen. User could use the functional API :func:`torch.autograd.grad` to
352- calculate the gradients instead of ``backward()`` to avoid non-determinism.
348+ Users developing multithreaded models featuring shared parameters should have the
349+ threading model in mind and should understand the issues described above.
350+
351+ The functional API :func:`torch.autograd.grad` may be used to calculate the
352+ gradients instead of ``backward()`` to avoid non-determinism.
353353
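A minimal sketch of the functional API mentioned above::

    import torch

    x = torch.randn(4, requires_grad=True)
    loss = (x ** 2).sum()
    (grad_x,) = torch.autograd.grad(loss, x)  # gradients are returned, not accumulated
    print(grad_x)                             # 2 * x
    print(x.grad)                             # None: .grad was never written to
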
354354Graph retaining
355355^^^^^^^^^^^^^^^
@@ -368,9 +368,9 @@ Thread Safety on Autograd Node
368368
369369Since Autograd allows the caller thread to drive its backward execution for
370370potential parallelism, it's important that we ensure thread safety on CPU with
371- parallel backwards that share part/whole of the GraphTask.
371+ parallel ``backward()`` calls that share part/whole of the GraphTask.
372372
373- Custom Python ``autograd.Function`` is automatically thread safe because of GIL.
373+ Custom Python ``autograd.Function``\s are automatically thread safe because of GIL.
374374For built-in C++ Autograd Nodes (e.g. AccumulateGrad, CopySlices) and custom
375375``autograd::Function``\s, the Autograd Engine uses thread mutex locking to ensure
376376thread safety on autograd Nodes that might have state write/read.
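
For example, a rough sketch of several threads each driving their own backward
pass (disjoint graphs, so no shared-gradient issues arise)::

    import threading

    import torch

    def train_fn():
        x = torch.ones(5, 5, requires_grad=True)
        y = ((x + 3) * (x + 4) * 0.5).sum()
        y.backward()   # each thread drives backward for its own graph

    threads = [threading.Thread(target=train_fn) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
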
@@ -440,8 +440,8 @@ It also turns out that no interesting real-valued objective fulfill the
441441Cauchy-Riemann equations. So the theory with holomorphic functions cannot be
441441used for optimization and most people therefore use the Wirtinger calculus.
442442
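For reference, writing :math:`f(z) = u(x, y) + i v(x, y)` with :math:`z = x + iy`,
the Cauchy-Riemann equations require

.. math::
    \frac{\partial u}{\partial x} = \frac{\partial v}{\partial y},
    \qquad
    \frac{\partial u}{\partial y} = -\frac{\partial v}{\partial x}

and a real-valued objective has :math:`v \equiv 0`, which forces both partial
derivatives of :math:`u` to vanish, i.e. the function is locally constant.
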
443- Wirtinger Calculus comes in picture ...
444- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
443+ Wirtinger Calculus comes into the picture ...
444+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
445445
446446So, we have this great theory of complex differentiability and
447447holomorphic functions, and we can’t use any of it at all, because many