Speed up an integer to the power of a positive integer on CPU #26020
Conversation
Force-push: 2cf14ad to 5f3242b

pbelevich commented Sep 13, 2019
@pytorchbot rebase this please
pytorchbot commented Sep 13, 2019
Sorry, only maintainers are authorized to rebase other people's PRs. Feel free to try again on one of your PRs! (To learn more about this bot, see Bot commands.)
xuhdev commented Sep 13, 2019
@pytorchbot rebase this please
Force-push: b5052d1 to 5810990

xuhdev commented Sep 16, 2019
@pytorchbot rebase this please
xuhdev commented Sep 18, 2019
@pytorchbot rebase this please
VitalyFedyunin commented Sep 18, 2019
Did you forget to add the implementation files? I see only tests =)
xuhdev commented Sep 18, 2019 (edited)
@VitalyFedyunin 🤦♂️ 🤦♂️ 🤦♂️ I got them lost during updates...
facebook-github-bot left a comment
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
VitalyFedyunin commented Sep 19, 2019
@pytorchbot retest this please
VitalyFedyunin left a comment (edited)
At this rate, the pow kernels will take up a significant part of the binary size.
xuhdev commented Sep 19, 2019
I don't see any reason why the size of the kernel will increase significantly...
VitalyFedyunin commented Sep 20, 2019
Please rebase
xuhdev commented Sep 20, 2019
@VitalyFedyunin Magic again 😆
facebook-github-bot left a comment
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
xuhdev commented Sep 20, 2019
Is
xuhdev commented Sep 20, 2019
OK I got it. I removed
Current integer scalar exponents are always cast to double. This commit avoids the cast if the tensor is also integral and the scalar is positive, to speed things up.

Benchmark (Debian Buster, g++ 8, Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz, Debug build, Turbo turned off):

```python
import timeit

for n, t in [(1000, 13000), (10_000, 1300)]:
    for e in (2, 3, 4):
        for dtype in ('torch.int16', 'torch.int32', 'torch.int64'):
            print(f'a.pow({e}) (a.numel() == {n}) for {t} times')
            print(f'dtype {dtype}, {t} times', end='\t\t')
            print(timeit.timeit(
                f'a.pow({e})',
                setup=f'import torch; a = torch.arange({n}, device="cpu", dtype={dtype})',
                number=t))
```

Before (total seconds for the given number of iterations):

| call     | numel | iterations | torch.int16        | torch.int32        | torch.int64        |
|----------|-------|------------|--------------------|--------------------|--------------------|
| a.pow(2) | 1000  | 13000      | 1.6958350749996498 | 0.7989626339999631 | 0.7973162800003593 |
| a.pow(3) | 1000  | 13000      | 1.8660746679997828 | 0.8101709959996697 | 0.8135280149999744 |
| a.pow(4) | 1000  | 13000      | 5.010833072999958  | 4.801007671999741  | 3.963344578000033  |
| a.pow(2) | 10000 | 1300       | 1.6216251330001796 | 0.5672429639998882 | 0.5544572270000572 |
| a.pow(3) | 10000 | 1300       | 1.656308512999658  | 1.502670819999821  | 0.5757876879997639 |
| a.pow(4) | 10000 | 1300       | 4.775718216999849  | 4.754745475000163  | 3.737249878000057  |

After (total seconds for the given number of iterations):

| call     | numel | iterations | torch.int16        | torch.int32        | torch.int64        |
|----------|-------|------------|--------------------|--------------------|--------------------|
| a.pow(2) | 1000  | 13000      | 1.1006453190002503 | 1.0849009019998448 | 1.093259106000005  |
| a.pow(3) | 1000  | 13000      | 1.0859826279997833 | 1.1076840900000207 | 1.0755480369998622 |
| a.pow(4) | 1000  | 13000      | 1.918211066999902  | 1.9183043200000611 | 1.930021430999659  |
| a.pow(2) | 10000 | 1300       | 0.7271483560002707 | 0.7289002070001516 | 0.7267536800000016 |
| a.pow(3) | 10000 | 1300       | 0.7301799359997858 | 0.7289195180001116 | 0.7270008230002531 |
| a.pow(4) | 10000 | 1300       | 1.5354506029998447 | 1.528263066999898  | 1.5369428439998956 |
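For context, the speedup comes from keeping the whole computation in the tensor's integral type instead of routing the exponent (and the intermediate arithmetic) through double; for example, `a.pow(4)` on a 1000-element int16 tensor drops from roughly 5.0 s to roughly 1.9 s in the numbers above. Below is a minimal Python sketch of exponentiation by squaring, one common way to raise an integral base to a non-negative integral power without touching floating point (illustrative only, not the actual ATen kernel code):

```python
def int_pow(base: int, exp: int) -> int:
    """Exponentiation by squaring; stays in integer arithmetic throughout."""
    if exp < 0:
        raise ValueError("only non-negative exponents are handled here")
    result = 1
    while exp:
        if exp & 1:      # odd exponent: fold the current base into the result
            result *= base
        base *= base     # square the base for the next binary digit of exp
        exp >>= 1
    return result

# Quick sanity checks
assert int_pow(3, 4) == 81
assert int_pow(7, 0) == 1
assert int_pow(-2, 3) == -8
```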
VitalyFedyunin commented Sep 20, 2019
https://github.com/pytorch/pytorch/wiki/Writing-tests-that-run-on-all-available-device-types
The world changes super fast
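For reference, a test written against that device-generic framework might look roughly like the sketch below. This is a minimal illustration: the import paths are the present-day `torch.testing._internal` locations of the helpers, which may differ from where they lived at the time of this PR.

```python
import torch
from torch.testing._internal.common_utils import TestCase, run_tests
from torch.testing._internal.common_device_type import (
    instantiate_device_type_tests,
    dtypes,
)

class TestIntPow(TestCase):
    # The framework instantiates this generic test once per available device
    # type (CPU, CUDA, ...) and passes the device string in as `device`.
    @dtypes(torch.int16, torch.int32, torch.int64)
    def test_integer_pow(self, device, dtype):
        a = torch.arange(10, device=device, dtype=dtype)
        self.assertEqual(a.pow(2), a * a)

instantiate_device_type_tests(TestIntPow, globals())

if __name__ == "__main__":
    run_tests()
```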
facebook-github-bot left a comment
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: Current integer scalar exponents are always cast to double. This commit avoids the cast if the tensor is also integral and the scalar is positive, to speed things up. Benchmark: same setup and numbers as in the description above.

Best viewed with whitespace changes turned off.

Pull Request resolved: pytorch/pytorch#26020
Differential Revision: D17485400
Pulled By: VitalyFedyunin
fbshipit-source-id: 3a16b074825a5aab0f7e7af3d8100f9e4b7011a3

facebook-github-bot commented Sep 24, 2019
@VitalyFedyunin merged this pull request in ae0732c.
They were accidentally removed in #26020 [ghstack-poisoned]
* Typo fix (#26417)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26417Signed-off-by: Edward Z. Yang <ezyang@fb.com>Test Plan: Imported from OSSDifferential Revision: D17548776Pulled By: ezyangfbshipit-source-id: 8c79893ee4216780edb838671e701de5518c4cd0* Don't generate named tensor functions to RegistrationFunctions.h (#26685)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26685This prevents XLA from picking up on named tensor APIs. I ran into someproblems while attempting to support dimname overloads in XLA; since wedon't need the first iteration of named tensors to work with XLA this isOK.Test Plan: - run CI.Differential Revision: D17538893Pulled By: zou3519fbshipit-source-id: 93d579c93f5b1dc68541c07c4a3d61792859507d* Updating submodulesSummary:GitHub commits:https://github.com/facebook/litho/commit/ff4a61094e9405310b39219a35c6ff8e44300573https://github.com/facebookincubator/mvfst/commit/ad81c3823ec7910296f97d2050fde181be1d4ac4https://github.com/pytorch/fbgemm/commit/518d8a1832cf1eb1dda2feace1a278e9e4f302baTest Plan: n/aReviewed By: yns88fbshipit-source-id: 2a9a47805569a43e05d044c5494b57f6a7996bc4* Add tests for C++ functional cosine_similarity and pairwise_distance, and clean up functional test code (#26559)Summary:This ensures that `F::cosine_similarity` and `F::pairwise_distance` can be used simply by including `torch/torch.h` and set `namespace F = torch::nn::functional`.Pull Request resolved: https://github.com/pytorch/pytorch/pull/26559Differential Revision: D17507421Pulled By: yf225fbshipit-source-id: f895dde3634d5c8ca66ee036903e327e5cdab6b1* allow building docker without torchvision (#26168)Summary:There is an issue with the torchvision version not matching the pytorch version if one builds the docker from a tag, see issue https://github.com/pytorch/pytorch/issues/25917. The current solution requires one to re-init the submodules or manually change the version of torchvision. This PR allows one to build the docker image without torchvision, which not only fixes the above mentioned bug but also frees non-image pytorch users from the tyranny of torchvision :laughing:.In all seriousness, for NLP researchers especially torchvision isn't a necessity for pytorch and all non-essential items shouldn't be in the docker. This option removes one extra thing that can go wrong.Pull Request resolved: https://github.com/pytorch/pytorch/pull/26168Differential Revision: D17550001Pulled By: soumithfbshipit-source-id: 48b8b9e22b75eef3afb392c618742215d3920e9d* Speed up an integer to the power of a positive integer on CPU (#26020)Summary:Current integer scalar exps are always cast to double. 
This commit avoids cast if the tensor is alsointegral and the scalar is positive to speed up.Benchmark (Debian Buster, g++ 8, Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz00:03300.00 MHz, Debugbuild, Turbo turned off):```pythonimport timeitfor n, t in [(1000, 13000), (10_000, 1300)]: for e in (2, 3, 4): for dtype in ('torch.int16', 'torch.int32', 'torch.int64'): print(f'a.pow({e}) (a.numel() == {n}) for {t} times') print(f'dtype {dtype}, {t} times', end='\t\t') print(timeit.timeit(f'a.pow({e})', setup=f'import torch; a = torch.arange({n}, device="cpu", dtype={dtype})', number=t))```Before:```a.pow(2) (a.numel() == 1000) for 13000 timesdtype torch.int16, 13000 times1.6958350749996498a.pow(2) (a.numel() == 1000) for 13000 timesdtype torch.int32, 13000 times0.7989626339999631a.pow(2) (a.numel() == 1000) for 13000 timesdtype torch.int64, 13000 times0.7973162800003593a.pow(3) (a.numel() == 1000) for 13000 timesdtype torch.int16, 13000 times1.8660746679997828a.pow(3) (a.numel() == 1000) for 13000 timesdtype torch.int32, 13000 times0.8101709959996697a.pow(3) (a.numel() == 1000) for 13000 timesdtype torch.int64, 13000 times0.8135280149999744a.pow(4) (a.numel() == 1000) for 13000 timesdtype torch.int16, 13000 times5.010833072999958a.pow(4) (a.numel() == 1000) for 13000 timesdtype torch.int32, 13000 times4.801007671999741a.pow(4) (a.numel() == 1000) for 13000 timesdtype torch.int64, 13000 times3.963344578000033a.pow(2) (a.numel() == 10000) for 1300 timesdtype torch.int16, 1300 times1.6216251330001796a.pow(2) (a.numel() == 10000) for 1300 timesdtype torch.int32, 1300 times0.5672429639998882a.pow(2) (a.numel() == 10000) for 1300 timesdtype torch.int64, 1300 times0.5544572270000572a.pow(3) (a.numel() == 10000) for 1300 timesdtype torch.int16, 1300 times1.656308512999658a.pow(3) (a.numel() == 10000) for 1300 timesdtype torch.int32, 1300 times1.502670819999821a.pow(3) (a.numel() == 10000) for 1300 timesdtype torch.int64, 1300 times0.5757876879997639a.pow(4) (a.numel() == 10000) for 1300 timesdtype torch.int16, 1300 times4.775718216999849a.pow(4) (a.numel() == 10000) for 1300 timesdtype torch.int32, 1300 times4.754745475000163a.pow(4) (a.numel() == 10000) for 1300 timesdtype torch.int64, 1300 times3.737249878000057```After:```a.pow(2) (a.numel() == 1000) for 13000 timesdtype torch.int16, 13000 times1.1006453190002503a.pow(2) (a.numel() == 1000) for 13000 timesdtype torch.int32, 13000 times1.0849009019998448a.pow(2) (a.numel() == 1000) for 13000 timesdtype torch.int64, 13000 times1.093259106000005a.pow(3) (a.numel() == 1000) for 13000 timesdtype torch.int16, 13000 times1.0859826279997833a.pow(3) (a.numel() == 1000) for 13000 timesdtype torch.int32, 13000 times1.1076840900000207a.pow(3) (a.numel() == 1000) for 13000 timesdtype torch.int64, 13000 times1.0755480369998622a.pow(4) (a.numel() == 1000) for 13000 timesdtype torch.int16, 13000 times1.918211066999902a.pow(4) (a.numel() == 1000) for 13000 timesdtype torch.int32, 13000 times1.9183043200000611a.pow(4) (a.numel() == 1000) for 13000 timesdtype torch.int64, 13000 times1.930021430999659a.pow(2) (a.numel() == 10000) for 1300 timesdtype torch.int16, 1300 times0.7271483560002707a.pow(2) (a.numel() == 10000) for 1300 timesdtype torch.int32, 1300 times0.7289002070001516a.pow(2) (a.numel() == 10000) for 1300 timesdtype torch.int64, 1300 times0.7267536800000016a.pow(3) (a.numel() == 10000) for 1300 timesdtype torch.int16, 1300 times0.7301799359997858a.pow(3) (a.numel() == 10000) for 1300 timesdtype torch.int32, 1300 times0.7289195180001116a.pow(3) (a.numel() == 10000) 
for 1300 timesdtype torch.int64, 1300 times0.7270008230002531a.pow(4) (a.numel() == 10000) for 1300 timesdtype torch.int16, 1300 times1.5354506029998447a.pow(4) (a.numel() == 10000) for 1300 timesdtype torch.int32, 1300 times1.528263066999898a.pow(4) (a.numel() == 10000) for 1300 timesdtype torch.int64, 1300 times1.5369428439998956``` ---Best viewed with whitespace changes turned offPull Request resolved: https://github.com/pytorch/pytorch/pull/26020Differential Revision: D17485400Pulled By: VitalyFedyuninfbshipit-source-id: 3a16b074825a5aab0f7e7af3d8100f9e4b7011a3* Use noop observer to pass dtype for dynamic quantization (#26709)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26709Polishes implementation from #25975. Primarily, we use NoopObserver to communicate that weights need to be quantized to float16. The very top-level API (quantize_dynamic) stays the same with `dtype` argument but the implementation follows the common flow.One can argue that dynamic fp16 quantization doesn't really fit into the 'observer' mechanism. It's in fact not ideal, but it's better to have the same flow than branching on both dtype and qconfig.Test Plan: Imported from OSSDifferential Revision: D17544103Pulled By: dzhulgakovfbshipit-source-id: 6af3f18c35929a1a53ea734079c005f656e4925f* Remove duplicate calculation of output shape (#26684)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26684Output heights and widths are already calculated by conv_p. Remove the duplicate calculation.ghstack-source-id: 90633432Test Plan:buck test mode/dev caffe2/test:quantized```Summary (total time 18.69s): PASS: 45 FAIL: 0 SKIP: 10 caffe2/test:quantized - test_qadd_scalar_relu (test_quantized.TestQuantizedOps) caffe2/test:quantized - test_equal (test_quantized.TestQuantizedOps) caffe2/test:quantized - test_qnnpack_add (test_quantized.TestQNNPackOps) caffe2/test:quantized - test_qconv_unpack (test_quantized.TestQNNPackOps) caffe2/test:quantized - test_qlinear_unpack (test_quantized.TestQNNPackOps) caffe2/test:quantized - test_compare_tensor_scalar (test_quantized.TestComparatorOps) caffe2/test:quantized - test_qconv_qnnpack (test_quantized.TestQNNPackOps) caffe2/test:quantized - test_qlinear_qnnpack (test_quantized.TestQNNPackOps) caffe2/test:quantized - test_qnnpack_relu (test_quantized.TestQNNPackOps) caffe2/test:quantized - test_qnnpack_maxpoolMore details at https://our.intern.facebook.com/intern/buck/build/3b394f1e-ab99-4e59-bdf5-2766f46e98692d (test_quantized.TestQNNPackOps) FATAL: 0 TIMEOUT: 0 OMIT: 0```Differential Revision: D17538375fbshipit-source-id: b4b60e93fdec4cc7bbf6aee7182381221dfac243* Expands TestAutogradDeviceType (#26708)Summary:- Ports all CUDA tests to TestAutogradDeviceType except those using multiple devicesPull Request resolved: https://github.com/pytorch/pytorch/pull/26708Differential Revision: D17549435Pulled By: mruberryfbshipit-source-id: b564186444201d1351934b6a7d21f67bdfca6e3b* Add traces to specialize_autograd and lower_grad_of (2nd try)Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/22752Differential Revision: D17543836Pulled By: Krovatkinfbshipit-source-id: 5cbca220943a580169bf60ac09780b6e67075d2b* Setting automatic default selection for ONNX IR v4 semantics in ONNX export API (#26146)Summary:This is a follow-up PR for https://github.com/pytorch/pytorch/pull/23284. In that PR we had removed changing the default behavior for `keep_initializers_as_input` argument to the export API. 
With this PR we are enabling that change in that if `keep_initializers_as_input` is not specified then value/behavior for this argument is chosen automatically depending on whether the export type is ONNX or not.This was part of the earlier PR was removed for further review. The test points have also been updated.This change may fail some internal tests which may require explicitly setting `keep_initializers_as_input=True` to preserve old behavior.Pull Request resolved: https://github.com/pytorch/pytorch/pull/26146Reviewed By: hl475Differential Revision: D17369677Pulled By: houseroadfbshipit-source-id: 2aec2cff50d215714ee8769505ef24d2b7865a11* Enable hub tests on MacOS (#26697)Summary:fix https://github.com/pytorch/pytorch/issues/26032.This was broken by a bad openssl release in conda. Should be fixed now. Testing...Pull Request resolved: https://github.com/pytorch/pytorch/pull/26697Differential Revision: D17542095Pulled By: ailzhangfbshipit-source-id: ba99f9b36ef2a7c793842cf91bd46fb2634ac1aa* Trivial quantized torch.mean implementationSummary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26253Test Plan: Imported from OSSDifferential Revision: D17529994Pulled By: jamesr66afbshipit-source-id: e3aff71da35b05ed61710cdb88d72b51c944168b* Remove _dequantize_per_channel in the pattern (#26680)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26680This was introduced before under the assumption that we'll have a qconv_per_tensor_affineand a qconv_per_channel_affine, but turns out we don't have these, so we'll removethse functions.Test Plan:python test/test_jit.py 'TestJit.test_quant_fusion'Imported from OSSDifferential Revision: D17542607fbshipit-source-id: b90ce5738170f0922bdc2eb1c4dbecd930f68a48* Register values listed in __constants__ as attributes of the Module. (#26581)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26581We're currently inlining immediate values of the constants directly intoIR when we generate it providing no way to access these values by theirnames later. 
This change registers such values as atrtibutes of themodule so that they are not lost after IR generation.Differential Revision: D17513451Test Plan: Imported from OSSPulled By: ZolotukhinMfbshipit-source-id: cf8f9b450e7178692211abd905ffd2d7ce5a6ce1* Un-hardcode epsilon constant in FoldConvBatchNorm2d.Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26584Test Plan: Imported from OSSDifferential Revision: D17514653Pulled By: ZolotukhinMfbshipit-source-id: 7d9cc8f619b7dbe26fa58eac37cc131929c004d4* Add doc building instructionsSummary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26553Differential Revision: D17551426Pulled By: driazatifbshipit-source-id: 53ce05882091aca4617586bc53944ee4c8b3a622* Make `is_optional` check more robust (#26312)Summary:If the `Union` contains a non-class type, `issubclass` would fail, thisadds a check for that case](https://our.intern.facebook.com/intern/diff/17505206/)Pull Request resolved: https://github.com/pytorch/pytorch/pull/26312Pulled By: driazatiDifferential Revision: D17505206fbshipit-source-id: 1331e412f938e2f08ecb079972147f11e3ec77cd* Remove _dequantize_per_tensor (#26681)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26681attTest Plan:ciImported from OSSDifferential Revision: D17542833fbshipit-source-id: 653e906b0e146763609c69ef0de7f9cf38621586* fix annotation regex for flake8 (#26694)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26694Previously we would not properly populate `errorDesc` for:```./torch/jit/__init__.py:13:1: F401 'torch.nn.ModuleList' imported but unused```because we wanted only letters and spaces. Be more permissiveTest Plan: Imported from OSSDifferential Revision: D17551999Pulled By: suofbshipit-source-id: b82567df1fa3c9729e7427dc3461bedfb40933dc* Add C++ nn::Identity (#26713)Summary:**Summary**:Adds `torch::nn::Identity` module support for the C++ API.**Issue**: https://github.com/pytorch/pytorch/issues/25883**Reviewer**: yf225Pull Request resolved: https://github.com/pytorch/pytorch/pull/26713Differential Revision: D17550982Pulled By: yf225fbshipit-source-id: f24483846e82d5d276d77a1a0c50884f3bc05112* add timeout parameter to connect function in TCPStore (#26554)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26554Previously, in `TCPStore`'s constructor we did not pass in a timeout tothe `connect` function, which thus used the default timeout (-1, so infinite).But the timeout variable in `TCPStore.cpp `is configurable by the user and set tobe 300 seconds by default, so we should be passing this into the connect function.Test Plan: see above.Differential Revision: D17486779fbshipit-source-id: 42d38a3b8d492d9e9ff09110990a8e4a3a1292b2* Add threadpool in qlinear and qconv for mobile (#26728)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26728Use Caffe2::mobile_threadpool() in linear and conv operatorsPerfWithout threadpool - 76msWith threadpool - 41 msTest Plan:python test/test_quantized.py TestQNNPackOpsImported from OSSDifferential Revision: D17553510fbshipit-source-id: dd5b06f526f65d87727ec7e3dad0a5fa74cba9f9* Update ONNX Export for Interpolate in Opset 11 (#24805)Summary:- Add support for linear and cubic interpolate in opset 11.- Add support for 1d and 3d interpolate in nearest mode for opset 7 and 8.- Add tests for all cases of interpolate in ORT tests (nearest/linear/cubic, 1d/2d/3d, upsample/downsample).Pull Request resolved: https://github.com/pytorch/pytorch/pull/24805Reviewed By: 
hl475Differential Revision: D17330801Pulled By: houseroadfbshipit-source-id: 1bdefff9e72f5e70c51f4721e1d7347478b7505b* Refactor android torchvision: not hardcoded mean/std (#26690)Summary:- Normalization mean and std specified as parameters instead of hardcode - imageYUV420CenterCropToFloat32Tensor before this change worked only with square tensors (width==height) - added generalization to support width != height with all rotations and scalings- javadocsPull Request resolved: https://github.com/pytorch/pytorch/pull/26690Differential Revision: D17556006Pulled By: IvanKobzarevfbshipit-source-id: 63f3321ea2e6b46ba5c34f9e92c48d116f7dc5ce* Simplify operator `sign` using the helper.Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25592Test Plan: Imported from OSSDifferential Revision: D17552470Pulled By: VitalyFedyuninfbshipit-source-id: 6c8cc4f46dd390c231b2d0aac664ad2a6ac8876e* Revert D17514653: [quant] Un-hardcode epsilon constant in FoldConvBatchNorm2d.Test Plan: revert-hammerDifferential Revision:D17514653Original commit changeset: 7d9cc8f619b7fbshipit-source-id: 2cf32082a46fe169a1db4926df78a9f3256616ad* Revert D17513451: Register values listed in __constants__ as attributes of the Module.Test Plan: revert-hammerDifferential Revision:D17513451Original commit changeset: cf8f9b450e71fbshipit-source-id: 319ec9399173eb06556969dc6be365b319c1ab6c* Make ONNX_ATEN_FALLBACK also works for _export (#26738)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26738someone may use torch._export directly. Here we change the onnx_export_type's default value to None,and if it's pytorch onnx caffe2 bundle, we set it to ONNX_ATEN_FALLBACK, otherwise, it's ONNX.Test Plan: ciReviewed By: hl475Differential Revision: D17546452fbshipit-source-id: 38e53926e2b101484bbbce7b58ebcd6af8c42438* Address review comments in https://github.com/pytorch/pytorch/pull/26272 (#26587)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26587-ghstack-source-id: 90557226Test Plan: unit testsDifferential Revision: D17515048fbshipit-source-id: 3459ee80efec29080060ec29d67642d789dd8749* move more functions to InsertObserversHelper (#26696)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26696attTest Plan:ciImported from OSSDifferential Revision: D17558701fbshipit-source-id: 96ef87db74bd1a5d4ddc69867ae71d78c0df83fd* Added test case for reinit (#26506)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26506[pytorch] [distributed] Made test forgiving to allow rpc agent to return one of the two errors.ghstack-source-id: 90667534Test Plan: Made sure pg based UT works.Differential Revision: D17488899fbshipit-source-id: 41f76cf4b4a0ca5e651a5403d6e67b639f0b9c4f* Switch our Android CI to Clang (#26656)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26656Updating the NDK to r18 or newer triggers a path in our CI scripts so that we now build with clang instead of gcc.Google discontinued the gcc support for android quite a while ago, clang is the only way forward.ghstack-source-id: 90698985Test Plan: CIReviewed By: dreissDifferential Revision: D17533570fbshipit-source-id: 5eef4d5a539d8bb1a6682f000d0b5d33b3752819* quantized_tensor tests (#25429)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/25429Previously we are using empty to generate test tensors, this PR changes the test tensors to userandint so that we can test things properlyAlso added a set_sizes_and_strides and removed .contiguous() in int_repr 
function to preserve theoriginal size and stridesTest Plan:python test/test_quantized_tensor.pyImported from OSSDifferential Revision: D17559660fbshipit-source-id: d4ce81d577296c1137270fdaa6b1359fb703896f* Add a lot of dimname overloads (#26636)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26636This PR defines a lot of dimname overloads so that when named tensorsupport is added for those operators, we will not have to modify theautogenerated TensorMethods.h, thereby avoiding potential mergeconflicts in the future.Overloads were added for the following:- all- any- argmax- argmin- cumsum- cumprod- index_copy- kthvalue- mode- permute- squeeze- index_add- index_fill- scatter- scatter_add- index_select- gather- sort- argsortTest Plan: - [namedtensor ci]Differential Revision: D17522984Pulled By: zou3519fbshipit-source-id: eca6dea819ba4e4e43b71b700d5cf09176f00061* Automatic update of fbcode/onnx to ab6b94203c595f74b1f126eb118eef22e4c05a57 (#26736)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26736Previous import was 23bb6ea1a71f08e200114a153f48bd7adb66d486Included changes:- **[ab6b9420](https://github.com/onnx/onnx/commit/ab6b9420)**: Relax IF's shape inference rule (#2345) <Wei-Sheng Chin>- **[c5af774a](https://github.com/onnx/onnx/commit/c5af774a)**: Clarify behavior in ConvTranspose (#2343) <Wei-Sheng Chin>- **[a20ba2f1](https://github.com/onnx/onnx/commit/a20ba2f1)**: Fix node test case model for Gemm scalar bias case (#2342) <Hariharan Seshadri>- **[1aa176e0](https://github.com/onnx/onnx/commit/1aa176e0)**: Update pybind (#2340) <Changming Sun>- **[7840504d](https://github.com/onnx/onnx/commit/7840504d)**: Update gen_doc script to validate proto3 files (#2122) <Raymond Yang>- **[bd35e623](https://github.com/onnx/onnx/commit/bd35e623)**: Fix some backend tests (#2335) <Hariharan Seshadri>Test Plan: ciReviewed By: hl475Differential Revision: D17552449fbshipit-source-id: 424acb261b54fc98485f782f6922b11b28c836eb* Add whitelist for backward compatible checks for function schemas (#26740)Summary:Now, we skip all function schema contains quantize key wordPull Request resolved: https://github.com/pytorch/pytorch/pull/26740Reviewed By: hl475Differential Revision: D17561753Pulled By: houseroadfbshipit-source-id: c5e47ada072e71bfa2341a0af8f1743e86ef733c* Revert D17558701: [refactor] move more functions to InsertObserversHelperTest Plan: revert-hammerDifferential Revision:D17558701Original commit changeset: 96ef87db74bdfbshipit-source-id: fc398d3b8bb1cd0bae573e3fdac5cfb883b31373* Wrap dimensions during named inference (#26558)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26558Previously, name inference gets called after dimensions are wrapped.This PR makes it so that name inference always wraps dimensions so thatit can be called anywhere. 
Ideally we would only wrap dimensions once,but many of our operators wrap dimensions in weird places.Wrapping dimensions in name inference is pretty inexpensive and onlyhappens for named tensors (name inference does not run on unnamedtensors.)Test Plan: - [namedtensor ci]Differential Revision: D17557049Pulled By: zou3519fbshipit-source-id: 68c5636489e233dbf2588ab6ad4e379a6fe4c8ba* Fix builtin lookup for Python functionsSummary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26688Pulled By: driazatiDifferential Revision: D17560634fbshipit-source-id: e1c50d1ca24e0313c2b7d704c488a29ef6a47cad* Revert D17330801: [pytorch][PR] Update ONNX Export for Interpolate in Opset 11Test Plan: revert-hammerDifferential Revision:D17330801Original commit changeset: 1bdefff9e72ffbshipit-source-id: dff07477403170c27260f736ab6e6010f0deca9f* Revert D17559660: [fix] quantized_tensor testsTest Plan: revert-hammerDifferential Revision:D17559660Original commit changeset: d4ce81d57729fbshipit-source-id: b6c9dc31f08935d255fa9eb3a830bafc76a13799* use new fbgemm PackedDepthWiseConvMatrix without template parameter (#26760)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26760Follow-up of D17514003 . Change Caffe2 code to use the new PackedDepthWiseConvMatrix interface.Test Plan: CIReviewed By: dskhudiaDifferential Revision: D17514350fbshipit-source-id: 691d9f1fd35bdb7dd8ba152287f3a34359dc1f4c* Add comments for multidim tensor factory limitations, and rename ListInitTensor for better clarity (#26756)Summary:This PR includes the following improvements:1. Add comments for limitations of the multidim tensor factory function `torch::tensor(...)`, noting the fact that `torch::tensor({})` and mixed data type such as `torch::tensor({{bool, 2.0}})` are not supported at the moment. (I will also update https://pytorch.org/cppdocs/notes/tensor_creation.html to include usage examples for the multidim tensor factory function `torch::tensor(...)`)2. Rename `ListInitTensor` to `InitListTensor`, for better naming consistency.This addresses reviews in https://github.com/pytorch/pytorch/pull/26210. I will work on a separate PR to move the factory function to `at::`.Pull Request resolved: https://github.com/pytorch/pytorch/pull/26756Differential Revision: D17560136Pulled By: yf225fbshipit-source-id: eb8b45226e999784da48f75cc8953a998582df99* rename caffe2::mobile_threadpool to caffe2::mobile_pthreadpoolSummary:Rename old mobile_threadpool() API, replace it with a new version thatreturns caffe2::ThreadPool instead of pthreadpool_t.Test Plan: - buildsDifferential Revision: D17543413Pulled By: ljk53fbshipit-source-id: a3effd24e8ce9d677a2a04ebe6b6e1582e6f0a65* Improve error message in IR parser when accessing undefined variable.Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26771Test Plan: Imported from OSSDifferential Revision: D17562853Pulled By: ZolotukhinMfbshipit-source-id: b4d4bc6001e3ea06f4d1b8691ad2a339a04c16ea* Handle DeQuantStub() for QAT (#26518)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26518Skip Dequantize() modules for QAT alone. 
For fake quant insertion, DeQuantize() is a no-op and we should not be inserting fake-quant.ghstack-source-id: 90704220Test Plan:buck test caffe2/test:quantization -- --print-passing-detailsTests in test_quantization pass with changes:Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/281475121296989Summary (total time 73.03s): PASS: 28 FAIL: 0 SKIP: 0 FATAL: 0 TIMEOUT: 0 OMIT: 0Differential Revision: D17439333fbshipit-source-id: f716c23500324ae08c8d104ee2c9587fa6926571* Add <cinttypes> include to resolve PRIu32 macro (#26745)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26745This file doesn't appear to be included by default on GCC 7.3 andcauses compilation to fail. Adding this include fixes compilation.Test Plan: Imported from OSSDifferential Revision: D17566444Pulled By: pieternfbshipit-source-id: 9afb3d4596e424efc5a6ea6ab3b1cffdb2b41fbb* Fake quantization enhancements for QAT/PTQ support (#26420)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26420Flags for enabling/disabling observer and fake quant independently. Improve repr for fake quant.ghstack-source-id: 90704254Test Plan:buck test caffe2/test:fake_quant -- --print-passing-detailsbuck test caffe2/test:quantization -- --print-passing-detailsDifferential Revision: D17458232fbshipit-source-id: f44380c60f1a10a8ea09bca8ab79ba5d1867ed62* Revert D17458232: Fake quantization enhancements for QAT/PTQ supportTest Plan: revert-hammerDifferential Revision:D17458232Original commit changeset: f44380c60f1afbshipit-source-id: 64a244c720b61fa912bacbb23fcbf9faed0757c2* Named tensor support for: atan2, output_nr, detach{_}, requires_grad_ (#26543)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26543Also adds a test for logical_xor (it already had named tensor supportbut there was no test)Test Plan: - [namedtensor ci]Differential Revision: D17501403Pulled By: zou3519fbshipit-source-id: 49be15580be9fb520e25a8020164e5a599d22d40* Update ONNX Export for Interpolate in Opset 11 (#26778)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26778- Add support for linear and cubic interpolate in opset 11.- Add support for 1d and 3d interpolate in nearest mode for opset 7 and 8.- Add tests for all cases of interpolate in ORT tests (nearest/linear/cubic, 1d/2d/3d, upsample/downsample).Original PR resolved: https://github.com/pytorch/pytorch/pull/24805Reviewed By: hl475Differential Revision: D17564911Pulled By: houseroadfbshipit-source-id: 591e1f5b361854ace322eca1590f8f84d29c1a5d* Support Negative Axis in Size in ONNX (#26436)Summary:Currently, we export invalid ONNX models when size() is used with a negative dim.This PR fixes the issue and allows exporting these models to ONNX (ex: input.size(-1)).Pull Request resolved: https://github.com/pytorch/pytorch/pull/26436Reviewed By: hl475Differential Revision: D17565905Pulled By: houseroadfbshipit-source-id: 036bc384b25de77506ef9fbe24ceec0f7e3cff8b* Expose a torch.result_type and simplify tensor iteratorSummary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26012Test Plan: Imported from OSSDifferential Revision: D17556197Pulled By: nairbvfbshipit-source-id: c0be3ac9e99fecc26a181e301defc1942bc6708c* Named tensor support for logsumexp, mode, kthvalue, median, min, max (#26563)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26563This adds name inference rules for pre-existing logsumexp, mode,kthvalue, and median ops. 
Also adds overloads so that they can take`Dimname` dimensions.There are a lot of min/max overloads. This PR adds name inference tothe following overloads for (both) min and max:- min(Tensor, int dim)- min(Tensor, Dimname dim)- min(Tensor) (full reduction)Test Plan: - new tests and [namedtensor ci]Differential Revision: D17557050Pulled By: zou3519fbshipit-source-id: a099a0ef04ad90d021a38a0668fc44902e1c7171* Delete backwards compatibility Backend overload for registerOp (#25914)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/25914Signed-off-by: Edward Z. Yang <ezyang@fb.com>Test Plan: Imported from OSSDifferential Revision: D17284083Pulled By: ezyangfbshipit-source-id: 430ac7ea2bd042b1f4bb874e53679d0fde326dec* Implement multiple dispatch in boxed c10 dispatcher (#26118)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26118Signed-off-by: Edward Z. Yang <ezyang@fb.com>Test Plan: Imported from OSSDifferential Revision: D17404367Pulled By: ezyangfbshipit-source-id: 14a16baa4b59f97182725092531a54603f3d92b8* Remove unnecessary include from TensorBody (#26360)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26360This is not just for aesthetics: this include blocks the inclusionof headers like ivalue.h from ATenDispatch.h (as it causes aninclude cycle.)Signed-off-by: Edward Z. Yang <ezyang@fb.com>Test Plan: Imported from OSSDifferential Revision: D17429163Pulled By: ezyangfbshipit-source-id: 03feb210c12bc891d95bbb5a11ffd694ec05005c* Add some missing constructors to IValue. (#26718)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26718Signed-off-by: Edward Z. Yang <ezyang@fb.com>Test Plan: Imported from OSSDifferential Revision: D17549623Pulled By: ezyangfbshipit-source-id: 8880c09d85a15b2a63dcf0c242ba6a2dd941decb* Updating submodulesSummary:GitHub commits:https://github.com/facebook/litho/commit/6668c21398a9b71f12cff9574bb8c7d8ebf93463https://github.com/pytorch/fbgemm/commit/189aebb34442a6e96bf88734a047eaae7b258195Test Plan: n/aReviewed By: yns88fbshipit-source-id: f2037290b58ac295eeb94626e172491a8526875d* Revert D17549623: Add some missing constructors to IValue.Test Plan: revert-hammerDifferential Revision:D17549623Original commit changeset: 8880c09d85a1fbshipit-source-id: 002bb1173dbcf6a1d18e1c4b84b4365f145c38dd* Hub improvements (#26723)Summary:Resubmit of https://github.com/pytorch/pytorch/pull/25980.Our old serialization was in tar (like `resnet18-5c106cde.pth` was in this format) so let's only support automatically unzip if checkpoints are zipfiles.We can still manage to get it work with tarfile, but let's delay it when there's an ask.Pull Request resolved: https://github.com/pytorch/pytorch/pull/26723Differential Revision: D17551795Pulled By: ailzhangfbshipit-source-id: 00b4e7621f1e753ca9aa07b1fe356278c6693a1e* Upgrade sleef to v3.4.0. (#26749)Summary:This reset the sleef submodule to upstream, since everything else excepta small build sanity fix<https://github.com/zdevito/sleef/commit/191f655caa25526ae226cf88dd2529265176014a>has been merged to upstream. 
The new release includes an important fixfor trigonometric functions on MacOS, which would unblock https://github.com/pytorch/pytorch/issues/26431.This should supersede https://github.com/pytorch/pytorch/issues/20536.Close https://github.com/pytorch/pytorch/issues/20536.cc colesbury resistorPull Request resolved: https://github.com/pytorch/pytorch/pull/26749Differential Revision: D17572783Pulled By: ezyangfbshipit-source-id: dd7827e8c8500a0050e3e318d184134c792d3ecc* Updating submodulesSummary:GitHub commits:https://github.com/facebook/litho/commit/5096b0ae1f5ef28bc0b948e260eb512626c6fea9https://github.com/facebook/proxygen/commit/ecd6c10ea3df82cb0d221798150a0cf1f07315c3https://github.com/facebookincubator/mvfst/commit/67abe5d0aaf42659358fa1d96a4159e5832f9c70https://github.com/facebookincubator/profilo/commit/90580f7e064c25bac9c0a1f59afb4da55f46d3cdhttps://github.com/facebookresearch/pytorch-biggraph/commit/7f98961c7b70bda098c371a8b1395f0d6ff5434chttps://github.com/pytorch/fbgemm/commit/f8da6e6e36b5970e95bf150521a1b3af844638beTest Plan: n/aReviewed By: yns88fbshipit-source-id: 60ce61531cf6d4ac8616b3986b40b423abc7de15* move more functions to InsertObserversHelper (#26773)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26773attTest Plan:ciImported from OSSDifferential Revision: D17563673fbshipit-source-id: 5a6fb4238b6886695c2d25db11fec22ebe5d0c08* autodiff changes to enable profilingSummary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25397Differential Revision: D17565747Pulled By: Krovatkinfbshipit-source-id: b772437d9e02df99db6e662cb7d1227359959bed* Lets generic tests use multiple devices (#26594)Summary:- Separates device type from default (test) device- Adds multidevice decorator- Updates generic tests to use multidevice decorator where applicableTorchXLA wants to change the default test device based on the test environment. Separating the device type and the default (test) device enables that functionality.Additionally, many existing tests only run on multiple devices and are required, as a consequence, to make CUDA-specific API calls. The multidevice decorator simplifies the existing code and limits the CUDA dependency. Eventually this should let us run multidevice tests on multiple device types.Pull Request resolved: https://github.com/pytorch/pytorch/pull/26594Test Plan: tests were manually run with the CUDA test device set to 'cuda:1'.Differential Revision: D17568910Pulled By: mruberryfbshipit-source-id: c442f748a31a970be8c21deb12a67c3b315c1128* quantized_tensor tests (#26784)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26784Previously we are using empty to generate test tensors, this PR changes the test tensors to userandint so that we can test things properlyAlso added a set_sizes_and_strides and removed .contiguous() in int_repr function to preserve theoriginal size and stridesTest Plan:python test/test_quantized_tensor.pyImported from OSSDifferential Revision: D17566575fbshipit-source-id: 89379fb09b500dd156118e6ee0709df59f169990* Refactor checked_tensor_unwrap to take DeviceType instead of Backend (#26290)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26290Fixes #26206Happily, I also can delete the dead Dense***Tensor cases, since theyare for the defunct THS backend.Signed-off-by: Edward Z. 
Yang <ezyang@fb.com>Test Plan: Imported from OSSDifferential Revision: D17404368Pulled By: ezyangfbshipit-source-id: 79d71ad40c4325c9f52d2825aceb65074d2e20e8* Use Caffe2's implementation of grouped depthwise 3x3 convolutions (#26556)Summary:Use Caffe2's implementation of grouped depthwise 3x3 convolutions instead of NNPACK.Pull Request resolved: https://github.com/pytorch/pytorch/pull/26556Test Plan:_Correctness_ - Manually check the results using the --print-output flag on speed_benchmark_torch._Performance_ - All measurements below on Pixel 2**Before**:Multi-threaded:> adb shell "./speed_benchmark_torch \> --model=./xraymobilev3.pt \> --input_dims="1,3,224,224" \> --input_type=float --warmup=5 \> --iter=25">> Main run finished. Milliseconds per iter: **876.002**. Iters per second: 1.14155Single-threaded:> adb shell "./speed_benchmark_torch \> --model=./xraymobilev3.pt \> --input_dims="1,3,224,224" \> --input_type=float --warmup=5 \> --iter=25> --caffe2_threadpool_force_inline=true">> Main run finished. Milliseconds per iter: **459.409**. Iters per second: 2.17671**After**:Multi-threaded:> adb shell "./speed_benchmark_torch \> --model=./xraymobilev3.pt \> --input_dims="1,3,224,224" \> --input_type=float --warmup=5 \> --iter=25>> Main run finished. Milliseconds per iter: **285.68**. Iters per second: 3.50042Single-threaded:> adb shell "./speed_benchmark_torch \> --model=./xraymobilev3.pt \> --input_dims="1,3,224,224" \> --input_type=float --warmup=5 \> --iter=25> --caffe2_threadpool_force_inline=true"> Main run finished. Milliseconds per iter: **278.999**. Iters per second: 3.58425>Differential Revision: D17533311Pulled By: AshkanAliabadifbshipit-source-id: 9ee8acf02b8e3e8da1922b188ed0a6459a90b67d* Port CUDA implementation of expm1 to ATen (#26598)Summary:Closes https://github.com/pytorch/pytorch/issues/24562Pull Request resolved: https://github.com/pytorch/pytorch/pull/26598Differential Revision: D17531503Pulled By: VitalyFedyuninfbshipit-source-id: 8119c796e142f073ad4e274dda1ad99344215c48* add function to get NCCL version for logging (#26583)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26583Adds a function that uses the nccl api to get the version code. Converts it to a readable version. Will beused for logging NCCL version in exception messages.Test Plan: See aboveDifferential Revision: D17473200fbshipit-source-id: 4881ed5221b397f2f967262668c2b376b6bf3c64* Remove one unnecessary copy of the output during the type promotion. (#26816)Summary:Output tensors doesn't need to be copied during type promotion as we are not using any data from them. Simple allocation gives steady 10% performance gain.BEFORE```In [1]: x = torch.randn(64, 2048, 7,7)In [2]: y = torch.randn(64, 2048, 7,7, dtype=torch.float64)In [3]: timeit x.add_(y)77.3 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)```AFTER```In [1]: x = torch.randn(64, 2048, 7,7)In [2]: y = torch.randn(64, 2048, 7,7, dtype=torch.float64)In [3]: timeit x.add_(y)68.2 ms ± 713 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)```Pull Request resolved: https://github.com/pytorch/pytorch/pull/26816Differential Revision: D17573455Pulled By: VitalyFedyuninfbshipit-source-id: 47286abce5e7e665eb61e46ae358c896e945bef2* Prepare for Cocoapods 1.3 Release (#26751)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26751### SummaryWe're going to use the AWS s3 bucket - `s3://ossci-ios` to store the release binary. To release the cocoapods, we can follow the steps below:1. 
Open a fake PR to trigger the CI job that pulls the code from the 1.3.0 tag branch and does the building and uploading.2. Verify the binary locally - Run tests on both arm64 and simulator3. Publish the cocoapods officially### Test plan- podspec lint command succeeds - `pod spec lint --verbose --allow-warnings --no-clean --use-libraries --skip-import-validation`Test Plan: Imported from OSSDifferential Revision: D17577131Pulled By: xta0fbshipit-source-id: 55fee918ecc5c4e0b6d714488a12351b4370afac* Validate Docker version in CI. (#26496)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26496It is a BAD BAD idea to deploy Docker versions which are not deployed(per ossci-job-dsl) because those versions will get GC'ed after twoweeks. At the moment, there is no verification that your Docker versionis deployed. This adds an Azure job to check this.Signed-off-by: Edward Z. Yang <ezyang@fb.com>Test Plan: Imported from OSSDifferential Revision: D17575100Pulled By: ezyangfbshipit-source-id: 8df2331c6e6899c585bc2917b55e8955908b0e4a* Fix CI docker builds (#26704)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26704nccl 2.1.15 isn't available for CUDA 10.1 and 2.4.8 isn't available for cuda 9.1 :(ghstack-source-id: 90714191Test Plan: build docker images on JenkinsDifferential Revision: D17543120fbshipit-source-id: 882c5a005a9a3ef78f9209dea9dcec1782060b25* Export baddbmm (#25738)Summary:Added ONNX export for baddbmm in opset9Pull Request resolved: https://github.com/pytorch/pytorch/pull/25738Reviewed By: hl475Differential Revision: D17565828Pulled By: houseroadfbshipit-source-id: 85f605a7b3fa4783ef4f6ced86223133c85062d5* Fix Future default constructor missing for ParallelNativeSummary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26739Test Plan: Imported from OSSDifferential Revision: D17577908Pulled By: bwastifbshipit-source-id: a09cdbd8619a926e93418a692ce859d4157f2da8* Quantized Interpolate Kernel(upsample_bilinear2d) (#26631)Summary:We implement the quantized upsample_bilinear2d case for interpolate kernel in this PR.For nhwc performance improvement:import torch, timefor dtype in [torch.qint8, torch.quint8, torch.qint32]: print('****', str(dtype), '*****') x = torch.rand(1, 56, 56, 256) q_x = torch.quantize_per_tensor(x, 0.5, 1, dtype) q_x = q_x.permute([0, 3, 1, 2]) x = x.permute([0, 3, 1, 2]) NITER = 100 s = time.time() for i in range(NITER): float_out = torch.nn.functional.interpolate(x, size=5, scale_factor=None, mode="bilinear", align_corners=True) time_per_iter_float = (time.time() - s) / NITER s = time.time() for i in range(NITER): quant_out = torch.nn.quantized.functional.interpolate(q_x, size=5, scale_factor=None, mode="bilinear", align_corners=True) time_per_iter_quant = (time.time() - s) / NITER ref_quantized = torch.quantize_per_tensor(float_out, 0.5, 1, dtype) # torch.testing.assert_allclose(ref_quantized.dequantize(), quant_out.dequantize()) print('time/iter ms (float)', 'time/iter ms (quant)', 'quant/float', sep='\t') print(time_per_iter_float * 1000, time_per_iter_quant * 1000, time_per_iter_quant / time_per_iter_float, sep='\t') bytes_float = (x.numel() + float_out.numel()) * x.element_size() bytes_quant = (q_x.numel() + quant_out.numel()) * q_x.element_size() float_bw_gbps = bytes_float / time_per_iter_float / 1e9 quant_bw_gbps = bytes_quant / time_per_iter_quant / 1e9 print('GB/s float', 'GB/s quant', sep='\t') print(float_bw_gbps, quant_bw_gbps, sep='\t')===========without nhwc handling===========**** torch.qint8 
*****time/iter ms (float) time/iter ms (quant) quant/float1.999044418334961 2.5860953330993652 1.2936657681940702GB/s float GB/s quant1.6192056416115257 0.3129103516188541**** torch.quint8 *****time/iter ms (float) time/iter ms (quant) quant/float2.02730655670166 2.6061582565307617 1.2855274639721328GB/s float GB/s quant1.596632728927902 0.3105014816242217**** torch.qint32 *****time/iter ms (float) time/iter ms (quant) quant/float2.0180463790893555 2.4047350883483887 1.1916153728010588GB/s float GB/s quant1.603959172365819 1.3460376636426636===========with nhwc handling===========**** torch.qint8 *****time/iter ms (float) time/iter ms (quant) quant/float2.0913314819335938 0.09696483612060547 0.04636512047863123GB/s float GB/s quant1.5477527249803915 8.345458337015**** torch.quint8 *****time/iter ms (float) time/iter ms (quant) quant/float2.1065664291381836 0.09959936141967773 0.04728042754408879GB/s float GB/s quant1.5365591871338384 8.124710725706763**** torch.qint32 *****time/iter ms (float) time/iter ms (quant) quant/float2.044203281402588 0.6003522872924805 0.29368521846837126GB/s float GB/s quant1.5834354779917448 5.391607675216635Pull Request resolved: https://github.com/pytorch/pytorch/pull/26631Differential Revision: D17521498Pulled By: llyfacebookfbshipit-source-id: 385ae0f77777cd8bee385cafb80e492127b7d103* Typevar matching fix + implicit conversions from Scalar to int/float (#26453)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26453Previously, schema matching would incorrectly widen typevar bindingswhen later occurrences were supertypes of earlier ones. This allowedcallsites like `floatlist.append(tensor.item())` to pass the typechecker,causing a runtime assert (issue #24856).An earlier, reverted fix (#25136) insisted on strict equality across alloccurrences of a typevar, necessitating explicit casts around Scalar-typedarguments to int- or float-typed parameters, like `tensor.item()` above.This was per the original type system design, but turned out to breakexisting user code that relied on the de facto dynamic downcast. (Theerror required a specialized list representation.)The current fix includes the prevention of typevar widening, butadds logic to insert implicit conversions from Scalar to float or intas needed to satisfy a matched schema.Test Plan: Imported from OSSDifferential Revision: D17470598Pulled By: bhosmerfbshipit-source-id: d260dbf3cd78b9c2f2229bc61afc84e1910b5659* Improve C++ maxpool and avgpool (#26521)Summary:This PR makes the following improvements:1. Add `forward_with_indices` method to all C++ MaxPool modules, to return the max indices along with the outputs. (We can't make two `forward` methods that return different types based on input, because that will break the type deduction of `torch::detail::return_type_of_forward_t`)2. Add `max_poolNd_with_indices` to `torch::nn::functional`, to be used when indices of the max values are needed. (We can't merge this with `torch::nn::functional::max_poolNd` because the return type of `max_poolNd` has to be defined statically).3. 
Improve `pretty_print` of C++ MaxPoolNd and AvgPoolNd modules to match the Python `extra_repr`.Pull Request resolved: https://github.com/pytorch/pytorch/pull/26521Differential Revision: D17507358Pulled By: yf225fbshipit-source-id: b6c0e2b27b38378cdc0c75f4bfc797b3c6b17cd9* Revert D17565828: [pytorch][PR] [ONNX] Export baddbmmTest Plan: revert-hammerDifferential Revision:D17565828Original commit changeset: 85f605a7b3fafbshipit-source-id: 7705325087d83362f71a717be880a13e9f575b37* Cuda101 upgrade (#26823)Summary:test run: https://github.com/pytorch/pytorch/issues/26732Pull Request resolved: https://github.com/pytorch/pytorch/pull/26823Reviewed By: soumithDifferential Revision: D17576095Pulled By: mingbowanfbshipit-source-id: 269cf443aea18b47bbee63996d035bc5bcd2726b* Convert TensorIterator to use function_ref, a lightweight alternative to std::function. (#26592)Summary:function_ref is pulled over from LLVM. It is to callables what StringRef is to strings.This allows it to be substantially lighter weight, particularly in code size. That comesat the cost of not being usable in situations where the callable's lifetime is shorterthan the function_ref. This means it is suitable for callback-like scenarios, but notfor situations where the callable needs to be stored. In converting TensorIterator,I only encountered one situation that required refactoring to comply with function_ref'sconstraints.In my local Release build, this reduces the size of libtorch by 4MB, from 70MB->66MB.Pull Request resolved: https://github.com/pytorch/pytorch/pull/26592Differential Revision: D17516202fbshipit-source-id: 267476891f767f4827a4d38149f70e5035c56c48* Revert D17473200: [pytorch][distributed] add function to get NCCL version for loggingTest Plan: revert-hammerDifferential Revision:D17473200Original commit changeset: 4881ed5221b3fbshipit-source-id: c5635ce89de1644d2135b657427cbd0c3af83576* Named tensor support for: all, any, bitwise_not, cumprod, cumsum, and more (#26815)Summary:Pull Request resolved: https://github.com/pytorch/pytorch/pull/26815This PR adds named tensor support for:- any, all, `bitwise_not(_)`, cumprod, cumsum, `logical_not`In addition, it adds smoke tests for a variety of tensor attributes andfns:- is_shared, is_signed- retain_grad, register_hookTest Plan: - [namedtensor ci]Differential Revision: D17575905Pulled By: zou3519fbshipit-source-id: 37bfa327e68112c5bf0f6bf1f467a527f50fa1c4* torch.load default encoding change to 'utf-8' (#26421)Summary:Default encoding when using torch.load to 'utf-8'This commit provides changes for cases where user tries to torch.loada pickled module with non-ASCII characters in the docstring asdiscussed in https://github.com/pytorch/pytorch/issues/21743. The default encoding was changed from 'ascii'to 'utf-8'. 
* torch.load default encoding change to 'utf-8' (#26421)

Summary:
Default encoding when using torch.load to 'utf-8'

This commit provides changes for cases where user tries to torch.load a pickled module with non-ASCII characters in the docstring as discussed in https://github.com/pytorch/pytorch/issues/21743. The default encoding was changed from 'ascii' to 'utf-8'. Documentation for `torch.load` was updated and two tests (loading py2 unicode module with unicode in it; error throwing when user explicitly sets wrong encoding) were written.

~~This commit provides changes for better error handling in cases where user tries to `torch.load` a pickled module with non-ASCII characters in the docstring as discussed in https://github.com/pytorch/pytorch/issues/21743.~~

Ping ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/26421

Differential Revision: D17581633

Pulled By: yf225

fbshipit-source-id: f8e77dcf7907092771149aad8ede6cfb73c21620
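For reference, a small, hedged sketch of the user-facing effect: the `encoding` keyword is forwarded to the pickle machinery and only matters for checkpoints produced by Python 2; the in-memory buffer round-trip here is just to keep the example self-contained.

```python
import io
import torch

buf = io.BytesIO()
torch.save({'note': 'héllo wörld'}, buf)  # value containing non-ASCII text
buf.seek(0)

# 'utf-8' is now the default encoding used while unpickling.
print(torch.load(buf))

buf.seek(0)
# The previous default can still be requested explicitly; for Python 3
# checkpoints like this one, the choice makes no difference.
print(torch.load(buf, encoding='ascii'))
```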
* fix to operate on cuda kernel with clang and libc++ (#25553)

Summary:
We find a bug about `std::tuple` with nvcc. In C++11, `std::tuple` constructor is constexpr in libstdc++, but is not constexpr in libc++.
https://github.com/pytorch/pytorch/blob/c36b77fcdad3d54227cf0fd51693eb57035002c0/aten/src/ATen/native/cuda/Loops.cuh#L109-L111
The lines have occurred crashes in CUDA with a message `scan failed with synchronize`. It is an error message of cuda initialization. The purpose of this PR is to fix the for loop in nvcc and libc++ by not using `std::tuple`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/25553

Differential Revision: D17582118

Pulled By: yf225

fbshipit-source-id: d6f62ed46c2415b48eb49f8a051cf3c0e7cb23ce

* Do not call cpuinfo_initialize() on other than x86 arch. (#26265)

Summary:
cpuinfo_initialize() was not implemented for s390 arch. cpuinfo calls are x86 specific to determine vector extensions AVX, AVX512 etc. Without this patch an unnecessary error log is printed in s390 arch:
Error in cpuinfo: processor architecture is not supported in cpuinfo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/26265

Differential Revision: D17452301

Pulled By: izdeby

fbshipit-source-id: 9ca485550385c26dec18aac5953c887f1ffbfb7a

* support iterables, rangevalue in list comprehensions (#26768)

Summary:
Support IterableValue expressions and rangevalue in list comprehensions. Just as with supporting list comprehensions where the expression changes the input list types, we need to correctly type the list we create and it works.

Fixes https://github.com/pytorch/pytorch/issues/26693
Fixes https://github.com/pytorch/pytorch/issues/22483

Pull Request resolved: https://github.com/pytorch/pytorch/pull/26768

Differential Revision: D17562762

Pulled By: eellison

fbshipit-source-id: 7ce8bf8605758dfd99057bc0376b4b724c4f9251
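A brief, hedged sketch of the kind of TorchScript code the list-comprehension support above enables (hypothetical function, not from the PR):

```python
import torch
from typing import List

@torch.jit.script
def squares(n: int) -> List[int]:
    # Iterating a range() value inside a list comprehension now compiles,
    # and the resulting list is typed as List[int].
    return [i * i for i in range(n)]

print(squares(5))  # [0, 1, 4, 9, 16]
```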
* Fix CUDA named tensor `copy_` (#26829)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26829

The TensorIterator loop for `copy_` uses operations that are currently unsupported by named tensors. The solution is to wrap `copy_` in a function that does the name propagation and ignore names when running the implementation of `copy_`. There is no test case because I'm not sure how to trigger the incorrect behavior, but there is definitely code in CUDA copy that doesn't support named tensors (expand_as isn't supported):
https://github.com/pytorch/pytorch/blob/aaf30cdf36839bc3f21b1622fb91ff3e2983e8ea/aten/src/ATen/native/cuda/Copy.cu#L141-L148

Test Plan: - [namedtensor ci]

Differential Revision: D17577310

Pulled By: zou3519

fbshipit-source-id: e11c52243800e1331fad738084304badcfd51ae2

* Highlighting in the doc that square root comes before adding epsilon

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26735

Test Plan: Imported from OSS

Differential Revision: D17558505

Pulled By: vincentqb

fbshipit-source-id: 36449c501f3ab3bc7cadd1f580258904b39369d4

* Bytecode export flow (#25187)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25187

The bytecode export flow: dump the bytecode format for the light weighted interpreter.
* The bytecode is generated without input spec optimization. It would be more generic (input independent) with no obvious performance degradation (to be tested).
* Main API: torch::jit::script::Module::save(filename, extra_files, bool *bytecode_format* = false).
* Both bytecode and module object are exported in pickle format.
  * The module object (in data.pkl) is the same as the original JIT model.
  * The serializer is dependent on pickle only (no protobuf or Json).
  * The major functionality is forked in ScriptModuleSerializer2::serialize().
  * The test loader is test_bc_export.cpp.
* Simple APIs are added in Code and its implementation to get necessary information (instructions, operators and constants).
* Since there's no dependency on graph/node, GetAttr is promoted from an operator to first-class instruction (https://github.com/pytorch/pytorch/pull/25151).
* Some definitions (instructions, writeArchive, etc) that are shared by full JIT and bytecode are pulled out of the local namespace (https://github.com/pytorch/pytorch/pull/25148).

The output layout looks like:
* folders of methods.
  * In each method folder (for example, forward/):
    * bytecode.pkl: instructions and operators
    * constants{.pkl,/}: constant list in constants.pkl. If there are tensors in constants, the binary tensor files in constants/ folder.
* data{.pkl,/}: the module object, with binary tensor files in data/ folder. The same as in torchscript.

Test Plan: Imported from OSS

Differential Revision: D17076411

fbshipit-source-id: 46eb298e7320d1e585b0101effc0fcfd09219046

* Move the CUDA implementation of log to ATen. (#26494)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26494

Close #24586

Test Plan: Imported from OSS

Differential Revision: D17572497

Pulled By: VitalyFedyunin

fbshipit-source-id: e1bcd33021464eaa4affd4c6d3283c8403069945

* enable double backward for non-cudnn LSTM and GRU (#26660)

Summary:
An attempt to enable double backward for non-cudnn LSTM and GRU (see https://github.com/pytorch/pytorch/issues/25315, https://github.com/pytorch/pytorch/issues/20449). RNN works already because it does not rely on fused kernels.

This does not implement double backward function itself, because that is pretty hard to spell out. Instead, it implements backward using differentiable operations, so that double backward can be done automatically.

The good: seems to work, no effect on performance on the usual case without double backward, because fused lstm backward is used.
The bad: Performance of backward and, especially, double backward, is pretty bad. Scripting would still be a preferred way if we want a performant solution. Performance and/or memory use can be slightly improved if in-place variants can be used for sigmoid_backward and tanh_backward to avoid cat in the end, but I'm not yet sure it's possible, and in any case it is only slight improvement.
The ugly: I could not figure out a way to reuse workspace that contains the sum of the gates with the applied sigmoid and tanh operations, so that's probably another perf and memory hit.

cc soumith, albanD. If you think this approach is viable, I can extend to GRU and RNN. Thanks to mcarilli whose approach to double backward in weight norm I copied.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/26660

Test Plan: added tests to check gradgrad for GRU and LSTM with cudnn disabled.

Differential Revision: D17581489

Pulled By: ngimel

fbshipit-source-id: efd204289e9a0e94d94896a0b3bff5cf6246cafa
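The double-backward support above can be exercised with `torch.autograd.gradgradcheck`; a small, hedged sketch (CPU and double precision so the non-cudnn, differentiable backward path is used; the module size and input shapes here are illustrative only):

```python
import torch
from torch.autograd import gradgradcheck

torch.manual_seed(0)
lstm = torch.nn.LSTM(input_size=3, hidden_size=4).double()
x = torch.randn(2, 1, 3, dtype=torch.double, requires_grad=True)

def fn(inp):
    out, _ = lstm(inp)
    return out

# Numerically checks second-order gradients with respect to the input.
print(gradgradcheck(fn, (x,)))  # True if the check passes
```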
* Migrate multinomial from the TH to Aten (CUDA) (#26481)

Summary:
https://github.com/pytorch/pytorch/issues/24604

Pull Request resolved: https://github.com/pytorch/pytorch/pull/26481

Differential Revision: D17489859

Pulled By: ifedan

fbshipit-source-id: 0702044c7c0f78e5e30826e8a5a83da27156bdb3

* QEngine::QNNPACK enabled, module.eval()

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26855

Test Plan: Imported from OSS

Differential Revision: D17589837

Pulled By: IvanKobzarev

fbshipit-source-id: 0084538e9b9d760a8728cdcd5723fc7fae5838c7

* Use optimized_graph in graph_executor.

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26705

Test Plan: Imported from OSS

Differential Revision: D17543281

Pulled By: ZolotukhinM

fbshipit-source-id: 91c40559aac6f2a1f77060fa28c33725a2b8e5f9

* Remove convert_to_ssa argument from runCleanupPasses - it is only used in one place.

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26703

Test Plan: Imported from OSS

Differential Revision: D17543131

Pulled By: ZolotukhinM

fbshipit-source-id: c4a209c55ac76d8472e64af79f76e9a61fd2a941

* Throw if someone tries to torch.save() quantized modules (#26828)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26828

Pickle serialization for quantized modules is currently broken by https://github.com/pytorch/pytorch/issues/24045, so let's be loud and fail if the user tries to do it

Test Plan: Imported from OSS

Differential Revision: D17579127

Pulled By: jamesr66a

fbshipit-source-id: 3deccac7e4590c6f648f22bb79c57badf3bf0487

* Fix broken failure messages for OverloadedMethodValue

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26846

Test Plan: Imported from OSS

Differential Revision: D17587050

Pulled By: jamesr66a

fbshipit-source-id: e5f3ea05b496afae15994b539f018ed0499ca62b

* Re-write of tensor-scalar quantized add

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26766

Test Plan: Imported from OSS

Differential Revision: D17587105

Pulled By: jamesr66a

fbshipit-source-id: 4da6ea98a4c5cc36fd191d9845c1ef409efce464

* Try to disable annoying hypothesis warnings again (#26853)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26853

This is the same as https://github.com/pytorch/pytorch/pull/25188 but we add a version check for if the hypothesis version is too old

Test Plan: Imported from OSS

Differential Revision: D17589086

Pulled By: jamesr66a

fbshipit-source-id: b968965719593ff989d612384e00dfb823cf0a73

* Remove three unused declarations. (#26699)

Summary:
`frac()` in `Vec256<int{16,32,64}_t>` is not overridden.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/26699

Differential Revision: D17549502

Pulled By: soumith

fbshipit-source-id: 87c65286032bfc88c447ec4eef1e3ebc73da5d27

* Fix building with PARALLEL_BACKEND=NATIVE_TBB (#26742)

Summary:
Fixing https://github.com/pytorch/pytorch/issues/26721

Pull Request resolved: https://github.com/pytorch/pytorch/pull/26742

Test Plan:
```
export USE_OPENMP=0
export USE_TBB=1
export BLAS=MKL
export MKL_THREADING=TBB
export MKLDNN_THREADING=TBB
export PARALLEL_BACKEND=NATIVE_TBB
export USE_CUDA=0
python setup.py build
```

Reviewed By: dskhudia

Differential Revision: D17586233

Pulled By: ilia-cher

fbshipit-source-id: 8e8befa6aa776b8c2b27bb4b79a3bff33dbcba7e

* Remove unnecessary functions and cleanup code in quantization.cpp.

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26852

Test Plan: Imported from OSS

Differential Revision: D17587742

Pulled By: ZolotukhinM

fbshipit-source-id: f345ea4d524fde9741d6629dec1ea8ab870e49a5

* Updating submodules

Summary:
GitHub commits:
https://github.com/pytorch/fbgemm/commit/f767351c4b85cb29f6ea07d1a3bc27d62cca5150

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: d0bfc9e5e62669ada8d56b853490a373eb8ba2f7

* Improvements to GuardElimination and InsertBailouts

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25430

Differential Revision: D17584722

Pulled By: Krovatkin

fbshipit-source-id: 9db099b904d71572c1bf3aef5419d38435cecbb5

* add mobile friendly at:parallel_for backend

Summary:
This diff implemented at::parallel_for()/parallel_reduce() and other ATen/Parallel.h APIs for mobile using caffe2::ThreadPool.

caffe2::ThreadPool doesn't support submitting individual tasks separately and running them in parallel - all tasks need to be submitted in one batch which will lock the thread pool until all of them finish - as a result we didn't wrap caffe2::ThreadPool with TaskThreadPoolBase interface and reuse at::parallel_for() implementation in ParallelNative.h. Because of this constraint, intraop_launch() / intraop_launch_future() are not supported yet.

This diff doesn't touch inter-ops pool - it's still the default native c10 thread pool. Will work on it when it's widely used.

Test Plan: - This is early draft to receive feedback. Will do more thorough tests.

Differential Revision: D17543412

Pulled By: ljk53

fbshipit-source-id: 53a3259409c7207d837b9135d87d8daa6ad15e30

* remove backward functions from jit-op-registry for mobile build (#26851)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26851

Add codegen option to remove backward ops from jit-op-registry as they are not likely to be used for inference only mobile build.

Measured ARM-v7 AAR build size change: 5,804,182 -> 5,331,219.

Test Plan: - build and integrate with demo app;

Differential Revision: D17587422

Pulled By: ljk53

fbshipit-source-id: 08c0fc7a710698a0d4baaf16bbb73cb812b1126a
* Enable batch_size = 0 support in DNNLOWP Concat operator (#26849)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26849

We were having division-by-zero errors when one of the input tensor dimensions is 0. Examples: P111481720 and P111481374

This diff adds unit tests for empty input tensors and fixes division-by-zero errors in the partition function.

Test Plan: buck test caffe2/caffe2/quantization/server:concat_dnnlowp_op_test -- --stress-runs=100

Reviewed By: jianyuh

Differential Revision: D17574566

* ……h (#26938)

Summary:
Pull Request resolved: pytorch#26938

They were accidentally removed in pytorch#26020

Test Plan: Imported from OSS

Differential Revision: D17632120

Pulled By: pbelevich

fbshipit-source-id: d62f2b5635fb4976fd4eda2f2015fdf67138a0c0
Currently, integer scalar exponents are always cast to double. This commit avoids the cast if the tensor is also integral and the scalar is positive, to speed things up.
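A rough Python-level sketch of the idea, using only integer multiplications (exponentiation by squaring): this is an illustration of why the double cast can be skipped, not the actual C++ kernel, which may choose a different multiplication strategy for small exponents.

```python
import torch

def int_pow(a: torch.Tensor, exp: int) -> torch.Tensor:
    # Sketch: compute a**exp element-wise with integer arithmetic only.
    assert exp > 0 and not a.dtype.is_floating_point
    result = torch.ones_like(a)
    base = a.clone()
    while exp:
        if exp & 1:
            result *= base
        exp >>= 1
        if exp:
            base *= base
    return result

a = torch.arange(10, dtype=torch.int64)
print(torch.equal(int_pow(a, 3), a.pow(3)))  # True
```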
Benchmark (Debian Buster, g++ 8, Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz, Debug build, Turbo turned off):
Before:
After:
Best viewed with whitespace changes turned off