Enable CPU fused kernel on Windows #25578


Closed
peterjc123 wants to merge 7 commits into pytorch:master from peterjc123:cpu_fused_win

Conversation

@peterjc123 (Collaborator) commented Sep 3, 2019

No description provided.

pytorchbot added the oncall: jit (Add this issue/PR to JIT oncall triage queue) label Sep 3, 2019
peterjc123 removed the request for review from apaszke September 3, 2019 12:39
pytorchbot added caffe2 and module: build (Build system issues) labels Sep 4, 2019
@Immocat commented Sep 6, 2019

Thank you @peterjc123 for the implementation. I am writing a Unity native plugin (C++) on Windows to run neural-net inference every frame, and CPU-only is indeed much slower without this feature. I tried building the plugin with the CUDA libtorch, but Unity crashes at the exact line that does the inference (forward(input_tensor)). I wrote a simple C++ program to load the plugin lib and dll, and there it works perfectly well. I suppose the crash is related to a conflict between the libraries Unity uses and libtorch's prebuilt CUDA libraries on Windows. I found a similarly painful experience reported for OpenPose's Unity plugin, which also deals with a neural-net library + CUDA + Unity plugin setup.

So I think I will give up on the CUDA version. My question is: is this feature finished in your fork branch (peterjc123:cpu_fused_win)? I only care about the CPU-only version of libtorch on Windows.

If you've already finished it, may I try to build it from your last commit (on Windows, CPU only)?
I would appreciate it if you could share a pre-built binary, but I could also try building from your source code. Could you also link some build instructions for building libtorch CPU-only on Windows with Visual Studio + CMake? Is there any difference between your code and the master pytorch branch in terms of building libtorch CPU-only on Windows?

Thank you very much!

@peterjc123 (Collaborator, Author)

@Immocat No, it is still at an early stage. There are some difficulties I have to tackle before it can be merged into master.

  1. Finding the VS installation (plan: use vswhere and set env vars; the current method is to activate the dev env every time we call the compiler).
  2. Tempfile under Windows (plan: take code from GCC and make some adaptations; see the sketch after this list).
  3. Some other things I haven't considered yet (e.g. some util functions and OS-dependent code logic).
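
For reference, point 2 could also be done against the Win32 API directly rather than porting GCC's code. Below is a minimal sketch of such a helper; the name make_temp_file and the error handling are hypothetical, not the code in this PR:

```cpp
// Hypothetical sketch of a Windows tempfile helper for the fuser,
// built on the Win32 API. Illustrative only; not the code in this PR.
#include <windows.h>
#include <stdexcept>
#include <string>

std::string make_temp_file(const std::string& suffix) {
  char dir[MAX_PATH];
  // GetTempPathA fills in the user's temp directory (e.g. %TEMP%\).
  if (GetTempPathA(MAX_PATH, dir) == 0) {
    throw std::runtime_error("GetTempPathA failed");
  }
  char path[MAX_PATH];
  // GetTempFileNameA creates a uniquely named empty file and returns its path.
  if (GetTempFileNameA(dir, "jit", 0, path) == 0) {
    throw std::runtime_error("GetTempFileNameA failed");
  }
  // The fuser needs a specific extension (e.g. ".cpp"), so rename the file.
  std::string renamed = std::string(path) + suffix;
  if (!MoveFileA(path, renamed.c_str())) {
    throw std::runtime_error("MoveFileA failed");
  }
  return renamed;
}
```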

For the CUDA JIT fusion conflicts, maybe you could try building the static version of LibTorch. Below are the steps:

```cmd
:: Essential
set BUILD_SHARED_LIBS=OFF

:: [Optional] If you want to build with VS 2019 generator, please change the value in the next line to `Visual Studio 16 2019`.
:: Note: This value is useless if Ninja is detected. However, you can force that by using `set USE_NINJA=OFF`.
set CMAKE_GENERATOR=Visual Studio 15 2017

:: Read the content in the previous section carefully before you proceed.
:: [Optional] If you want to override the underlying toolset used by Ninja and Visual Studio with CUDA, please run the following script block.
:: "Visual Studio 2017 Developer Command Prompt" will be run automatically.
:: Make sure you have CMake >= 3.12 before you do this when you use the Visual Studio generator.
:: It's an essential step if you use Python 3.5.
set CMAKE_GENERATOR_TOOLSET_VERSION=14.11
set DISTUTILS_USE_SDK=1
for /f "usebackq tokens=*" %i in (`"%ProgramFiles(x86)%\Microsoft Visual Studio\Installer\vswhere.exe" -version [15^,16^) -products * -latest -property installationPath`) do call "%i\VC\Auxiliary\Build\vcvarsall.bat" x64 -vcvars_ver=%CMAKE_GENERATOR_TOOLSET_VERSION%

:: [Optional] If you want to override the cuda host compiler
set CUDAHOSTCXX=C:\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise\VC\Tools\MSVC\14.11.25503\bin\HostX64\x64\cl.exe

python tools\build_libtorch.py
```

pytorchbot added module: ci (Related to continuous integration) and module: pybind (Related to our Python bindings / interactions with other Python libraries) labels Sep 7, 2019
peterjc123 force-pushed the cpu_fused_win branch 2 times, most recently from 491b8cf to c4731a1, September 7, 2019 15:41
@peterjc123 (Collaborator, Author) commented Sep 7, 2019

The basic functionality is working now. However, there are still some points to improve:

  • Currently, we activate the dev env every time if it's not already activated. This is very slow, and we may need to move it to the Python frontend.
  • OpenMP is not working. It complains that "index variable in OpenMP 'for' statement must have signed integral type".
  • We should skip the tests when VS is not installed.
  • Implement something like -march=native (see the sketch after this list).
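
For context on the last point: MSVC has no -march=native, so the closest equivalent is to pick among its /arch flags (/arch:AVX, /arch:AVX2) based on CPUID at runtime. A minimal sketch, assuming MSVC's <intrin.h> intrinsics; this is illustrative, not the code in this PR:

```cpp
// Hypothetical sketch: pick an MSVC /arch flag from CPUID, roughly
// approximating GCC's -march=native. Illustrative only; a complete
// check would also verify YMM state enablement via _xgetbv(0).
#include <intrin.h>
#include <string>

std::string native_arch_flag() {
  int regs[4];
  __cpuid(regs, 1);
  const bool osxsave = (regs[2] >> 27) & 1;  // CPUID.1:ECX.OSXSAVE
  const bool avx     = (regs[2] >> 28) & 1;  // CPUID.1:ECX.AVX
  if (!(osxsave && avx)) {
    return "";  // baseline; MSVC already defaults to SSE2 on x64
  }
  __cpuidex(regs, 7, 0);
  const bool avx2 = (regs[1] >> 5) & 1;      // CPUID.7.0:EBX.AVX2
  return avx2 ? "/arch:AVX2" : "/arch:AVX";
}
```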

@ezyang (Contributor)

This is nifty stuff. Let us know if there is stuff we can do to help move it along.

@peterjc123 (Collaborator, Author)

@ezyang Could you please tell me where the jit frontend is? That is, how can I disable it on the Python side?

@ezyang (Contributor)

Could you please tell me where the jit frontend is? That is, how can I disable it on the Python side?

Are you talking about the TorchScript compiler? It's not really disableable; when you request a function to be compiled for torchscript, we recursively collect the source code reachable from it and compile it. Maybe you could tell us more about what's going on?

cc @suo for perhaps more comments

@peterjc123 (Collaborator, Author) commented Sep 10, 2019

@ezyang
Some more questions:

  1. What about using PYTORCH_JIT=0, as in this line: https://github.com/pytorch/pytorch/blob/master/torch/jit/__init__.py#L57?
  2. You called it the TorchScript compiler, so does that imply JIT fusion is not working when we do torch.jit.trace?
  3. Could you please tell me a bit more about what torch.jit.trace is currently doing on Windows?

The following is what I want to do now. First, I want to add a check for the VS env before every jit fuse call. If it is not activated, then we will try to activate it; if we cannot find it, then we will skip the fusion step. Do you know where I should add this code?

@ezyang
Copy link
Contributor

What about using PYTORCH_JIT=0, as in this line: https://github.com/pytorch/pytorch/blob/master/torch/jit/__init__.py#L57?

Ah yes, I forgot about that. That will indeed turn off JIT globally; it's meant as an easy way to turn off script if you're debugging an issue without having to edit source code.

You called it the TorchScript compiler, so does that imply JIT fusion is not working when we do torch.jit.trace?

Actually, fusion can apply to trace too. Trace versus script refers to different ways of getting the IR in question; trace means we run your program and record what happened; script means we parse the literal program text. The IR can be fused in both cases.

Could you please tell me a bit more about what torch.jit.trace is currently doing on Windows?

I am not aware of any Windows specific behavior for torch.jit.trace, and we don't seem to have any macros on MSVC that would affect this.

First, I want to add a check for the VS env before every jit fuse call. If it is not activated, then we will try to activate it; if we cannot find it, then we will skip the fusion step. Do you know where I should add this code?

For CPU, it's going to be somewhere like torch/csrc/jit/fuser/cpu/fused_kernel.cpp, probably runCompiler.
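
To illustrate the kind of check being discussed, here is a rough sketch of probing for cl.exe before fusing and falling back to the interpreter otherwise. The helper names are hypothetical, and this is not PyTorch's actual fuser code:

```cpp
// Hypothetical sketch: probe for the MSVC compiler before attempting
// CPU fusion, and skip fusion if it's unavailable.
// Illustrative only; not the actual PyTorch fuser code.
#include <cstdlib>

bool msvc_available() {
  // `where /q` returns 0 iff cl.exe is on PATH (i.e. the dev env is active).
  return std::system("where /q cl.exe") == 0;
}

bool try_fuse_kernel(/* graph, compiled-kernel out-params, ... */) {
  if (!msvc_available()) {
    // A fuller version would first try to activate the VS dev env via vswhere.
    return false;  // caller runs the unfused graph instead
  }
  // ... emit C++ source, invoke cl.exe, load the resulting DLL ...
  return true;
}
```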

pytorchbot added the module: tests (Issues related to tests (not the torch.testing module)) label Sep 11, 2019
@peterjc123 (Collaborator, Author)

@pytorchbot rebase this please

peterjc123 changed the title from [WIP] Enable CPU fused kernel on Windows to Enable CPU fused kernel on Windows, Sep 11, 2019
@peterjc123 (Collaborator, Author)

@xsacha Sure, it should be fairly easy to support clang or any other compiler, but not in this PR, and we would need a code refactor, otherwise the code will look messy. As for Android, I think it should be just the same as on desktop OSes: the interpreter runs the operators when using jit script, and for jit fusion only gcc is supported.

@xsacha (Contributor)

I'm just worried about the fact that we need a compiler on deployed systems where we do inferencing (JIT is only for inferencing, right?).
Is there an alternative, such as ahead-of-time compilation with options like -mavx, -mavx2, etc.?

@peterjc123 (Collaborator, Author)

@xsacha Yes, I agree with you that we should use some lightweight cross-platform compiler, like the llvmlite that numba uses.

@ezyang (Contributor) left a comment

This is very nice work. Inclusion of LGPL code is a blocker; we'll have to find an implementation somewhere else. I think my only other major concern is in-place mutation of environment variables in process.

Commits:
  • bug fix
  • Enable the jit tests on Windows
  • More fixes
  • Fix tempfile for Windows
  • More fixes
  • Minor fixes
  • add header
  • lint changes
  • Debugging stuff.....
  • dllexport
  • Change working dir to make git clean
  • Cleanup
  • Remove useless print
  • Fix lint
  • more lint fixes
  • More fixes
  • Fix comments.
@peterjc123 (Collaborator, Author)

@pytorchbot rebase this please


@peterjc123 (Collaborator, Author)

@ezyang Could you please take some time to review this PR?

@facebook-github-bot (Contributor) left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor)

@ezyang merged this pull request in 2ce8c83.


Reviewers

@facebook-github-bot left review comments
@ezyang approved these changes
Awaiting requested review from @yf225
Awaiting requested review from @apaszke
Awaiting requested review from @goldsborough
Awaiting requested review from @zdevito
Awaiting requested review from @suo

Assignees

No one assigned

Labels

caffe2 · Merged · module: build (Build system issues) · module: ci (Related to continuous integration) · module: pybind (Related to our Python bindings / interactions with other Python libraries) · module: tests (Issues related to tests (not the torch.testing module)) · oncall: jit (Add this issue/PR to JIT oncall triage queue)

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

7 participants

@peterjc123 @Immocat @ezyang @xsacha @facebook-github-bot @pytorchbot @mruberry
