Simulation code can be accelerated in multiple ways. When optimizing your code, you first have to understand where the time is being spent. ΦFlow includes an easy-to-use profiler that tells you which operations and functions eat up your computation time.
Enabling GPU execution may speed up your code when dealing with large tensors, especially when using the custom CUDA kernels. Switching to graph mode may also reduce computational overhead.
The integrated profiler can be used to measure the time spent on each operation, independent of which backend is being used. Additionally, it traces the function calls to let you see which high-level function calls the operations belong to.
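To profile a code block, use backend.profile():

    from phi.flow import *

    with backend.profile() as prof:
        simulation_step()  # placeholder for the code you want to profile
    prof.save_trace('Trace.json')  # write the full profile to a file
    prof.print(min_duration=1e-2)  # print entries that take more than 10 ms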
The code above stores the full profile in the trace event format and prints all operations and functions that take more than 10 ms (min_duration=1e-2) to the console.
To view the profile, open Google Chrome and go to the address
Then drag the file Trace.json into the browser window.
There you may zoom your view and click on any block to view additional information.
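A saved trace can also be inspected without a browser. The following is a minimal sketch using only the Python standard library. It assumes the file follows the generic trace event format, in which complete events carry "ph": "X" and a duration "dur" in microseconds. Whether ΦFlow writes exactly this event type is an assumption, and the helper summarize_trace as well as the sample events are hypothetical.

```python
import json
from collections import defaultdict


def summarize_trace(trace, min_duration=1e-2):
    """Sum the duration of complete events ("ph": "X") per event name.

    Returns (name, seconds) pairs, longest first, keeping only names whose
    total duration is at least `min_duration` seconds.
    """
    # The trace event format allows a bare list or an object with "traceEvents".
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for event in events:
        if event.get("ph") == "X":  # complete event with "ts" and "dur"
            totals[event["name"]] += event["dur"] * 1e-6  # microseconds -> seconds
    selected = [(name, t) for name, t in totals.items() if t >= min_duration]
    return sorted(selected, key=lambda item: item[1], reverse=True)


# Hypothetical sample data for illustration, NOT real profiler output.
sample = json.loads("""
{"traceEvents": [
  {"name": "conv", "ph": "X", "ts": 0, "dur": 30000, "pid": 0, "tid": 0},
  {"name": "conv", "ph": "X", "ts": 40000, "dur": 20000, "pid": 0, "tid": 0},
  {"name": "add", "ph": "X", "ts": 70000, "dur": 500, "pid": 0, "tid": 0}
]}
""")

summary = summarize_trace(sample)
for name, seconds in summary:
    print(f"{name}: {seconds * 1000:.1f} ms")
```

To apply this to a real file, load it with json.load(open('Trace.json')) and pass the result to summarize_trace. Note that nested events are summed independently, so the time of an inner operation is also counted in the enclosing function.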
All simulations based on ΦFlow can be computed on the GPU without transferring data back to the CPU. This requires a GPU-enabled TensorFlow or PyTorch installation; see the installation instructions.
Moving computations to the GPU can greatly increase the performance of simulation code, but there are also drawbacks. Since GPUs have many more processing cores than CPUs, highly parallel GPU operations typically finish much faster than their CPU counterparts. However, each GPU operation requires launching one or more CUDA kernels, which adds a significant overhead. This overhead is almost independent of the tensor sizes involved, so the speedup is greatest for large tensors.
Therefore, your code should be vectorized as much as possible. Instead of performing an action multiple times, stack the data along a batch dimension and perform the action once. ΦFlow tensors support arbitrary numbers of named batch dimensions with no reshaping required.
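The effect of a fixed launch overhead can be made concrete with a toy cost model. This is a sketch, not a benchmark: the function op_time and all constants are hypothetical and chosen only to show the trend described above, namely that small operations are dominated by the launch cost while large or batched operations amortize it.

```python
def op_time(n_elements, launch_overhead, per_element):
    """Time of one operation: a fixed launch cost plus work proportional to size."""
    return launch_overhead + n_elements * per_element


# Hypothetical constants in seconds, NOT measurements of any real device.
CPU = dict(launch_overhead=1e-6, per_element=1e-8)
GPU = dict(launch_overhead=1e-5, per_element=1e-10)


def speedup(n_elements):
    """Ratio of modeled CPU time to modeled GPU time for one operation."""
    return op_time(n_elements, **CPU) / op_time(n_elements, **GPU)


for n in (100, 10_000, 1_000_000):
    print(f"{n:>9} elements: modeled GPU speedup {speedup(n):.2f}x")

# Running 64 small problems one by one pays the launch cost 64 times;
# stacking them along a batch dimension pays it once.
n, batch_size = 10_000, 64
looped = batch_size * op_time(n, **GPU)
batched = op_time(batch_size * n, **GPU)
print(f"looped: {looped * 1e3:.3f} ms, batched: {batched * 1e3:.3f} ms")
```

In this model the GPU is slower than the CPU for the smallest tensor and the advantage grows with size. Real numbers depend on the hardware and backend, so use the profiler to measure your own code.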
To run your code with TensorFlow, PyTorch or Jax, select the corresponding backend by choosing one of the following imports:
from phi.tf.flow import *
from phi.torch.flow import *
from phi.jax.flow import *
TensorFlow and Jax will use your GPU by default.
For PyTorch, call TORCH.set_default_device('GPU') and move your network to the GPU via
ΦFlow comes with a number of CUDA kernels for TensorFlow that accelerate specific operations such as grid interpolation or solving linear systems of equations. These GPU operators yield the best overall performance and are highly recommended for larger-scale simulations or training runs in 3D. To use them, download the ΦFlow sources and compile the kernels, following the installation instructions. PyTorch already comes with a fast GPU implementation of grid interpolation.
ΦFlow supports both static and dynamic execution. In graph mode, execution is usually faster, but an additional overhead is required for setting up the graph. Also, certain checks and optimizations may be skipped in graph mode.
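Whether graph mode pays off depends on how often the compiled function is called. The sketch below computes the break-even point under the simplifying assumption of a one-time setup cost and constant per-step times. The helper break_even_steps and the timings passed to it are hypothetical; measure your own values with the profiler.

```python
from math import floor


def break_even_steps(setup_time, eager_step_time, graph_step_time):
    """Smallest number of steps after which graph mode is faster overall.

    Graph total: setup_time + n * graph_step_time
    Eager total: n * eager_step_time
    Returns None if a graph step is not faster than an eager step.
    """
    saving_per_step = eager_step_time - graph_step_time
    if saving_per_step <= 0:
        return None
    # Smallest integer n with n * saving_per_step > setup_time.
    return floor(setup_time / saving_per_step) + 1


# Hypothetical timings in seconds, NOT measurements.
steps = break_even_steps(setup_time=2.0, eager_step_time=0.5, graph_step_time=0.25)
print(f"graph mode wins after {steps} steps")
```

For a function that runs only a handful of times, the setup cost may never be recovered, in which case eager execution is the simpler choice.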
There are two ways of compiling a static graph:
jit_compile() (recommended): The function phi.math.jit_compile() use the backend-specific compiler, if available, to compile a static graph for Field-valued functions, respectively.
Gradients. Computing gradients may be easier in graph mode since no special actions are required for recording the operations. In eager execution mode, gradient recording needs to be enabled using one of the following methods:
1. A with math.record_gradients(): block will enable gradient recording for both TensorFlow and PyTorch.
2. TensorFlow's GradientTape may be used directly. Retrieve TensorFlow tensors to watch using
3. PyTorch's requires_grad attribute may be set to True manually. Retrieve PyTorch tensors using
Methods 2 and 3 require special handling for non-uniform tensors. Manually iterate over the contained uniform tensors using
and watch each element using