Multi-GPU rendering

Joe24

When assigning a filter/encode operation to FFmpeg GPU1, the encode uses NVENC of GPU1, but the CUDA + copy engine of GPU0. There is bus activity on both GPUs, so this is not an OS reporting anomaly. Data is being shuffled from one GPU to the other mid-task.

When assigning filter/encode operations to both GPUs: both CUDA + copy engines of FFmpeg GPU1 remain idle.

Testing on a 2-GPU system, VoPro version 0.7.2.8, Vegas 20 build 411. GPU was assigned by setting the target GPU in the CUDA Upload filter.

NOTE: FFmpeg numbers the GPUs differently than Windows does. In my system:

Windows GPU0 = FFmpeg "GPU1"
Windows GPU1 = FFmpeg "GPU0"

You can find the FFmpeg designation for each card by the following command:

Code

./ffmpeg -f lavfi -i nullsrc -c:v h264_nvenc -gpu list -f null -

Scene files to test:

scenes.zip

PS, to run both these Scenes simultaneously, you'll probably have to patch your drivers (because there are more than 3 total output streams):

nvidia-patch/win at master · keylase/nvidia-patch

This patch removes restriction on maximum number of simultaneous NVENC video encoding sessions imposed by Nvidia to consumer-grade GPUs. - keylase/nvidia-patch

github.com

Vouk

Thanks for pointing out these edge cases. It's really helping to improve the software.

Will look into that.

Joe24

You're welcome. Thanks for putting in all the work fixing the issues. This software is already amazing, and has huge potential.

Joe24

Update: Added a 3rd graphics card into the system. The correct card GPU0/1/2 is always used for NVENC, but all other functions (CUDA and copy engines) are always performed by GPU0.

Vouk

I have already implemented something like this which you will be able to test soon.

Just want to finish the other tasks and then do a new build.

Encoders: [screenshot not preserved]

Upload filter: [screenshot not preserved]

Joe24

Naming the GPUs right there in the option menus is a great idea.

Vouk

Please check out 0.7.3 and tell me if this works. (I don't have multiple GPUs in my machine.)

Joe24

Well now I feel a bit silly. It wasn't a VoukoderPro problem at all.

Removed all video nodes from a test scene, and still observed significant GPU0 copy and CUDA activity on an audio-only VoPro encode. Hmmm.

I always have GPU acceleration turned off in Vegas because it slows renders down . . . but must have enabled it for some test, and forgot to disable it again. *facepalm*

Once I turned GPU acceleration off in Vegas, the copy and CUDA activity was no longer present on GPU0. There is a light 3% load on the GPU0 3D/CUDA engine when Vegas is onscreen, but this drops to 0% when Vegas is minimized, and seems to be just normal Windows usage of the primary graphics card.

Tested renders on all 3 cards individually, and all GPU activity takes place on the correct intended card.

So there is no actual problem.

On another note . . . To my way of thinking, it would make more sense to have each node 'inherit' the GPU assignation of the previous (upstream) node in Scene Designer, rather than having to set a target GPU for every single node. If a user wanted to transfer the stream from one GPU to another at any point, couldn't they manually use Download/Upload nodes to send data from one card to another?

Joe24

As it happens, using Download/Upload to transfer data between cards doesn't work. I'm not sure it's even possible to move video data between GPUs mid-task, or even use more than one GPU with a single instance of FFmpeg. This might be an FFmpeg limitation? Tried several different ways, and couldn't get it to work.

Specifying Upload to GPU0 -> Encode on GPU1 throws an FFmpeg error:

Code

[FFmpeg:0] Could not set non-existent option 'gpu' to value '1'

Tried parallel Upload filters directly from Video Input node, uploading to 2 different cards (GPU0/1), and it throws the same type of error:

Code

[FFmpeg:0] Could not set non-existent option 'gpu' to value '0'

Using Upload to GPU0 -> Scale -> Download from GPU0 -> Upload to GPU1 -> Encode on GPU1 doesn't work either. Throws the same FFmpeg error:

Code

2023-09-07 16:41:03 (trace)    [FFmpeg:0] Calling cu->cuCtxPushCurrent(s->hwctx->cuda_ctx)
2023-09-07 16:41:03 (trace)    [FFmpeg:0] Calling cu->cuModuleUnload(s->cu_module)
2023-09-07 16:41:03 (trace)    [FFmpeg:0] Calling cu->cuCtxPopCurrent(&dummy)
2023-09-07 16:41:03 (trace)    [FFmpeg:0] Calling cu->cuCtxPushCurrent(s->hwctx->cuda_ctx)
2023-09-07 16:41:03 (trace)    [FFmpeg:0] Calling cu->cuModuleUnload(s->cu_module)
2023-09-07 16:41:03 (trace)    [FFmpeg:0] Calling cu->cuCtxPopCurrent(&dummy)
2023-09-07 16:41:03 (info)    [Router.cpp:152] Dumping system information:

  Architecture: x64
  Vendor ID: GenuineIntel
  Model name: Intel(R) Xeon(R) CPU E5-2470 v2 @ 2.40GHz
  Frequency: 2400 MHz

  Quantities:
    CPU packages : 2
    Physical CPUs: 20
    Logical CPUs : 40

  Caches:
    L1:
      Size         : 32768
      Line size    : 64
      Associativity: 8
      Type         : Data
    L2:
      Size         : 262144
      Line size    : 64
      Associativity: 8
      Type         : Unified
    L3:
      Size         : 26214400
      Line size    : 64
      Associativity: 20
      Type         : Unified

  Instruction set support:
    3D-now!: false
    MMX    : true
    SSE    : true
    SSE2   : true
    SSE3   : true
    AVX    : true

  Memory:
    Physical:
      Available: 92780867584
      Total    : 103025205248
    Virtual:
      Available: 140448929554432
      Total    : 140737488224256

  Kernel:
    Variant: Windows NT
    Version: 10.0.19041 build 3155

  OS:
    Name     : Windows NT
    Full name: Microsoft Windows 10 Pro
    Version  : 10.0.19045 build 3324

  GPUs:
    No detection methods enabled
2023-09-07 16:41:03 (info)    [Router.cpp:185] Using input node 9309cf35-7e4c-40e3-b133-4754fe21a3f3 for track #0 (video).
2023-09-07 16:41:03 (info)    [Router.cpp:186] Executing pre-init phase for track #0 (video) ...
2023-09-07 16:41:03 (debug)    [InputNode.cpp:78] Filter config: buffer@9309cf357e4c40e3b1334754fe21a3f3=width=1920:height=1080:pix_fmt=yuv420p:time_base=1001/24000:pixel_aspect=1/1,hwupload_cuda[673157b8b5c94fcbb6460873239f5e54];[673157b8b5c94fcbb6460873239f5e54]split=2[split_74972bd9eab84782948d76fe847c9664][split_21f42936466e4121b3817c10e2155724];[split_74972bd9eab84782948d76fe847c9664]scale_cuda=w=1920:h=1080:interp_algo=3:force_original_aspect_ratio=1,hwdownload,hwupload_cuda=gpu=1,format=pix_fmts=cuda,buffersink@a5bebec201e640c6940a6f7d9caca738;[split_21f42936466e4121b3817c10e2155724]scale_cuda=w=1280:h=720:interp_algo=3:force_original_aspect_ratio=1,format=pix_fmts=cuda,buffersink@bc23b883870a452fa61cea7d0d361e4c
2023-09-07 16:41:03 (info)    [Router.cpp:196] Pre-init phase for track #0 (video) succeeded.
2023-09-07 16:41:03 (info)    [Router.cpp:197] Executing init phase for track #0 (video) ...
2023-09-07 16:41:03 (trace)    [FFmpeg:0] Setting 'width' to value '1920'
2023-09-07 16:41:03 (trace)    [FFmpeg:0] Setting 'height' to value '1080'
2023-09-07 16:41:03 (trace)    [FFmpeg:0] Setting 'pix_fmt' to value 'yuv420p'
2023-09-07 16:41:03 (trace)    [FFmpeg:0] Setting 'time_base' to value '1001/24000'
2023-09-07 16:41:03 (trace)    [FFmpeg:0] Setting 'pixel_aspect' to value '1/1'
2023-09-07 16:41:03 (trace)    [FFmpeg:0] Setting 'outputs' to value '2'
2023-09-07 16:41:03 (trace)    [FFmpeg:0] Setting 'w' to value '1920'
2023-09-07 16:41:03 (trace)    [FFmpeg:0] Setting 'h' to value '1080'
2023-09-07 16:41:03 (trace)    [FFmpeg:0] Setting 'interp_algo' to value '3'
2023-09-07 16:41:03 (trace)    [FFmpeg:0] Setting 'force_original_aspect_ratio' to value '1'
2023-09-07 16:41:03 (trace)    [FFmpeg:0] Setting 'gpu' to value '1'
2023-09-07 16:41:03 (trace)    [FFmpeg:0] Setting 'pix_fmts' to value 'cuda'
2023-09-07 16:41:03 (trace)    [FFmpeg:0] Setting 'w' to value '1280'
2023-09-07 16:41:03 (trace)    [FFmpeg:0] Setting 'h' to value '720'
2023-09-07 16:41:03 (trace)    [FFmpeg:0] Setting 'interp_algo' to value '3'
2023-09-07 16:41:03 (trace)    [FFmpeg:0] Setting 'force_original_aspect_ratio' to value '1'
2023-09-07 16:41:03 (trace)    [FFmpeg:0] Setting 'pix_fmts' to value 'cuda'
2023-09-07 16:41:03 (trace)    [FFmpeg:0] Could not set non-existent option 'gpu' to value '1'
2023-09-07 16:41:03 (trace)    [FFmpeg:0] Error applying filter options
2023-09-07 16:41:03 (error)    [InputNode.cpp:104] Unable to parse filter graph.
2023-09-07 16:41:03 (error)    [Router.cpp:202] Init phase of track #0 (video) failed!
2023-09-07 16:41:03 (error)    [Router.cpp:242] Initialization failed: -11
2023-09-07 16:41:03 (info)    [VoukoderPro.cpp:502] Unable to start VoukoderPro: FFmpeg error.

It would certainly be nice to run multiple GPUs from a single Vegas render, but if this is in fact an FFmpeg limitation, I guess there's not much anybody can do about it except to run multiple Vegas instances, each with a separate VoukoderPro Scene controlling its own GPU. Unless VoPro could run multiple FFmpeg instances from the same Vegas output buffer?

Vouk

Quote from Joe24

So there is no actual problem.

Nice!! Well, you've still got the improvement of a named device dropdown instead of entering the device number.

Quote from Joe24

It would certainly be nice to run multiple GPUs from a single Vegas render, but if this is in fact an FFmpeg limitation, I guess there's not much anybody can do about it except to run multiple Vegas instances, each controlling its own GPU. Unless VoPro could run multiple FFmpeg instances from the same Vegas output buffer?

Voukoder(Pro) doesn't use an FFmpeg binary (as most other tools do); it's using its DLL variants. So for each export it creates a new instance. But you can test your filter chain with the command-line version of ffmpeg.

In line 65 of your previously posted log file you can see the text representation of the filter chain:

Code

buffer@9309cf357e4c40e3b1334754fe21a3f3=width=1920:height=1080:pix_fmt=yuv420p:time_base=1001/24000:pixel_aspect=1/1,hwupload_cuda[673157b8b5c94fcbb6460873239f5e54];[673157b8b5c94fcbb6460873239f5e54]split=2[split_74972bd9eab84782948d76fe847c9664][split_21f42936466e4121b3817c10e2155724];[split_74972bd9eab84782948d76fe847c9664]scale_cuda=w=1920:h=1080:interp_algo=3:force_original_aspect_ratio=1,hwdownload,hwupload_cuda=gpu=1,format=pix_fmts=cuda,buffersink@a5bebec201e640c6940a6f7d9caca738;[split_21f42936466e4121b3817c10e2155724]scale_cuda=w=1280:h=720:interp_algo=3:force_original_aspect_ratio=1,format=pix_fmts=cuda,buffersink@bc23b883870a452fa61cea7d0d361e4c

Just append it to the ffmpeg.exe command line using -filter_complex.

Quote from Joe24

it would make more sense to have each node 'inherit' the GPU assignation of the previous (upstream) node

Yes, and setting the encoder pixel format automatically to CUDA. I might add that later.

Joe24

Quote from Vouk

... using its DLL variants. So for each export it creates a new instance.

I don't recall ever finding a way to control more than one GPU at a time with FFmpeg command line. Usually you specify "-gpu 1" etc. at the beginning, and that's the only card targeted by the entire command line. For multiple cards, you use multiple command lines.

Am I to understand that VoukoderPro, with its FFmpeg DLL version, is under the same restrictions of 1 GPU per render/export?

Not a huge problem for what I do, but just curious. Other people do far heavier stuff.

Joe24

Seems like the duplicate assertions of the target GPU are causing problems. Directly encoding (Video Input -> Encoder) on GPU0/1/2 works. But when using (Video Input -> CUDA Upload -> Encoder), FFmpeg scrams.

This appears to be caused by the duplicate GPU assertion commands issued to FFmpeg (both Upload and Encoder nodes have options to choose a GPU in Scene Designer). These duplicate commands are not being accepted by FFmpeg. For instance, commanding an Upload to GPU2, then commanding an NVENC encode also on GPU2. FFmpeg acknowledges the first assertion, but rejects the second.

As mentioned in the previous post, I don't believe FFmpeg allows you to use the "-gpu" assignment more than once in a single instance, even if it's to the same GPU.

This only seems to affect GPUs other than GPU0. When GPU0 is used, I don't see any GPU assertion entries at all in the log file.

In this log, when attempting to run on GPU2, it looks like FFmpeg is choking on the second assertion of "-gpu 2". See log, especially lines 2 and 14:

Code

2023-09-11 15:06:04 (info)    [Router.cpp:197] Executing init phase for track #0 (video) ...
2023-09-11 15:06:04 (trace)    [FFmpeg:0] Setting 'gpu' to value '2'
2023-09-11 15:06:04 (trace)    [FFmpeg:0] Setting 'outputs' to value '2'
2023-09-11 15:06:04 (trace)    [FFmpeg:0] Setting 'w' to value '1280'
2023-09-11 15:06:04 (trace)    [FFmpeg:0] Setting 'h' to value '720'
2023-09-11 15:06:04 (trace)    [FFmpeg:0] Setting 'interp_algo' to value '3'
2023-09-11 15:06:04 (trace)    [FFmpeg:0] Setting 'force_original_aspect_ratio' to value '1'
2023-09-11 15:06:04 (trace)    [FFmpeg:0] Setting 'pix_fmts' to value 'cuda'
2023-09-11 15:06:04 (trace)    [FFmpeg:0] Setting 'w' to value '1920'
2023-09-11 15:06:04 (trace)    [FFmpeg:0] Setting 'h' to value '1080'
2023-09-11 15:06:04 (trace)    [FFmpeg:0] Setting 'interp_algo' to value '3'
2023-09-11 15:06:04 (trace)    [FFmpeg:0] Setting 'force_original_aspect_ratio' to value '1'
2023-09-11 15:06:04 (trace)    [FFmpeg:0] Setting 'pix_fmts' to value 'cuda'
2023-09-11 15:06:04 (trace)    [FFmpeg:0] Could not set non-existent option 'gpu' to value '2'
2023-09-11 15:06:04 (trace)    [FFmpeg:0] Error applying filter options
2023-09-11 15:06:04 (error)    [InputNode.cpp:104] Unable to parse filter graph.
2023-09-11 15:06:04 (error)    [Router.cpp:202] Init phase of track #0 (video) failed!
2023-09-11 15:06:04 (error)    [Router.cpp:242] Initialization failed: -11
2023-09-11 15:06:04 (info)    [VoukoderPro.cpp:502] Unable to start VoukoderPro: FFmpeg error.
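
For anyone digging through longer logs, here is a minimal Python sketch that pulls out every option FFmpeg rejected. It is not part of VoukoderPro; the function name `rejected_options` is made up for illustration, and the pattern only assumes the message wording seen in the log above.

```python
import re

# Matches the rejection message exactly as it is worded in the log above.
REJECTED = re.compile(r"Could not set non-existent option '([^']+)' to value '([^']*)'")


def rejected_options(log_text):
    """Return (line_number, option, value) for every option FFmpeg rejected."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        match = REJECTED.search(line)
        if match:
            hits.append((number, match.group(1), match.group(2)))
    return hits


# Three lines copied from the log above.
sample = """\
2023-09-11 15:06:04 (trace)    [FFmpeg:0] Setting 'gpu' to value '2'
2023-09-11 15:06:04 (trace)    [FFmpeg:0] Setting 'pix_fmts' to value 'cuda'
2023-09-11 15:06:04 (trace)    [FFmpeg:0] Could not set non-existent option 'gpu' to value '2'
"""

print(rejected_options(sample))  # [(3, 'gpu', '2')]
```

Note that the log reports the rejection only after all the "Setting ..." lines, so the line number tells you which option name failed, not which filter it belonged to.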

Attempts to encode on GPU0 run properly, but this is probably because according to the log, "-gpu 0" is never asserted:

Code

2023-09-11 15:22:50 (info)    [Router.cpp:197] Executing init phase for track #0 (video) ...
2023-09-11 15:22:50 (trace)    [FFmpeg:0] Setting 'width' to value '1920'
2023-09-11 15:22:50 (trace)    [FFmpeg:0] Setting 'height' to value '1080'
2023-09-11 15:22:50 (trace)    [FFmpeg:0] Setting 'pix_fmt' to value 'yuv420p'
2023-09-11 15:22:50 (trace)    [FFmpeg:0] Setting 'time_base' to value '1001/24000'
2023-09-11 15:22:50 (trace)    [FFmpeg:0] Setting 'pixel_aspect' to value '1/1'
2023-09-11 15:22:50 (trace)    [FFmpeg:0] Setting 'outputs' to value '2'
2023-09-11 15:22:50 (trace)    [FFmpeg:0] Setting 'w' to value '1280'
2023-09-11 15:22:50 (trace)    [FFmpeg:0] Setting 'h' to value '720'
2023-09-11 15:22:50 (trace)    [FFmpeg:0] Setting 'interp_algo' to value '3'
2023-09-11 15:22:50 (trace)    [FFmpeg:0] Setting 'force_original_aspect_ratio' to value '1'
2023-09-11 15:22:50 (trace)    [FFmpeg:0] Setting 'pix_fmts' to value 'cuda'
2023-09-11 15:22:50 (trace)    [FFmpeg:0] Setting 'w' to value '1920'
2023-09-11 15:22:50 (trace)    [FFmpeg:0] Setting 'h' to value '1080'
2023-09-11 15:22:50 (trace)    [FFmpeg:0] Setting 'interp_algo' to value '3'
2023-09-11 15:22:50 (trace)    [FFmpeg:0] Setting 'force_original_aspect_ratio' to value '1'
2023-09-11 15:22:50 (trace)    [FFmpeg:0] Setting 'pix_fmts' to value 'cuda'
2023-09-11 15:22:50 (trace)    [FFmpeg:0] w:1920 h:1080 pixfmt:yuv420p tb:1001/24000 fr:0/1 sar:1/1
2023-09-11 15:22:50 (trace)    [FFmpeg:0] Loaded lib: nvcuda.dll

Joe24

Just taking a couple steps back here and collecting a bit more data. And I triple-checked that GPU acceleration was turned off in Vegas.

Version 0.7.2.8 does use the intended GPU1/2, with no 3D/CUDA activity on any cards. However, when encoding on GPU0 (FFmpeg designation), which happens to also be Windows' primary card, 3D/CUDA sees a 93% load even with all windows minimized and GPU acceleration turned off in Vegas.

Version 0.7.4, as mentioned, cannot do any complex operations on GPUs other than GPU0. Anything that requires both a CUDA Upload node and an NVENC Encoder node fails because "gpu" is declared/assigned multiple times in the FFmpeg initialization. GPU0 still shows the same 3D/CUDA activity as in the v0.7.2.8 behavior above, and only when encoding on GPU0. When encoding on GPU1/2, there is no activity on GPU0 and no 3D/CUDA activity on any card. No CUDA filters are being used in the test scenes.

Joe24

The following may be the problem:

Quote from Vouk
Code
hwupload_cuda=gpu=1

Correct syntax should be: hwupload_cuda=1

It is in fact possible to use multiple GPUs from a single FFmpeg command line. There doesn't seem to be a lot of information out there on this topic. However, the following example uses GPU0 to perform resolution-scaling, then GPU1 and GPU2 to encode different formats (1080p, 720p).

In FFmpeg command line, assigning each NVENC encoder "-gpu 1" or "-gpu 2" is completely pointless, as the encoder always uses whatever GPU was targeted by the previous hwupload_cuda command, regardless of NVENC GPU preference in the command line. If you (nonsensically) specify separate GPUs for Upload and Encode, the Upload GPU setting overrides the Encode GPU setting, and both actions take place on the Upload GPU. So adding/removing the NVENC "-gpu" options makes no difference in function.

Using 3 GPUs is hilariously inefficient for such a small job, but the following command is a working example of FFmpeg running a complex operation using 3 GPUs at once. It uses an uncompressed AVI file as input, which seems the most similar to what VoPro is doing (no hardware decoding):

Code

ffmpeg -y -probesize 42M -analyzeduration 10 -i "d:\temp\input.avi" -filter_complex:a "[0:a]asplit [asplit1][asplit2]" -filter_complex:v "[0:v]hwupload_cuda=0, split [split1][split2], [split1]scale_cuda=1920:1080:interp_algo=4:format=yuv420p:force_original_aspect_ratio=1, hwdownload, hwupload_cuda=1 [split1scaled],[split2]scale_cuda=1280:720:interp_algo=4:format=yuv420p:force_original_aspect_ratio=1, hwdownload, hwupload_cuda=2 [split2scaled]" -map "[asplit1]" -c:a ac3 -b:a 96k -map "[split1scaled]" -c:v h264_nvenc -gpu 1 -2pass 0 -b:v 2500k -maxrate 5000k -bufsize 5000k -bluray-compat 1 -coder 1 -cq 0 -g 48 -level 4 -preset:v p7 -profile:v high -rc:v vbr -rc-lookahead 20 -tune:v hq "output_1080v4.9.mp4" -map "[asplit2]" -c:a aac -profile:a aac_main -b:a 96k -map "[split2scaled]" -c:v h264_nvenc -gpu 2 -2pass 0 -b_ref_mode:v middle -preset:v p7 -profile:v 2 -qp 30 -rc 0 -rc-lookahead 20 -tune:v hq "output_720v2.1.mp4"
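
The video part of that filter graph follows a regular pattern (upload to one GPU, split, scale each branch, download, upload to the target GPU), so it can be generated. Below is a rough Python sketch that only builds the string; it does not run FFmpeg and has not been tested on hardware. The function name `multi_gpu_scale_graph` and its parameters are hypothetical. It separates the chains with ";" (the usual filtergraph separator) rather than the commas used in the command above, and drops the `interp_algo` and `force_original_aspect_ratio` options for brevity.

```python
def multi_gpu_scale_graph(scale_gpu, targets):
    """Build a -filter_complex:v string: scale on one GPU, hand off to others.

    scale_gpu: FFmpeg device number that does the scaling.
    targets:   list of (width, height, encode_gpu, output_label) tuples.
    """
    split_labels = [f"split{i}" for i in range(1, len(targets) + 1)]

    # Upload the frames once, then split into one branch per target.
    head = (
        f"[0:v]hwupload_cuda={scale_gpu},split={len(targets)}"
        + "".join(f"[{label}]" for label in split_labels)
    )

    # Each branch: scale on the first GPU, download to system memory,
    # then upload to the GPU that should end up encoding this output.
    branches = [
        f"[{label}]scale_cuda={width}:{height}:format=yuv420p,"
        f"hwdownload,hwupload_cuda={gpu}[{out}]"
        for label, (width, height, gpu, out) in zip(split_labels, targets)
    ]
    return ";".join([head] + branches)


graph = multi_gpu_scale_graph(
    0,
    [(1920, 1080, 1, "split1scaled"), (1280, 720, 2, "split2scaled")],
)
print(graph)
```

The resulting string would go after `-filter_complex:v`, with each output label picked up by a `-map "[split1scaled]"` style option as in the command above.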

The corresponding VoPro Scene would be something like this: 3-GPU.scene.zip. This cannot yet be tested due to the current bug (as of version 0.7.4).

Vouk

Found it. For the hwupload_cuda filter, the parameter name is device, not gpu. Fixed it in 0.7.5.

You might have to edit each hwupload_cuda node, select the GPU and save the scene again to make it work.
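
If you saved the text form of an old filter chain (like the one from the log) and want to test it with command-line ffmpeg, the rename can be applied to the string directly. This is only a small Python sketch with a made-up helper name, `fix_hwupload_option`; it touches filter chain text, not scene files, whose format is not shown in this thread.

```python
import re


def fix_hwupload_option(filter_graph):
    """Rename the rejected 'gpu' option of hwupload_cuda to 'device'."""
    return re.sub(
        r"\bhwupload_cuda=gpu=(\d+)",
        r"hwupload_cuda=device=\1",
        filter_graph,
    )


# Fragment of the failing filter chain from the earlier log.
old = "scale_cuda=w=1920:h=1080,hwdownload,hwupload_cuda=gpu=1,format=pix_fmts=cuda"
print(fix_hwupload_option(old))
# scale_cuda=w=1920:h=1080,hwdownload,hwupload_cuda=device=1,format=pix_fmts=cuda
```

A plain `hwupload_cuda` without any option is left untouched.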

Joe24

Confirmed working in version 0.7.5. It is now possible to run complex filters on any of the 3 GPUs in the system, or shuffle video data back and forth between the GPUs as desired. Thanks!

Updated 3-GPU test scene (extremely inefficient, just a proof-of-concept), works in VoukoderPro 0.7.5: 3-GPU test scene for VoukoderPro 0.7.5.zip

Vouk

Great! Nice to hear!