Exporting a project at 2048 x 1152 with float32 color precision per channel.
Each CUDA thread handles 1 pixel, processing 4 floats per FMA, and threads are launched in 32 x 32 blocks.
Unfortunately, each cudaMemcpy call takes roughly 10-15 ms. Processing more frames per call would spread that transfer overhead across frames and should speed this up even more.
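To put the copy cost in perspective, here is a rough back-of-the-envelope sketch of the frame size, the launch grid, and the transfer rate implied by a 10-15 ms copy. It assumes 4 float32 channels per pixel (RGBA); the channel count is not stated above and is only suggested by "4 floats per FMA". The function names are illustrative and not part of the exporter.

```python
WIDTH, HEIGHT = 2048, 1152   # export resolution from the notes above
CHANNELS = 4                 # ASSUMPTION: RGBA, one float32 per channel
BYTES_PER_FLOAT = 4          # float32
BLOCK = 32                   # 32 x 32 threads per block, one pixel per thread


def frame_bytes(width=WIDTH, height=HEIGHT, channels=CHANNELS):
    """Size of one uncompressed frame in bytes."""
    return width * height * channels * BYTES_PER_FLOAT


def grid_dims(width=WIDTH, height=HEIGHT, block=BLOCK):
    """Number of blocks in x and y needed to cover the frame (ceiling division)."""
    return (width + block - 1) // block, (height + block - 1) // block


def effective_bandwidth_gb_s(num_bytes, copy_ms):
    """Transfer rate in GB/s (10^9 bytes/s) implied by one copy of num_bytes."""
    return num_bytes / (copy_ms / 1000) / 1e9


if __name__ == "__main__":
    size = frame_bytes()
    print("frame size:", size, "bytes =", size / 2**20, "MiB")
    print("grid:", grid_dims(), "blocks of", BLOCK * BLOCK, "threads")
    for ms in (10, 15):
        print(ms, "ms copy ->", round(effective_bandwidth_gb_s(size, ms), 2), "GB/s")
```

Under that assumption one frame is 36 MiB, the grid is 64 x 36 blocks, and a 10-15 ms copy corresponds to roughly 2.5-3.8 GB/s. Batching several frames per call would not change the per-byte rate, but it would reduce the number of fixed per-call costs.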
System One
AVX2 (i7-7700K)
[11:26:35] Frame #277: vRender: 39 us, vProcess: 41738 us, vEncoding: 19984 us, aRender: 70 us, aEncoding: 268 us, Latency: 64781 us
[11:26:35] Frame #278: vRender: 37 us, vProcess: 44431 us, vEncoding: 19885 us, aRender: 61 us, aEncoding: 12 us, Latency: 66833 us
[11:26:35] Frame #279: vRender: 33 us, vProcess: 41376 us, vEncoding: 18816 us, aRender: 62 us, aEncoding: 310 us, Latency: 62777 us
[11:26:35] Frame #280: vRender: 39 us, vProcess: 43909 us, vEncoding: 18867 us, aRender: 55 us, aEncoding: 218 us, Latency: 65696 us
[11:26:35] Frame #281: vRender: 35 us, vProcess: 43756 us, vEncoding: 20499 us, aRender: 53 us, aEncoding: 231 us, Latency: 66800 us
[11:26:35] Frame #282: vRender: 31 us, vProcess: 43390 us, vEncoding: 20808 us, aRender: 65 us, aEncoding: 315 us, Latency: 66789 us
CUDA (Quadro P2000)
[09:54:33] Frame #288: vRender: 40 us, vProcess: 18746 us, vEncoding: 22134 us, aRender: 68 us, aEncoding: 19 us, Latency: 44891 us
[09:54:33] Frame #289: vRender: 36 us, vProcess: 21211 us, vEncoding: 18160 us, aRender: 72 us, aEncoding: 322 us, Latency: 42256 us
[09:54:34] Frame #290: vRender: 36 us, vProcess: 18531 us, vEncoding: 20253 us, aRender: 59 us, aEncoding: 214 us, Latency: 41408 us
[09:54:34] Frame #291: vRender: 35 us, vProcess: 18369 us, vEncoding: 22336 us, aRender: 70 us, aEncoding: 327 us, Latency: 43288 us
[09:54:34] Frame #292: vRender: 40 us, vProcess: 17668 us, vEncoding: 18668 us, aRender: 63 us, aEncoding: 17 us, Latency: 38560 us
[09:54:34] Frame #293: vRender: 36 us, vProcess: 17704 us, vEncoding: 19705 us, aRender: 71 us, aEncoding: 327 us, Latency: 40145 us
System Two
AVX2 (i7-8700K)
[11:30:00] Frame #70: vRender: 31 us, vProcess: 36414 us, vEncoding: 16255 us, aRender: 1083 us, aEncoding: 10 us, Latency: 55197 us
[11:30:00] Frame #71: vRender: 83 us, vProcess: 40397 us, vEncoding: 15759 us, aRender: 577 us, aEncoding: 244 us, Latency: 59374 us
[11:30:00] Frame #72: vRender: 30 us, vProcess: 36319 us, vEncoding: 15735 us, aRender: 930 us, aEncoding: 357 us, Latency: 54855 us
[11:30:00] Frame #73: vRender: 1774 us, vProcess: 47668 us, vEncoding: 70102 us, aRender: 13 us, aEncoding: 245 us, Latency: 121703 us
[11:30:00] Frame #74: vRender: 34 us, vProcess: 40626 us, vEncoding: 15824 us, aRender: 610 us, aEncoding: 8 us, Latency: 58531 us
[11:30:00] Frame #75: vRender: 35 us, vProcess: 40386 us, vEncoding: 15860 us, aRender: 565 us, aEncoding: 234 us, Latency: 58775 us
CUDA (GeForce RTX 2080 Ti)
[09:28:47] Frame #1720: vRender: 30 us, vProcess: 12659 us, vEncoding: 13886 us, aRender: 893 us, aEncoding: 332 us, Latency: 29435 us
[09:28:47] Frame #1721: vRender: 38 us, vProcess: 13909 us, vEncoding: 17943 us, aRender: 894 us, aEncoding: 421 us, Latency: 35310 us
[09:28:47] Frame #1722: vRender: 39 us, vProcess: 13063 us, vEncoding: 14418 us, aRender: 558 us, aEncoding: 8 us, Latency: 30184 us
[09:28:47] Frame #1723: vRender: 32 us, vProcess: 13319 us, vEncoding: 14304 us, aRender: 12 us, aEncoding: 343 us, Latency: 29725 us
[09:28:47] Frame #1724: vRender: 51 us, vProcess: 14712 us, vEncoding: 15048 us, aRender: 653 us, aEncoding: 244 us, Latency: 33087 us
[09:28:47] Frame #1725: vRender: 30 us, vProcess: 13147 us, vEncoding: 15400 us, aRender: 570 us, aEncoding: 7 us, Latency: 30813 us
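
To compare runs, the per-frame lines above can be reduced to mean stage times. This is a minimal sketch that assumes the log format shown in these notes (`name: value us` pairs); `parse_frame` and `mean_stage` are hypothetical helper names, not part of the exporter.

```python
import re
from statistics import mean

# Matches "vProcess: 12659 us" style fields; the timestamp and frame number
# do not match because they are not followed by ": <digits> us".
FIELD = re.compile(r"(\w+): (\d+) us")


def parse_frame(line):
    """Return {stage name: microseconds} for one log line."""
    return {name: int(value) for name, value in FIELD.findall(line)}


def mean_stage(lines, stage):
    """Mean time in microseconds of one stage across the given log lines."""
    return mean(parse_frame(line)[stage] for line in lines)


# Two lines copied from the System Two CUDA run above.
sample = [
    "[09:28:47] Frame #1720: vRender: 30 us, vProcess: 12659 us, vEncoding: 13886 us, "
    "aRender: 893 us, aEncoding: 332 us, Latency: 29435 us",
    "[09:28:47] Frame #1721: vRender: 38 us, vProcess: 13909 us, vEncoding: 17943 us, "
    "aRender: 894 us, aEncoding: 421 us, Latency: 35310 us",
]

if __name__ == "__main__":
    for stage in ("vProcess", "vEncoding", "Latency"):
        print(stage, mean_stage(sample, stage), "us")
```

Feeding each block of log lines through `mean_stage` gives one number per stage and configuration. With only six frames per run, a single outlier such as Frame #73 shifts the mean noticeably, so a median may be the safer summary.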