Use CUDA to speed up encoding of >8-bit color formats

  • Exporting a 2048 x 1152 project with float32 color precision (per channel).


    Each CUDA thread handles 1 pixel, processing its 4 floats with FMA, and threads are launched in 32 x 32 blocks.
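
    For the 2048 x 1152 test frame, the launch geometry implied by one thread per pixel in 32 x 32 blocks can be worked out as below. This is only a sketch of the arithmetic; the helper `ceil_div` and the variable names are illustrative and not taken from the actual kernel code.

```python
# Launch geometry for one thread per pixel in 32 x 32 blocks.
# Names here are illustrative; they are not taken from the real kernel.

def ceil_div(n: int, d: int) -> int:
    """Round-up integer division, the usual way to size a CUDA grid."""
    return (n + d - 1) // d

width, height = 2048, 1152      # frame size from the test export
block_x, block_y = 32, 32       # block shape stated above

grid_x = ceil_div(width, block_x)
grid_y = ceil_div(height, block_y)
threads_per_block = block_x * block_y
total_threads = grid_x * grid_y * threads_per_block

print(f"grid: {grid_x} x {grid_y} blocks, {threads_per_block} threads per block")
print(f"threads launched: {total_threads}, pixels: {width * height}")
```

    Both dimensions divide evenly by 32, so no thread falls outside the frame at this resolution. Other resolutions would need a bounds check in the kernel.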


    Unfortunately, the cudaMemcpy calls take about 10-15 ms each. Processing more frames per call would speed this up even more.
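
    A rough size estimate puts those copy times in context. The sketch below assumes each copy moves one whole frame with 4 float32 channels per pixel; the channel count is inferred from the "4 floats" mentioned above, and the real buffers may differ.

```python
# Rough per-frame transfer size and the bandwidth implied by a 10-15 ms copy.
# Assumption: one copy moves a full frame of 4 float32 channels per pixel.

width, height = 2048, 1152
channels = 4            # assumed from "4 floats" per pixel
bytes_per_float = 4     # float32

frame_bytes = width * height * channels * bytes_per_float
frame_mib = frame_bytes / (1024 ** 2)

def implied_gb_per_s(copy_ms: float) -> float:
    """Effective bandwidth in GB/s (10^9 bytes) for one frame copied in copy_ms."""
    return frame_bytes / (copy_ms / 1000.0) / 1e9

print(f"frame size: {frame_bytes} bytes ({frame_mib:.0f} MiB)")
for ms in (10.0, 15.0):
    print(f"{ms:.0f} ms per copy -> about {implied_gb_per_s(ms):.2f} GB/s")
```

    Under that assumption a frame is 36 MiB, so a 10-15 ms copy corresponds to roughly 2.5 to 3.8 GB/s of effective transfer bandwidth.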

    System One

    AVX2 (i7-7700K)

    [11:26:35] Frame #277: vRender: 39 us, vProcess: 41738 us, vEncoding: 19984 us, aRender: 70 us, aEncoding: 268 us, Latency: 64781 us

    [11:26:35] Frame #278: vRender: 37 us, vProcess: 44431 us, vEncoding: 19885 us, aRender: 61 us, aEncoding: 12 us, Latency: 66833 us

    [11:26:35] Frame #279: vRender: 33 us, vProcess: 41376 us, vEncoding: 18816 us, aRender: 62 us, aEncoding: 310 us, Latency: 62777 us

    [11:26:35] Frame #280: vRender: 39 us, vProcess: 43909 us, vEncoding: 18867 us, aRender: 55 us, aEncoding: 218 us, Latency: 65696 us

    [11:26:35] Frame #281: vRender: 35 us, vProcess: 43756 us, vEncoding: 20499 us, aRender: 53 us, aEncoding: 231 us, Latency: 66800 us

    [11:26:35] Frame #282: vRender: 31 us, vProcess: 43390 us, vEncoding: 20808 us, aRender: 65 us, aEncoding: 315 us, Latency: 66789 us

    CUDA (Quadro P2000)

    [09:54:33] Frame #288: vRender: 40 us, vProcess: 18746 us, vEncoding: 22134 us, aRender: 68 us, aEncoding: 19 us, Latency: 44891 us

    [09:54:33] Frame #289: vRender: 36 us, vProcess: 21211 us, vEncoding: 18160 us, aRender: 72 us, aEncoding: 322 us, Latency: 42256 us

    [09:54:34] Frame #290: vRender: 36 us, vProcess: 18531 us, vEncoding: 20253 us, aRender: 59 us, aEncoding: 214 us, Latency: 41408 us

    [09:54:34] Frame #291: vRender: 35 us, vProcess: 18369 us, vEncoding: 22336 us, aRender: 70 us, aEncoding: 327 us, Latency: 43288 us

    [09:54:34] Frame #292: vRender: 40 us, vProcess: 17668 us, vEncoding: 18668 us, aRender: 63 us, aEncoding: 17 us, Latency: 38560 us

    [09:54:34] Frame #293: vRender: 36 us, vProcess: 17704 us, vEncoding: 19705 us, aRender: 71 us, aEncoding: 327 us, Latency: 40145 us

    System Two

    AVX2 (i7-8700K)

    [11:30:00] Frame #70: vRender: 31 us, vProcess: 36414 us, vEncoding: 16255 us, aRender: 1083 us, aEncoding: 10 us, Latency: 55197 us

    [11:30:00] Frame #71: vRender: 83 us, vProcess: 40397 us, vEncoding: 15759 us, aRender: 577 us, aEncoding: 244 us, Latency: 59374 us

    [11:30:00] Frame #72: vRender: 30 us, vProcess: 36319 us, vEncoding: 15735 us, aRender: 930 us, aEncoding: 357 us, Latency: 54855 us

    [11:30:00] Frame #73: vRender: 1774 us, vProcess: 47668 us, vEncoding: 70102 us, aRender: 13 us, aEncoding: 245 us, Latency: 121703 us

    [11:30:00] Frame #74: vRender: 34 us, vProcess: 40626 us, vEncoding: 15824 us, aRender: 610 us, aEncoding: 8 us, Latency: 58531 us

    [11:30:00] Frame #75: vRender: 35 us, vProcess: 40386 us, vEncoding: 15860 us, aRender: 565 us, aEncoding: 234 us, Latency: 58775 us

    CUDA (GeForce RTX 2080 Ti)

    [09:28:47] Frame #1720: vRender: 30 us, vProcess: 12659 us, vEncoding: 13886 us, aRender: 893 us, aEncoding: 332 us, Latency: 29435 us

    [09:28:47] Frame #1721: vRender: 38 us, vProcess: 13909 us, vEncoding: 17943 us, aRender: 894 us, aEncoding: 421 us, Latency: 35310 us

    [09:28:47] Frame #1722: vRender: 39 us, vProcess: 13063 us, vEncoding: 14418 us, aRender: 558 us, aEncoding: 8 us, Latency: 30184 us

    [09:28:47] Frame #1723: vRender: 32 us, vProcess: 13319 us, vEncoding: 14304 us, aRender: 12 us, aEncoding: 343 us, Latency: 29725 us

    [09:28:47] Frame #1724: vRender: 51 us, vProcess: 14712 us, vEncoding: 15048 us, aRender: 653 us, aEncoding: 244 us, Latency: 33087 us

    [09:28:47] Frame #1725: vRender: 30 us, vProcess: 13147 us, vEncoding: 15400 us, aRender: 570 us, aEncoding: 7 us, Latency: 30813 us
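
    To summarize the logs, the vProcess times of the six frames listed for each configuration can be averaged and compared. The values below are copied from the log lines above; six frames per configuration is a small sample, so the ratios are only indicative.

```python
# Mean vProcess time (in microseconds) per configuration, from the frames logged above.
from statistics import mean

v_process_us = {
    "System One, AVX2 (i7-7700K)": [41738, 44431, 41376, 43909, 43756, 43390],
    "System One, CUDA (Quadro P2000)": [18746, 21211, 18531, 18369, 17668, 17704],
    "System Two, AVX2 (i7-8700K)": [36414, 40397, 36319, 47668, 40626, 40386],
    "System Two, CUDA (GeForce RTX 2080 Ti)": [12659, 13909, 13063, 13319, 14712, 13147],
}

means = {name: mean(values) for name, values in v_process_us.items()}
for name, value in means.items():
    print(f"{name}: {value / 1000:.1f} ms")

# Compare the AVX2 and CUDA paths on the same system.
for system in ("System One", "System Two"):
    avx2 = next(v for k, v in means.items() if k.startswith(system + ", AVX2"))
    cuda = next(v for k, v in means.items() if k.startswith(system + ", CUDA"))
    print(f"{system}: vProcess speedup of about {avx2 / cuda:.1f}x with CUDA")
```

    On these samples the CUDA path cuts mean vProcess from about 43.1 ms to 18.7 ms on System One (roughly 2.3x) and from about 40.3 ms to 13.5 ms on System Two (roughly 3.0x), while vEncoding stays in the same range.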