I investigated this some months ago, and yes, if you extract the streams it looks exactly like this, but ...
Each stream (audio, video and subtitles) carries a timestamp for every frame, sample, subtitle, and so on. These timestamps are set during the encoding process. When you extract the streams, the timestamps are ignored, so if you compare the extracted streams they might show different offsets. But once you play the file in a media player, the demuxer "realigns" the streams and makes sure they are presented in sync according to their timestamps. That's why it is perfectly okay for streams to have different offsets, and it is fully within the specification, provided the timestamps are correct, of course (and the demuxer is working correctly).
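To make that concrete, here is a small toy model in Python. It is not a real demuxer and does not read any file; the start offsets (video at 0 ms, audio at 21 ms) and the clap at 1000 ms are made-up numbers, and the function names are hypothetical. It only shows why the same event appears shifted between extracted streams while a player that honours the timestamps presents it at one instant.

```python
# Toy model only: all values and names below are assumptions for illustration.

def position_after_extraction(event_pts_ms, stream_start_pts_ms):
    """Position of an event inside an extracted stream.

    Extraction drops the timestamps, so the first frame/sample of the
    stream lands at position 0 regardless of its original start time.
    """
    return event_pts_ms - stream_start_pts_ms


def position_in_player(event_pts_ms):
    """Position of an event when the player schedules by timestamp."""
    return event_pts_ms


VIDEO_START_MS = 0    # hypothetical first video timestamp
AUDIO_START_MS = 21   # hypothetical first audio timestamp
CLAP_PTS_MS = 1000    # the clap has the same timestamp in both streams

video_pos = position_after_extraction(CLAP_PTS_MS, VIDEO_START_MS)
audio_pos = position_after_extraction(CLAP_PTS_MS, AUDIO_START_MS)

# Comparing the extracted streams shows an apparent offset ...
apparent_skew_ms = video_pos - audio_pos

# ... but with timestamps honoured, both streams hit the clap together.
player_skew_ms = position_in_player(CLAP_PTS_MS) - position_in_player(CLAP_PTS_MS)

print("apparent skew after extraction:", apparent_skew_ms, "ms")
print("skew during playback:", player_skew_ms, "ms")
```

The apparent skew after extraction equals the difference between the two start timestamps, which is exactly the offset you see when laying the extracted streams side by side.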
So you can only compare and check this using a proper player (hardware or software), like with the good old clapperboard.
But I agree, when importing such a file into an NLE it should look correct, yes.