To improve state-saving time for very large, sparse memories, we unroll the loop that checks for non-zero pages so that it processes a 64-byte block per iteration.
Context: Towards implementing playable quotes for PC games, I want to take whole-system snapshots quite frequently (e.g. once or more per second). I found that the scan for non-zero pages was dominating the overall time spent saving state. It looks like the code was already somewhat optimized for speed (by working with mem32 rather than mem8), so this enhancement, while somewhat ugly, might fit the surrounding code. In a microbenchmark on a VM with 128 MB of guest memory (on an M3 Max host processor), this led to a ~2x speedup for the loop in question. Deeper unrolling helps slightly more on my processor, but that is probably overspecializing for its wide cache lines (128 bytes). The code here only exploits 64-byte lines.
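For reviewers who want the shape of the idea without reading the diff, here is a minimal sketch of an unrolled non-zero page scan. It is not the code in this PR: the function name `findNonZeroPages`, the constants, and the 4096-byte page size are assumptions made for illustration, and `mem32` is assumed to be a 32-bit typed array whose length is a whole number of pages.

```javascript
// Illustrative sketch only; names and page size are assumptions, not the PR's code.
const PAGE_SIZE = 4096;                 // bytes per page (assumed)
const WORDS_PER_PAGE = PAGE_SIZE >> 2;  // 1024 32-bit words per page
const WORDS_PER_BLOCK = 16;             // 16 words * 4 bytes = one 64-byte block

// Returns the indices of pages that contain at least one non-zero word.
// Any trailing partial page is ignored in this sketch.
function findNonZeroPages(mem32) {
    const pageCount = Math.floor(mem32.length / WORDS_PER_PAGE);
    const nonZero = [];
    for (let page = 0; page < pageCount; page++) {
        const start = page * WORDS_PER_PAGE;
        const end = start + WORDS_PER_PAGE;
        // Unrolled inner loop: OR together 16 words (64 bytes) per iteration,
        // so there is one compare-and-branch per block instead of per word.
        for (let i = start; i < end; i += WORDS_PER_BLOCK) {
            const block =
                mem32[i]      | mem32[i + 1]  | mem32[i + 2]  | mem32[i + 3]  |
                mem32[i + 4]  | mem32[i + 5]  | mem32[i + 6]  | mem32[i + 7]  |
                mem32[i + 8]  | mem32[i + 9]  | mem32[i + 10] | mem32[i + 11] |
                mem32[i + 12] | mem32[i + 13] | mem32[i + 14] | mem32[i + 15];
            if (block !== 0) {
                nonZero.push(page);
                break; // the page is known to be non-zero; skip the rest of it
            }
        }
    }
    return nonZero;
}
```

The early `break` keeps densely populated pages cheap, so the unrolling mostly pays off on the all-zero pages that dominate a sparse memory.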