Fast mesh seam blending

Post process blending of meshes using indirect compute shaders

No blending Blending
Blending On
Blending Off

Idea

In a previous article we went over a technique to blend seams of meshes by mirroring pixels in post-processing.
However the basic implementation ran very slow and was not practical. In this article I'll go over how to improve the technique using indirect compute shaders.

We'll be using OpenGL for simplicity but the code is practically identical on different platforms

Recap

We will go over the steps briefly again, but the previous article goes more in depth into the base technique

Previous article

Passes

The technique relies on 3 logical passes,
First we have to find where the object seams are.

Then we have to find the exact uv offset to each closest seam, we use to mirror around after.
This is the expensive part so we should only do this on the pixels close to the seams

Image shows UV offsets to seam

Then by using these offsets to sample the other side of the seam, we can mirror the pixels.
This allows us to blend while keeping the texture detail of the object.

Outlined rock to show where the seam is; the mirror is not perfect, but that is barely noticeable in practice

no mirror mirrored pixels
Mirror Off
Mirror On


Masking

Our first step is masking where the seams are.
There are many ways to check this. To allow the most control over where the seams are, I will be using a per-object ID

As an example, you can do this using either a stencil ID or a separate output in your shading or GBuffer pass.

Preview

To find the seams we can run a basic outline shader on this:

// GLSL, Compute:

float aspect = float(u_resolution.x) / float(u_resolution.y);
vec2 directions[4] =
{
    vec2(1, 0),
    vec2(-1, 0),
    vec2(0, aspect),
    vec2(0, -aspect)
};

bool foundSeam = false;

// Scale by pixel's depth to make outlines not grow over distance
float maxRange = u_blendRadius / sceneDepth;
// Check in 4 cardinal directions for ID change
for (int i = 0; i < 4; i++)
{
    vec2 offset = directions[i] * maxRange;
    vec2 sampleUV = uv + offset;
    uint otherId = textureLod(u_id, sampleUV,0).x;
    
    // check if differing id
    if (otherId != objectId) 
    {
        foundSeam = true; // -  Found seam!
    }
}

if (foundSeam) {} // Now we can check if we found a seam

Now it might seem tempting to just return here if there is no seam and then run the expensive part of the shader.
However, this would barely improve performance, which is why we will use indirect dispatches

Indirect compute shaders

GPU warps

But why would a return statement be so slow on the GPU?
This is due to how the GPU orders tasks. (keep in mind i will be oversimplifying here)
When we run a compute dispatch on the GPU, our dispatch groups of threads are run in even bigger groups
also called warps (or waves).

Let's say we run this

layout(local_size_x = 4, local_size_y = 4) in;
void main()
{
    // Logic
}
glDispatchCompute(num_groups_x: 8, num_groups_y: 8, num_groups_z: 1);

This will run 32x32x1=1024 threads on our GPU
Let's say our GPU supports 32 threads in a warp,
That means our program will run in 32 warps of 32 threads.

Visualisation of the warps

Since these warps are executed together it has to wait for all threads in the warp to be finished before starting a next one.
This is fine when all threads are basically the same duration but will slow things down if our threads have a lot of branching.

For example, if our program looked like this:

layout(local_size_x = 1, local_size_y = 1) in;
void main()
{
    uint id = gl_GlobalInvocationID.x;
    if (id % 2 != 0) // If the id is an odd number
    {
        sleep(500ms); // example, (sleep function doesn't actually exist)
    }else {
        return;
    }
}

This means our warp will look like this

Visualisation showing warp has to wait for the longest thread

The warp takes ~500ms since it has to wait for the longest thread.

That means adding a mask to our effect doesn't help a lot if a warp contains an unmasked pixel.
Luckily there is a solution.

Indirect dispatches

Instead of putting everything in 1 compute shader, we can separate it into 2 separate dispatches.
In our first shader we just run the mask code and output the found locations in a buffer:

// GLSL, Compute:
layout(local_size_x = 8, local_size_y = 8) in;                   

// This is the buffer we will output in <changed>
layout(std430, binding = 0) buffer ArgBuffer { <changed>
    // The arguments the second pass will be dispatched with <changed>
    uint  dispatch_x; <changed>
    uint  dispatch_y; <changed>
	uint  dispatch_z; <changed>
 <changed>
    // how many locations we written <changed>
    uint  count; <changed>
    // pixel locations for second pass <changed>
    uvec2 pixels[]; <changed>
}; <changed>
/* In HLSL we would instead use
AppendStructuredBuffer<uint2> pixelLocations;
*/

void main() {
    uvec2 coord = gl_GlobalInvocationID.xy;
    // Guard for over dispatching
    if (any(greaterThanEqual(coord, u_resolution))) return;
    vec2 uv = (coord+0.5) / vec2(u_resolution);

    float rawDepth = textureLod(u_depth, uv,0).x;
    float sceneDepth = LinearizeDepth(rawDepth, u_nearfarplane.x, u_nearfarplane.y);

    uint objectId = textureLod(u_id, uv,0).x;

    float aspect = float(u_resolution.x) / float(u_resolution.y);
    vec2 directions[4] =
    {
        vec2(1, 0),
        vec2(-1, 0),
        vec2(0, aspect),
        vec2(0, -aspect)
    };

    bool foundSeam = false;

    // Scale by pixel's depth to make outlines not grow over distance
    float maxRange = u_blendRadius / sceneDepth;
    // Check in 4 cardinal directions for ID change
    for (int i = 0; i < 4; i++)
    {
        vec2 offset = directions[i] * maxRange;
        vec2 sampleUV = uv + offset;
        uint otherId = textureLod(u_id, sampleUV,0).x;
        
        // check if differing id
        if (otherId != objectId) 
        {
            // Add the found location to our buffer <changed>
            uint slot = atomicAdd(count, 1u); // Thread safe way of counting where we are in the buffer <changed>
            pixels[slot] = coord; <changed>
            return; <changed>
        }
    }
}

We can then run our second pass only on the pixels that we got from this buffer.
For this, we have to fill in the dispatch counts in our buffer,
In some APIs there are functions for this but we'll be using a small compute shader.

// GLSL, Compute:
layout(local_size_x = 1) in;

layout(std430, binding = 0) buffer ArgBuffer {
    uint  dispatch_x;
    uint  dispatch_y;
    uint  dispatch_z;
    uint  count;
    uvec2 pixels[];
};

void main() {
    // Copy the count into the dispatchSize
    // We divide by the local group size of the next pass which in my case is 64
    dispatch_x = ceil(count / 64.0);
    dispatch_y = 1u;
    dispatch_z = 1u;
}

Then we can run all this from the cpu like this:

// First clear our argumentBuffer
uint32_t resetData[4] = {0, 1, 1, 0};  // dX=0, dY=1, dZ=1, count=0
glNamedBufferSubData(buffer.meshSeamData.ssbo, 0, sizeof(resetData), resetData);

// Mask pass
glUseProgram(m_meshblendMaskProgram);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, buffer.meshSeamData.ssbo);

... /// Set uniforms

static constexpr int KERNEL_SIZE = 8;
glDispatchCompute((buffer.width + (KERNEL_SIZE - 1)) / KERNEL_SIZE,
                  (buffer.height + (KERNEL_SIZE - 1)) / KERNEL_SIZE, 1);

// Wait till arg buffer filled
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);

// Copy counter into dispatchX
glUseProgram(m_meshblendArgCountProgram);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, buffer.meshSeamData.ssbo);
glDispatchCompute(1, 1, 1);

// Wait till arg buffer filled
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT | GL_COMMAND_BARRIER_BIT);

// Blend/Search pass
glUseProgram(m_meshblendBlendProgram);

... // Set uniforms

//Bind the argumentbuffer as use for the dispatch size
glBindBuffer(GL_DISPATCH_INDIRECT_BUFFER, buffer.meshSeamData.ssbo);
glDispatchComputeIndirect(0);

Searching

Now that we know where to run our blend we have to find the actual seams
For every blended pixel we need the exact offset, we need this for 2 things.
We use double this offset as the sample location to get the mirrored side
and we need the length of this offset to know how much to blend the pixel

In the previous article we used a square search kernel, however this becomes either very slow or creates a lot of artifacts at blend sizes more than a few pixels
Instead we will be using a refining search which allows us to find the seam more accurately with fewer samples.

We sample a big circle around our pixel to find the seam direction,
then, using the direction we find, we sample a smaller circle around that direction.
We do this for a set amount of iterations


The code would then look something like this:

//GLSL Compute, Blend Pass

void main() {
    uint idx = gl_GlobalInvocationID.x;
    if (idx >= count) return;

    // We get the coords from the first pass
    uvec2 coord = pixels[idx];
    vec2 uv = (coord+0.5)/vec2(u_resolution);

    float rawDepth = textureLod(u_depth, uv,0).x;
    float sceneDepth = LinearizeDepth(rawDepth, u_nearfarplane.x, u_nearfarplane.y);

    uint objectId = textureLod(u_id, uv,0).x;
    bool selfBlendsSelf   = (objectId & (1u << 11)) != 0u;
    bool selfBlendsOthers = (objectId & (1u << 10)) != 0u;

    float maxRange = u_blendRadius / sceneDepth;

    float aspect = float(u_resolution.x) / float(u_resolution.y);

    float minDist = 9999999;
    vec2 closestSeamLocation = vec2(0, 0);

    // iteration settings
    const int refineIterations = 8;
    const int circleSteps = 16;

    // The refining search
    for (int iter = 0; iter < refineIterations; iter++) 
    {
        // Stepsize halves every iteration
        float stepSize = maxRange / pow(2, iter);

        // We use the previous iteration's offset as a starting location
        vec2 closestFound = closestSeamLocation;
        for (int i = 0; i < circleSteps; i++)
        {
            //          this is ~2*pi
            float angle = (6.283185 * (i)) / circleSteps;

            // Since this is UV space we have to device by aspect
            vec2 dir = vec2(cos(angle), sin(angle))/aspect;
            vec2 sampleOffset = closestFound + dir * stepSize;

            vec2 sampleUV = uv+sampleOffset;

            uint sampleId = textureLod(u_id, sampleUV,0).x;
            if (sampleId != objectId)
            {
                //Calculate the length of the offset,
                float dist = length(sampleOffset*vec2(aspect,1.0));
                if (dist < minDist)
                {
                    minDist = dist;
                    closestSeamLocation = offset;
                }
            }
        }
    }

Blending

Now that we have the seam offsets we can mirror and blend them.
To do this we just use double the seam offset as a sample position and lerp based on the offset length.
We then combine with a weight based on depth to make sure we only blend edges that are actually connected.

//GLSL Compute, Blend Pass

    ... /// Search loop

    // The distance from the search
    float dist = minDist; 
    
    // Color of the current pixel
    vec4 sceneColor = textureLod(u_color, uv,0);

    // Doubling the offset gives us a mirrored location
    vec2 finalOffsetUV = uv + closestSeamLocation * 2;

    vec4 otherColor = textureLod(u_color,finalOffsetUV,0);
    float otherDepth = LinearizeDepth(textureLod(u_depth, finalOffsetUV,0).x, u_nearfarplane.x, u_nearfarplane.y);

    // Difference between our depth and mirrored depth, in linear space
    float depthDiff = abs(otherDepth - sceneDepth);

    // Weight from 0.5-0 how close we are to the seam. 
    // We make this from 0.5-0 since we want to blend halfway when distance is close to zero.
    float spatialWeight = clamp(0.5 - (dist / maxRange),0.0,1.0);

    // Weight based on depth difference of the 2 blended objects
    float depthWeight = clamp(1.0 - depthDiff / (u_depthfall * u_blendRadius),0.0,1.0);
    
    // Combine the weights
    float finalWeight = spatialWeight * depthWeight;

    // Optional smooth function makes the blend a bit smoother
    finalWeight = finalWeight * finalWeight * (3 - 2 * finalWeight);
	
    imageStore(u_output, ivec2(coord), mix(sceneColor, otherColor, finalWeight));

Performance & Conclusions

Even with these optimisations, the effect is still somewhat expensive, running at around ~2-4ms at 8(refine)*16(circle) iterations,
depending on how much blends are on screen and the blend size settings. This is on my RTX 4060 Laptop GPU at 1920x1080.

But since this is post-processing it is largely independent from scene complexity, and the cost of the ID generation is very dependent on render setup.
So the amount of work this saves when kitbashing levels is, in my opinion, worth it.
And keeping the technique only in between (certain?) objects makes it so it doesn't impact artistic intention

Also keep in mind I simplified most of the code in this article, there are definitely more optimisations to be had.

Here are some more varied showcases of the effect:

Ice environment without blending Ice environment with blending
Blending On
Blending Off
Lava environment with blending Lava environment without blending
Blending On
Blending Off