Episode #467

From Pixels to Splats: Mastering 3D AI Character Consistency

Discover how Gaussian Splatting and 3D-to-video pipelines are revolutionizing character consistency in the age of generative AI.

Episode Details
Duration: 23:49
Pipeline: V4

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

In the latest episode of My Weird Prompts, hosts Herman Poppleberry and Corn the Sloth take a deep dive into the technical evolution of 3D modeling and its indispensable role in the 2026 generative AI landscape. The discussion was sparked by a domestic observation: their housemate Daniel has been using his smartphone to perform "digital rituals," circling household objects to create high-fidelity digital twins. While consumer-grade apps like Polycam and Luma have made 3D scanning accessible to anyone with a phone, Herman and Corn argue that the professional frontier of this technology is where the real magic happens—especially when integrated with cutting-edge video generation models.

From Point Clouds to Gaussian Splats

The conversation begins by tracing the evolution of 3D capture. Herman explains that traditional methods often relied on "Structure from Motion," a technique where software analyzes 2D images to find common points, using parallax to calculate their position in 3D space. However, the industry has largely shifted toward Gaussian Splatting. Unlike traditional meshes that represent objects as a "skin" of triangles, Gaussian Splatting represents an object as a cloud of millions of tiny, semi-transparent particles. This method is particularly effective at capturing how light interacts with surfaces, making it ideal for the matte textures and complex "fuzziness" of objects like the stuffed animals Daniel was scanning.
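
To make the "cloud of millions of tiny, semi-transparent particles" idea concrete, here is a minimal illustrative sketch in Python with NumPy of the core rendering step: a handful of 2D Gaussian blobs alpha-composited front to back onto an image plane. A real splatting renderer projects millions of anisotropic 3D Gaussians per camera view and optimizes them against photos; the sizes, colors, and depths below are invented purely for illustration.

```python
import numpy as np

# Each toy splat: 2D screen-space center, isotropic size, RGB color, opacity, depth.
# A real renderer projects millions of anisotropic 3D Gaussians per camera view;
# this sketch only illustrates the front-to-back alpha compositing at its core.
splats = [
    # (cx, cy, sigma, color, opacity, depth)
    (32.0, 32.0, 6.0, np.array([0.9, 0.2, 0.2]), 0.8, 1.0),
    (40.0, 36.0, 9.0, np.array([0.2, 0.8, 0.3]), 0.6, 2.0),
    (24.0, 40.0, 12.0, np.array([0.2, 0.3, 0.9]), 0.5, 3.0),
]

H, W = 64, 64
ys, xs = np.mgrid[0:H, 0:W].astype(np.float32)

image = np.zeros((H, W, 3), dtype=np.float32)
transmittance = np.ones((H, W), dtype=np.float32)  # how much light still gets through

# Composite near-to-far so closer splats occlude farther ones.
for cx, cy, sigma, color, opacity, _depth in sorted(splats, key=lambda s: s[5]):
    falloff = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma**2))
    alpha = np.clip(opacity * falloff, 0.0, 0.999)
    image += (transmittance * alpha)[..., None] * color
    transmittance *= 1.0 - alpha

print("rendered", image.shape, "peak value", float(image.max()))
```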

The Professional Edge: Cross-Polarization and Simultaneity

While Daniel’s smartphone scans are impressive for hobbyist work, Herman highlights the vast gulf between consumer and professional workflows. In a high-end 2026 studio, the setup involves a hybrid array of more than 110 DSLR cameras firing at exactly the same instant. That simultaneity is critical: even a millimeter of movement, a blink or a breath, can cause the mathematical reconstruction of the 3D model to fail.

Beyond the hardware, the "secret sauce" of professional photogrammetry lies in cross-polarization. By using polarizing filters on both the lights and the camera lenses, technicians can separate the "albedo" (the pure color of the object) from the "specular" (the shiny reflections). This allows artists to create a digital asset that is truly "relightable." Without this separation, reflections are "baked" into the texture, making the object look out of place when moved into a different digital environment.
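
As a rough sketch of why the two polarized captures are valuable, the specular component can be approximated by subtracting the cross-polarized frame (mostly albedo) from the parallel-polarized frame (albedo plus shine). The file names below are placeholders, and the sketch assumes both frames are pixel-aligned and roughly linear in exposure; real pipelines do this on raw, calibrated data.

```python
import numpy as np
import imageio.v3 as iio

# Hypothetical capture pair of the same object under the same lights:
#   cross.png    -> polarizers crossed: reflections cancelled, ~pure albedo
#   parallel.png -> polarizers aligned: albedo plus specular highlights
# Assumes 8-bit, pixel-aligned, roughly linear frames.
cross = iio.imread("cross.png").astype(np.float32) / 255.0
parallel = iio.imread("parallel.png").astype(np.float32) / 255.0

albedo = cross                                  # the "pure color" map
specular = np.clip(parallel - cross, 0.0, 1.0)  # what the shine added on top

iio.imwrite("albedo_map.png", (albedo * 255).astype(np.uint8))
iio.imwrite("specular_map.png", (specular * 255).astype(np.uint8))
```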

3D Assets vs. LoRA: Structure vs. Style

One of the episode's most insightful segments compares the use of 3D scans as "geometry priors" against the popular Low-Rank Adaptation (LoRA) approach. A LoRA is a lightweight fine-tuning of an AI model that teaches it the "vibe" or aesthetic of a character based on a few dozen images. While LoRAs are excellent at capturing style, they often struggle with spatial volume and physics.

Herman describes the LoRA approach as working in "latent space"—a world of statistical probabilities where the AI is essentially guessing how a character should look from a new angle. This often leads to "hallucinations" or morphing during complex movements like a backflip. In contrast, a 3D scan provides "ground truth" geometry. When a 3D model is used as a backbone for AI video models like Sora 2 Pro or Veo 3.1, the AI isn't guessing where an arm should be; it is simply "skinning" a pre-defined movement. This ensures perfect temporal consistency, solving the "wobble" that plagued early AI video.
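
For readers who want to see what a LoRA actually adds to a model, here is a minimal PyTorch sketch of the low-rank update it injects alongside a frozen weight matrix. Nothing in it encodes 3D volume; it only adapts the model's learned appearance statistics. The layer sizes are arbitrary examples, not taken from any particular model.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base layer plus a trainable low-rank update: W x + (alpha / r) * B(A x)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                  # original weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)    # A: project down to rank r
        self.up = nn.Linear(rank, base.out_features, bias=False)     # B: project back up
        nn.init.zeros_(self.up.weight)                               # starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

# Example: wrap one (hypothetical) attention projection of a diffusion model.
layer = LoRALinear(nn.Linear(768, 768), rank=8)
print(layer(torch.randn(2, 768)).shape)  # torch.Size([2, 768])
```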

The 3D-to-Video Pipeline

For creators looking to implement these insights, Herman walks through the modern 3D-to-video workflow. It begins with the scan, followed by AI-assisted "retopology" to turn the dense, messy scan geometry into a clean, efficient digital model. Next comes "rigging," the process of adding a digital skeleton, which tools like AccuRIG have now largely automated.

Once the 3D "puppet" is ready, the creator can apply motion capture data and render a simple, low-detail version of the animation. This render serves as a spatial guide for the generative AI. By providing a text prompt alongside this geometric guide, the AI can generate photorealistic textures, fur simulations, and environmental blending in a fraction of the time it would take a traditional VFX artist.
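
The guidance interfaces of proprietary video models such as Sora 2 Pro or Veo 3.1 are not public, but the same principle can be sketched per frame with open tooling: render a depth map of the animated puppet, then condition a diffusion model on it through a depth ControlNet. The model IDs, file names, and prompt below are illustrative assumptions, and a production pipeline would repeat this per frame with temporal smoothing.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Depth render of one frame of the rigged, grey-shaded puppet, exported from
# Blender/Unreal. The file path is a placeholder.
depth_frame = load_image("puppet_frame_0001_depth.png")

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")  # assumes a CUDA-capable GPU

# The depth map pins down geometry and camera; the prompt supplies the look.
frame = pipe(
    "a photorealistic stuffed sloth walking through a sunlit forest, detailed fur",
    image=depth_frame,
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
frame.save("frame_0001.png")
```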

The Hybrid Future

Ultimately, the hosts suggest that the most powerful results in 2026 come from a hybrid approach. By combining the structural reliability of a 3D scan with the fine-tuned aesthetic detail of a LoRA or IP-Adapter, creators can achieve a level of character consistency that was previously impossible for solo operators.
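
A hedged sketch of the appearance half of that hybrid: a character LoRA plus an IP-Adapter reference image layered onto a Stable Diffusion pipeline. In the full workflow these would be combined with the depth-guided pipeline sketched above; the LoRA path, reference image, and adapter weight names here are placeholders, not a recommendation of specific models.

```python
import torch
from diffusers import StableDiffusionPipeline
from diffusers.utils import load_image

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Appearance side of the hybrid: a character LoRA (your own trained weights) and
# an IP-Adapter image prompt to lock in fine details across scenes.
pipe.load_lora_weights("path/to/corn-the-sloth-lora")
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.6)

reference = load_image("corn_the_sloth_reference.png")  # placeholder reference photo
image = pipe(
    "corn the sloth sitting on a shelf, soft window light",
    ip_adapter_image=reference,
    num_inference_steps=30,
).images[0]
image.save("style_test.png")
```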

As the "barrier to entry for three-dimensional modeling crumbles," the episode serves as a reminder that while the tools are becoming easier to use, understanding the underlying physics of light and geometry remains the key to professional-grade results. Whether you are scanning a stuffed sloth or a human actor, the transition from "hallucinated physics" to "explicit geometry" is the defining shift of the current AI era.

Downloads

Episode Audio: download the full episode as an MP3 file.
Transcript (TXT): plain text transcript file.
Transcript (PDF): formatted PDF with styling.

Episode #467: From Pixels to Splats: Mastering 3D AI Character Consistency

Corn
Hey everyone, welcome back to My Weird Prompts. We are sitting here in our living room in Jerusalem, and today we have a topic that is quite literally hitting close to home. Our housemate Daniel has been running around the house lately with his phone, circling various objects like he is performing some kind of digital ritual. It turns out he is getting deep into Gaussian Splatting for a project he is working on involving some of our favorite household characters.
Herman
Herman Poppleberry here, and I have to say, I have been both a subject and a consultant for these digital rituals. Daniel has been trying to create high fidelity digital twins of us. Not the human versions, mind you, but the stuffed animal versions. Corn the Sloth and Herman the Donkey. It is actually a fascinating look at how the barrier to entry for three dimensional modeling has basically crumbled over the last few years.
Corn
It really has. I remember when doing anything with three dimensional scanning required ten thousand dollars worth of equipment and a degree in computer vision. Now, Daniel is doing it with an app on his phone while he waits for the kettle to boil. But his prompt today really pushes us to look beyond the consumer convenience. He wants to know about the professional side of things, how this integrates with the generative artificial intelligence landscape of two thousand twenty-six, and specifically, the trade-offs between using these three dimensional assets versus something like a Low-Rank Adaptation model for character consistency.
Herman
It is a meatier question than it sounds because it touches on the fundamental way we represent reality in a computer. When Daniel uses an app like Polycam or Luma on his phone, he is often using a technique called Structure from Motion to build a point cloud, or more recently, he is training a radiance field. The software looks at two dimensional images, finds common points between them, and uses the parallax to calculate where those points exist in three dimensional space. But in two thousand twenty-six, we have moved from just making meshes to creating splats, which capture the way light actually hits a surface.
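
A toy illustration of the parallax principle Herman describes: the same point photographed from two positions shifts across the frame, and that shift (the disparity) pins down its depth. Real Structure from Motion tools such as COLMAP solve for the camera poses and triangulate many thousands of points jointly; the focal length, baseline, and disparities below are invented for illustration.

```python
def depth_from_parallax(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Two-view parallax: a point seen from two camera positions shifts across the
    frame by `disparity_px` pixels; nearby points shift more than distant ones.
    Similar triangles give depth as Z = f * B / d."""
    return focal_px * baseline_m / disparity_px

# Toy numbers: a phone-like camera (focal length ~3000 px), two shots 10 cm apart.
for disparity in (150.0, 60.0, 15.0):
    z = depth_from_parallax(focal_px=3000.0, baseline_m=0.10, disparity_px=disparity)
    print(f"disparity {disparity:6.1f} px  ->  depth {z:5.2f} m")
```
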
Corn
Right, and for a stuffed animal like the physical Corn the Sloth sitting on our shelf, that works surprisingly well because it is a matte surface. It has a lot of texture for the software to grab onto. But Herman, when we talk about professional grade photogrammetry, we are not just talking about a guy walking in a circle with an iPhone. What does that world actually look like in a professional studio today?
Herman
Oh, it is a completely different beast. If you go into a high-end studio in two thousand twenty-six, you are looking at a camera rig that looks like something out of a science fiction movie. We are talking about a hybrid array of over one hundred ten digital single-lens reflex cameras. The key difference here is simultaneity and lighting. In a professional rig, all those cameras fire at the exact same millisecond. This is crucial for anything that might move, like a human face or even a piece of clothing blowing in a slight breeze. If you move even a millimeter between shots, the math for the three dimensional reconstruction starts to fail.
Corn
Which is why Daniel had an easier time scanning our stuffed animal counterparts than he would scanning us. We tend to blink or breathe, which ruins the math of the scan. But beyond just having more cameras, there is the lighting. I have heard you talk about cross-polarization before. Why is that still the secret sauce for those hyper-realistic professional scans?
Herman
That is exactly right. If you want a professional asset that can be used in a movie or a high-end game, you need to separate the color of the object, what we call the albedo, from the way it reflects light, the specular component. In a professional setup, they use polarizing filters on the lights and a different polarizing filter on the camera lenses. By rotating these filters, you can essentially cancel out all the shiny reflections. This gives you a flat, pure color map of the object. Without those reflections baked into the texture, a technical artist can then put that digital model into any lighting environment, and it will react realistically. If you just use your phone, the shine of the room is stuck on the object forever.
Corn
That makes a lot of sense. It is the difference between a picture of a thing and the thing itself. But let us bridge this to the generative artificial intelligence side of things. Daniel mentioned using these scans as assets for character generation. We have talked about the Hollywood of One concept back in episode three hundred twenty-five, the idea that a single person could produce a whole show. If Daniel has a high quality three dimensional scan of Corn the Sloth, how does he actually use that with the current video generation models like Sora Two Pro or Veo Three point One?
Herman
This is where the workflow has really evolved. A few years ago, we were mostly just using text-to-video, which was like playing a slot machine. You might get something that looks like your character, or you might get a fever dream. But now, we use these three dimensional scans as a geometry prior. Essentially, you take your three dimensional model, you animate it with a simple skeleton, what we call rigging, and then you use that as a spatial guide for the artificial intelligence. NVIDIA calls this three-dimensional-guided generative AI, and it is the backbone of professional character consistency.
Corn
So, instead of telling the AI, generate a sloth walking, you are showing the AI a low-detail video of a three dimensional sloth walking and saying, make this look like the high-resolution version from my scan?
Herman
Precisely. It is often called a three-dimensional-to-video pipeline. You are using the scan to provide the structural consistency that generative models usually struggle with. One of the biggest problems in AI video has always been temporal consistency. In episode one hundred thirty-two, we discussed how frames would often wobble or morph. But if you have a three dimensional asset as the backbone, the AI knows exactly where the arm should be at every millisecond because the geometry tells it so. It is not hallucinating the movement; it is just skinning the movement.
Corn
I want to dig into that comparison Daniel asked about. He mentioned the Low-Rank Adaptation, or LoRA, approach versus this three-dimensional-to-video approach. For our listeners who might have missed some of the technical shifts, a LoRA is basically a small, lightweight fine-tuning of a model. You show the AI twenty pictures of a character, and it learns the vibe of that character. Why would someone choose a three dimensional scan over just training a LoRA?
Herman
It is really a battle of structure versus style. A LoRA is fantastic at capturing the aesthetic. It knows the exact shade of green of your fur or the specific shape of your eyes. But a LoRA does not actually understand three dimensional volume. It is working in latent space, which is basically a world of statistical probabilities. If you ask a LoRA-based model to show a character doing a backflip, it might get the colors right, but the proportions might warp because it is just guessing based on two dimensional patterns it has seen before. A three dimensional scan is explicit geometry; it is ground truth.
Corn
Right, the LoRA is hallucinating the physics, whereas the three dimensional scan is the physics.
Herman
Exactly. If you use a three dimensional model as your source, you have perfect control over the camera angle, the lighting, and the movement. You can put that model in a three dimensional engine like Unreal Engine five point five, light it perfectly, and then use the AI as a sort of hyper-advanced skinning tool. It takes the perfect perspective of the three dimensional model and adds the photorealistic textures, the fur simulation, and the environmental blending that would take a human artist months to do manually.
Corn
So, it sounds like the three dimensional approach is for when you need precision and repeatable action, while the LoRA approach is maybe better for quick, conceptual work where a bit of morphing is acceptable? Or maybe for characters that do not have a physical counterpart to scan?
Herman
I would say so, but honestly, in two thousand twenty-six, the best results come from combining them. You use the three dimensional scan to handle the movement and the depth, and you use a LoRA or an IP-Adapter to ensure the fine details, like the specific wear and tear on a character's clothes, remain consistent across different scenes. It is like having a stunt double who is a perfect three dimensional puppet and an actor who provides the soul.
Corn
That is a great analogy. But I am thinking about the equipment again. Daniel is using his phone, but he is getting frustrated with the limitations. If he wanted to step up without spending fifty thousand dollars on a camera array, what is the middle ground? I remember we touched on some of the home-grown technical challenges of video AI in episode fifty-five. Is there a middle ground for the prosumer today?
Herman
There absolutely is. The big shift recently has been toward Gaussian Splatting. Instead of trying to create a traditional mesh, which is like a skin made of triangles, these techniques represent an object as a cloud of millions of tiny, semi-transparent Gaussians. For someone like Daniel, he could get a high-quality mirrorless camera, maybe something like a Sony a-seven-R-six, and take a few hundred high-resolution photos. Instead of using the basic phone apps, he can feed those into a more robust processing pipeline like RealityCapture or Postshot.
Corn
And what does that give him that the phone app does not?
Herman
Dynamic range and sharpness. Phone cameras are amazing, but they have tiny sensors. They struggle with shadows and they tend to smooth out textures to hide noise. If you want to see the individual fibers on a stuffed animal, you need a larger sensor and a sharp lens. Plus, with Gaussian Splatting, you get much better handling of semi-transparent or fuzzy edges, which is perfect for a stuffed sloth. In two thousand twenty-six, we even have relightable splats, which means you can change the lighting on a splat after you have captured it.
Corn
It is interesting that even with all this automation, the human element of fine-tuning is still there. But let us talk about the integration with generative AI specifically for character assets. If I am a creator and I have my scanned model, what is the actual workflow to get it into a video? Walk me through the steps, because I think people hear three-dimensional-to-video and think it is one button.
Herman
It is definitely not one button yet, though we are getting closer with AI-assisted tools. First, you take your scan. If it is a traditional mesh, you have to perform retopology. This used to be a manual nightmare, but now we have AI tools that can auto-retopologize a dense, million-polygon scan into a clean, efficient model in seconds. Then, you rig it. Again, tools like AccuRIG have made this almost automatic. Once you have that digital skeleton, you can apply motion capture data. You could even use your own phone to record your movements and apply them to the digital Corn the Sloth.
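
For a rough idea of what the automated cleanup step looks like in code, here is a hedged Open3D sketch that reconstructs a surface from a scanned point cloud and then decimates it to a riggable triangle count. This is decimation rather than true quad retopology, and the file names are placeholders.

```python
import open3d as o3d

# Load the raw scan as a point cloud (placeholder file name) and estimate normals,
# which Poisson surface reconstruction requires.
pcd = o3d.io.read_point_cloud("sloth_scan.ply")
pcd.estimate_normals()

# Turn the point cloud into a watertight triangle mesh...
mesh, _ = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=9)

# ...then drastically reduce the triangle count so it is light enough to rig.
# (True retopology produces clean quads; decimation is the quick automated cousin.)
mesh_low = mesh.simplify_quadric_decimation(target_number_of_triangles=20000)
mesh_low.compute_vertex_normals()
o3d.io.write_triangle_mesh("sloth_lowpoly.obj", mesh_low)
```
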
Corn
I am imagining you doing motion capture for a donkey right now and it is a very entertaining thought.
Herman
I have my moments. But once you have that animated three dimensional puppet, you render out a very simple version of it. It might just be a grey, shaded model moving through space. Then, you take that video and you pass it through a generative AI model using something like a ControlNet or a spatial guidance system. You provide a prompt like, a high-resolution, photorealistic sloth walking through a sunlit forest, and you give it the video of your puppet. The AI uses the puppet as a guide for where the pixels should go, but it fills in all the beauty.
Corn
This seems like a massive advantage for consistency. If you have ten different scenes in your movie, and you use the same three dimensional puppet as your guide in every scene, the character is going to be the same size, the same shape, and move the same way every single time. That has been the holy grail for AI creators.
Herman
It really is. And the second-order effect here is that it democratizes high-end animation. Think about it. Ten years ago, if you wanted a character to interact with a physical object, like a sloth picking up a cup, you had to have an animator spend days making sure the fingers did not clip through the cup. Now, if you have a three dimensional scan of the cup and the sloth, the physics engine handles the interaction, and the AI handles the visual polish. You are skipping the most tedious parts of the pipeline.
Corn
But what about the limitations? We always like to look at the edge cases. Where does photogrammetry fail, even with professional equipment? I am thinking about things like hair, or glass, or very shiny surfaces.
Herman
You hit the nail on the head. Photogrammetry still hates anything that is transparent, reflective, or extremely thin. If you try to scan a wine glass, the software gets confused because the light is passing through it or bouncing off it from different angles. To the math, the object does not have a stable surface. However, Gaussian Splatting has made huge strides here. Because it is an appearance-based representation rather than a geometric one, it can actually capture the look of a glass bottle or a fuzzy head of hair much better than a traditional mesh can.
Corn
So, Daniel's stuffed animals are actually the perfect subjects because they are solid, opaque, and have a lot of surface detail. But if he wanted to scan his glasses, he would still be in a bit of trouble.
Herman
Exactly. He would probably end up with a blob of digital noise where the lenses should be. For professionals, they often spray reflective objects with a temporary matte powder to get the shape, and then they recreate the material properties manually in the software. It is a bit of a workaround, but it works.
Corn
I want to circle back to the LoRA versus three dimensional model debate one more time because I think there is a hidden cost to the three dimensional approach that we should mention. Training a LoRA takes some images and maybe an hour of GPU time. Creating a high-quality, rigged, and retopologized three dimensional model is a much heavier lift in terms of human hours, right?
Herman
Oh, absolutely. It is a significant investment of time. You need to know how to use three dimensional software like Blender or Maya. You need to understand weight painting and vertex groups. For a lot of casual creators, a LoRA is more than enough. But if you are trying to build a brand, or a recurring series where the character needs to perform complex actions, that upfront investment in a three dimensional asset pays off every single time you hit the render button. It is the difference between building a set for a movie and just using a green screen. One gives you much more grounded, believable results.
Corn
It feels like we are moving toward a world where the distinction between a filmmaker and a game developer is disappearing. If you are using three dimensional assets and physics engines to generate your video, you are basically making a game and just recording the output.
Herman
That is exactly what is happening. We are seeing the convergence of real-time rendering and generative AI. In a few years, I do not think we will even call it video generation. We will call it world simulation. You will have your library of scanned assets, like our house here in Jerusalem, our stuffed animals, maybe even a scan of Daniel's favorite coffee mug, and you will just arrange them in a virtual space and tell the AI what story to tell with them.
Corn
It is a bit mind-blowing when you think about the implications for memory and history too. We could scan our entire living room right now, and fifty years from now, someone could put on a headset and walk through it, not as a flat photo, but as a fully realized space.
Herman
And with the AI integration, they could even interact with it. They could ask the digital version of that coffee mug what kind of coffee Daniel used to drink. We are moving from capturing images to capturing volumes and contexts. It is what some are calling Spatial Intelligence.
Corn
Alright, let us get into some practical takeaways for Daniel and for anyone else listening who wants to dive into this. If you are starting out with consumer tools but you want to get better results, what are the three most important things to do?
Herman
First, lighting is everything. You want flat, even lighting. Avoid direct sunlight or harsh lamps that create dark shadows. A cloudy day is actually a photogrammetrist's best friend. Second, overlap is key. When you are moving around the object, you want every part of it to be in at least three different photos from slightly different angles. If you jump too far between shots, the software loses its place. And third, do not forget the bottom. People often scan the sides and top but forget to flip the object over and scan the base. If you want a full three dimensional asset, you need all three hundred sixty degrees.
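
As a back-of-the-envelope companion to the overlap advice, here is a tiny Python helper that estimates how many photos an orbit-style capture takes at a given angular step. The step sizes and ring counts are rules of thumb only, not a guarantee of a good scan.

```python
import math

def capture_plan(step_deg: float = 15.0, rings: int = 3, include_base: bool = True) -> int:
    """Rough photogrammetry shot count: orbit the object in `rings` height bands,
    one photo every `step_deg` degrees, plus an optional flipped pass for the base.
    A 10-20 degree step keeps neighbouring photos heavily overlapped."""
    per_ring = math.ceil(360 / step_deg)
    total = per_ring * rings
    if include_base:
        total += per_ring  # flip the object and repeat one ring for the underside
    return total

print(capture_plan())             # 96 photos at a 15-degree step
print(capture_plan(step_deg=10))  # denser coverage, 144 photos
```
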
Corn
And for the professional side? If someone is looking to actually integrate this into an AI workflow?
Herman
Invest time in learning the bridge between three dimensional software and AI tools. Learn how to use depth maps and normal maps as inputs for your AI generations. That is where the real power lies. Do not just ask the AI to make a video; tell the AI exactly how deep the scene is and which way the surfaces are facing. That is how you get that professional, locked-in look.
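
To illustrate the depth-and-normal-map point, here is a hedged NumPy sketch that derives a screen-space normal map from a rendered depth image via finite differences. In practice most 3D packages export normal passes directly; the file names and the single-channel depth assumption are placeholders.

```python
import numpy as np
import imageio.v3 as iio

# A depth render exported from the 3D scene (placeholder file name).
depth = iio.imread("puppet_depth.png").astype(np.float32)
if depth.ndim == 3:
    depth = depth[..., 0]        # keep one channel if the render was saved as RGB
depth /= depth.max()             # normalize to 0..1

# Screen-space depth gradients approximate the surface slope at each pixel.
dz_dy, dz_dx = np.gradient(depth)

# Build unnormalized normals, then scale each to unit length.
normals = np.dstack([-dz_dx, -dz_dy, np.ones_like(depth)])
normals /= np.linalg.norm(normals, axis=2, keepdims=True)

# Pack from [-1, 1] into the familiar 0..255 normal-map encoding and save.
iio.imwrite("puppet_normals.png", ((normals * 0.5 + 0.5) * 255).astype(np.uint8))
```
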
Corn
That is fascinating. It really shows that while the AI is doing a lot of the heavy lifting, the quality of the input is still the deciding factor. It is still garbage in, garbage out, just at a much higher level of sophistication.
Herman
Exactly. The AI is a force multiplier. If you give it a mediocre phone scan, it will give you a beautiful but slightly wonky video. If you give it a professional, cross-polarized, high-resolution scan, it will give you something that is indistinguishable from reality.
Corn
I think Daniel is going to be busy for a while. He already has the stuffed animals, now he just needs to build a hundred-camera rig in the kitchen. I am sure he will find a way to make it fit between the toaster and the microwave.
Herman
I would not put it past him. He is already halfway there with all the tripods he has been setting up.
Corn
Well, this has been a deep dive into a world that is changing literally every week. If you are listening and you have tried photogrammetry or Gaussian Splatting, or if you are using three dimensional assets in your AI art, we would love to hear about it. You can get in touch with us through the contact form at myweirdprompts.com.
Herman
And if you are enjoying these deep dives into the technical and the weird, please do us a favor and leave a review on your podcast app or on Spotify. It genuinely helps other curious minds find the show. We have been doing this for four hundred sixty-seven episodes now, and the community feedback is really what keeps us going.
Corn
It really does. You can find all our past episodes, including the ones we mentioned today about video AI and the Hollywood of One, at our website. There is an RSS feed there for subscribers too.
Herman
Thanks for joining us in our living room today. It is always a pleasure to nerd out about the future of creativity.
Corn
Definitely. We will be back next time with another prompt from Daniel or maybe even one of you. Until then, keep exploring and stay curious.
Herman
This has been My Weird Prompts. See you next time.
Corn
Bye everyone.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.

My Weird Prompts