Why Cuts Work: The Science and Psychology of Film Editing
Section 10 of 13

How Sound Design Affects Film Editing and Emotion

We've arrived at a crucial threshold in understanding Murch's Rule of Six — and now we have to complicate it. The framework we just examined treats editing as a hierarchy of concerns: emotion first, rhythm second, then eye-trace and spatial continuity serving those higher priorities. But there's one critical element we've discussed in isolation, never truly woven into the system. Sound.

Murch himself observed that sound constitutes half of the cinematic experience. Most editors treat it that way: as half the post-production workflow, occurring in parallel with picture editing, sometimes even after picture lock. But neuroscience suggests we've had the relationship backwards. Sound doesn't complement the image. In many contexts, it reaches the audience's emotional systems before the image does. Close your eyes during a horror film and you'll probably still be terrified. Close your ears — or imagine watching that same film on mute — and the scene loses most of its power. You can watch the monster approach, the shadows lengthen, the protagonist's face twist in realization, and feel... not very much. This tells us something essential: Murch's hierarchy, profound as it is, needs one more layer of understanding. We must ask how audio fits into the emotional-first principle, and why, practically speaking, sound often sets the emotional frame that the image then inhabits.

Why Horror Films Are Audio Arguments

Horror cinema understands the low road instinctively. Think of the shower scene in Psycho without Bernard Herrmann's shrieking strings. Alfred Hitchcock famously shot the scene first without the score — and it was effective, but not devastating. With Herrmann's music, it became one of the most viscerally terrifying sequences in film history. Hitchcock didn't think the scene needed music; Herrmann argued strenuously that it did. Herrmann was right, and the reason is neurological: those high-frequency, dissonant string attacks reach the amygdala fast, prime it for threat response, and then the visual information arrives into a brain already in a state of physiological alarm.

Contemporary horror has refined this into something almost cynical in its precision. The "jump scare" typically consists of a period of silence (priming the auditory system's hypervigilance), followed by an explosive, high-decibel, high-frequency sound burst, followed by the sudden appearance of the threatening image. The sound and the image arrive roughly simultaneously, but the emotional response — the flinch, the gasp, the elevated heart rate — is substantially driven by the audio. You can demonstrate this yourself: take a famous jump scare scene, remove the audio, and replace it with ambient room tone. The visual event remains. The scare largely disappears.
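The timing structure is concrete enough to synthesize. Below is a minimal sketch in Python, using numpy and the standard-library wave module, that builds the envelope just described: a long stretch of silence, then a sudden burst of loud, dissonant, high-frequency tones. The durations, frequencies, and output filename are illustrative choices, not values taken from any actual film.

```python
# Synthesize the jump-scare envelope described above: near-silence, then a
# sudden high-frequency, high-amplitude burst. All values are illustrative.
import wave
import numpy as np

SR = 44100  # sample rate in Hz

def tone(freq, duration, amplitude):
    """A plain sine tone; real stingers layer many dissonant partials."""
    t = np.linspace(0, duration, int(SR * duration), endpoint=False)
    return amplitude * np.sin(2 * np.pi * freq * t)

silence = np.zeros(int(SR * 4.0))  # 4 s of priming silence
burst = sum(tone(f, 0.8, 0.3) for f in (2500, 3100, 3700))  # dissonant cluster
burst *= np.linspace(1.0, 0.0, burst.size)  # instant attack, linear decay

signal = np.concatenate([silence, burst])
pcm = (np.clip(signal, -1, 1) * 32767).astype(np.int16)

with wave.open("jump_scare_sketch.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)  # 16-bit samples
    f.setframerate(SR)
    f.writeframes(pcm.tobytes())
```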

This isn't merely about volume or suddenness. Horror directors also use low-frequency sounds — infrasound, drones, sub-bass rumbles — that create unease without any identifiable sound source. These frequencies sit at or below the threshold of conscious hearing but can be felt by the body. Audiences feel dread without knowing why. The image contributes to that dread, but it is the audio layer that initiates and sustains the body's physiological state.

What horror teaches editing, then, is this: audio establishes emotional context before the image arrives, and this context shapes how the image is interpreted. A shot of a person looking off-screen means something entirely different if the soundtrack is a sustained minor-key drone versus a cheerful melodic tag. The image is identical; the meaning changes. This is the Kuleshov Effect operating in the frequency domain.

The J-Cut: When Sound Arrives First

The J-cut — so named because on a timeline, the audio track of the incoming scene extends leftward beneath the tail of the outgoing picture, resembling the descending stroke of a "J" — is one of the most commonly used editorial transitions in cinema, and also one of the most misunderstood by beginning editors. In a J-cut, the sound of the next scene begins while we're still seeing the image of the previous scene. The picture cut comes after the audio has already shifted.

```mermaid
graph LR
    subgraph "J-Cut Timeline"
    A["Scene A: Image ————————|"]
    B["Scene B: Image              |————————"]
    C["Scene A: Audio ———————|"]
    D["Scene B: Audio        |————————————"]
    end

    note["Audio of Scene B begins BEFORE the picture cuts to Scene B<br/>(Audio extends left like the bottom of a 'J')"]

    style D fill:#f9a,stroke:#f66
    style note fill:#ffffcc,stroke:#999
```

Why does this feel more natural than a hard cut, where audio and video change simultaneously? Because it reflects how we actually experience transitions in the real world.

In daily life, when you walk from one environment to another — from a crowded party into a quieter hallway, from an office into a street — you don't experience a simultaneous replacement of all sensory information. Sound tends to bleed across spatial transitions. You hear the party behind you before the door fully closes. You hear the traffic outside before you push the door open. Your auditory environment and your visual environment update on slightly different schedules. The J-cut replicates this perceptual reality, which is probably why audiences accept it so readily — it exploits the same mechanisms we identified in earlier sections, where the brain is primed to accept cuts that mirror the structure of its own perceptual processing.

There's also a narrative function at work. By hearing the next scene before we see it, the J-cut creates a kind of anticipatory pull. We hear voices, music, or ambient sound from a new environment while still looking at the old one. This creates temporal tension — a mild cognitive gap between audio and visual that the brain wants to resolve by seeing the source of the new sound. The cut to the new image, when it finally comes, satisfies that anticipation. It lands with more intention than a hard cut would have.

Practical example: imagine a scene where a character finishes a phone call with bad news. Hard cut to the next scene — they're at work, surrounded by colleagues, functioning normally. The emotional disconnect is abrupt, even jarring. But if we cut it as a J-cut — we hear the sounds of the office (keyboards, distant chatter, a phone ringing) while we're still on the close-up of the character receiving the bad news — something different happens. The mundane audio of normal life creates ironic counterpoint with the face absorbing difficult information. The emotional weight deepens. Then we cut to the office, and we're already inside it emotionally before the image confirms our location. That's the J-cut working as it should.
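If you want to prototype this yourself, here is how that phone-call J-cut might be roughed in programmatically. This is a minimal sketch assuming the moviepy library (1.x API); the filenames and the two-second lead are placeholders for your own footage.

```python
# J-cut sketch (moviepy 1.x assumed): scene B's audio slides under the tail
# of scene A's picture. Filenames and LEAD are illustrative placeholders.
from moviepy.editor import VideoFileClip, CompositeAudioClip, concatenate_videoclips

LEAD = 2.0  # seconds of office sound heard before the picture cut

scene_a = VideoFileClip("phone_call.mp4")  # close-up receiving the bad news
scene_b = VideoFileClip("office.mp4")      # keyboards, chatter, a phone ringing

# Picture: hard cut from A to B. Scene B's picture is trimmed by LEAD so it
# stays in sync with its own audio, which starts LEAD seconds early.
picture = concatenate_videoclips([scene_a, scene_b.subclip(LEAD).without_audio()])

# Audio: scene A's sound ends LEAD seconds before its picture does; scene B's
# full audio takes over at that point, playing under scene A's final close-up.
audio_a = scene_a.audio.subclip(0, scene_a.duration - LEAD)
audio_b = scene_b.audio.set_start(scene_a.duration - LEAD)

j_cut = picture.set_audio(CompositeAudioClip([audio_a, audio_b]))
j_cut.write_videofile("j_cut.mp4")
```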

The L-Cut: When the Image Arrives First

The inverse operation is the L-cut (sometimes called an "audio delay" or "split edit"): the picture cuts to the new scene while the audio of the previous scene continues playing. The audio tail extends rightward past the picture cut on the timeline — like the base stroke of an "L" — so we see the new image while still hearing the old sound.

```mermaid
graph LR
    subgraph "L-Cut Timeline"
    A["Scene A: Image ————————|"]
    B["Scene B: Image              |————————"]
    C["Scene A: Audio ———————————————|"]
    D["Scene B: Audio                        |————"]
    end

    note["Audio of Scene A continues AFTER the picture cuts to Scene B<br/>(Audio extends right like the foot of an 'L')"]

    style C fill:#adf,stroke:#66f
    style note fill:#ffffcc,stroke:#999
```

This creates a fundamentally different effect from the J-cut. The L-cut is excellent for creating temporal overlap — a feeling that two moments or spaces exist simultaneously, that time is layered rather than sequential. It's also useful for establishing emotional residue: if a character has just experienced something traumatic and we cut to a new location while their labored breathing or a lingering musical phrase continues, the emotional weight of the previous scene travels across the cut. The new image arrives, but the feeling hasn't been fully left behind yet.
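The timeline arithmetic mirrors the J-cut, as a sketch under the same assumptions shows (moviepy 1.x API, placeholder filenames and trail length):

```python
# L-cut sketch (moviepy 1.x assumed): scene A's sound trails TRAIL seconds
# past the picture cut into scene B. Filenames and TRAIL are placeholders.
from moviepy.editor import VideoFileClip, CompositeAudioClip, concatenate_videoclips

TRAIL = 2.0  # seconds of scene-A audio heard after the picture cut

scene_a = VideoFileClip("aftermath.mp4")  # labored breathing, lingering music
scene_b = VideoFileClip("new_location.mp4")

# Picture: scene A's image ends TRAIL seconds early, so its audio can run over.
picture = concatenate_videoclips([
    scene_a.subclip(0, scene_a.duration - TRAIL).without_audio(),
    scene_b.without_audio(),
])

# Audio: scene A's sound plays in full, covering the first TRAIL seconds of
# scene B's picture; scene B's own sound then enters, trimmed to stay in sync.
# The two are butted rather than crossfaded, for simplicity.
audio_a = scene_a.audio
audio_b = scene_b.audio.subclip(TRAIL).set_start(scene_a.duration)

l_cut = picture.set_audio(CompositeAudioClip([audio_a, audio_b]))
l_cut.write_videofile("l_cut.mp4")
```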

Documentary filmmakers use L-cuts extensively for narration — cutting to illustrative images while the speaker's voice continues from the previous setup. The picture establishes a new context; the voice arrives as ongoing commentary. This can create a feeling of images discovering their meaning through sound rather than the reverse.

Both J-cuts and L-cuts belong to the broader category of what editors describe as audio transitions that control emotional tempo — the rate at which a viewer's emotional state is allowed to update as a scene changes. Hard cuts change everything at once, demanding an immediate recalibration. Audio transitions stagger the update, giving the brain's emotional processing system more time to carry meaning across the cut.

Walter Murch: Sound First, Picture Second

Walter Murch — whose "Rule of Six" we examined in the previous section — makes an even more radical claim about the relationship between sound and picture in editing. In his practice, and in his theoretical writing, Murch argues that sound decisions should be made first, with picture cuts following from them.

This inverts the common practice. Most editorial workflows treat sound as something you refine after the picture cut is locked or nearly locked. Murch's view is that this gets things backwards. Because sound is neurologically primary to emotional response, letting picture cuts drive audio choices means letting the slower, more analytical sensory pathway dictate terms to the faster, more emotional one. Murch would build the sound design and musical landscape of a scene first, letting the rhythm of the audio suggest where the cuts should fall.

"The sound film that has existed since the late 1920s is a compound of image and sound that are in many ways equal partners," Murch has said, and his editorial practice reflects this claim. He served as both editor and sound designer on films such as Apocalypse Now and The English Patient, often performing the two roles simultaneously, an almost unprecedented practice at the time. The result was films where audio wasn't supporting the picture but co-constructing the experience alongside it.

The practical lesson for editors: don't think of sound as the final coat of paint. Build the audio landscape of a scene early, let it inform how you cut picture, and notice how different musical or sound design choices suggest different editorial rhythms. The same footage, scored differently, will often want to be cut differently. A slow, sustained piece of music resists short, staccato cuts — you'll fight it. A percussive, high-energy score demands pace. This isn't just aesthetics. The music is setting the expectation in the audience's temporal processing for how frequently updates should arrive.
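One rough, mechanical way to explore this (not Murch's method, just an experiment) is to detect the score's beats and treat them as candidate cut points. A sketch assuming the librosa audio library is available; the filename and the planned cut times are hypothetical.

```python
# Snap rough picture-cut times to the score's detected beats (librosa assumed).
import librosa

y, sr = librosa.load("score.wav")  # placeholder music track
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
beat_times = librosa.frames_to_time(beat_frames, sr=sr)

planned_cuts = [3.2, 7.9, 12.4]  # rough cut points from a first pass (seconds)
snapped = [min(beat_times, key=lambda b: abs(b - t)) for t in planned_cuts]

print(f"Estimated tempo: {float(tempo):.1f} BPM")
print("Cuts snapped to beats:", [round(float(s), 2) for s in snapped])
```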

Music as Emotional Meta-Frame

When music plays under a scene, it does something often described vaguely as "setting the mood." But what's actually happening neurologically is more specific and more powerful. Music activates the brain's emotional processing systems — including the amygdala, the nucleus accumbens (part of the reward circuit), and the prefrontal cortex — and it does so in ways that create an anticipatory framework through which the audience interprets the images they see.

Psychologists call this "emotional priming." The music isn't just reacting to what we see; it's telling the brain what emotional lens to apply to what it's about to see. Research into neural synchrony during film viewing has found that one of the most powerful drivers of inter-subject correlation — the phenomenon where multiple viewers' brains show synchronized activity patterns — is the film's audio track, particularly its musical underscore.

Think about what this means: when an audience watches the same film together, their brains synchronize. They're not just having similar thoughts — measurable neural oscillations begin to align. And audio is a major engine of this synchrony. The music is literally entraining the audience's brains together, creating a shared emotional state that makes their interpretation of the images more uniform, more collectively felt. This is why watching a film in a theater with a thousand people feels categorically different from watching it alone at home — the shared physiological response is substantially an audio phenomenon. The score that makes you cry works partly because it's making every body in the theater tense and release together, a collective arousal that social perception circuits pick up on and amplify. You're not just responding to the music. You're responding to everyone responding to the music.

This is also why removing a score from a scene so often makes it feel ambiguous or cold. The image still contains all its visual information, but the frame that tells the brain how to categorize that information emotionally has been removed. Audiences, left to their own devices, will feel a wider range of responses. Music collapses that distribution, aligning the audience's emotional experience around a common mean.

For editors, this has direct implications for every cut. When you're cutting against music — or choosing where a piece of music enters and exits — you're not decorating the picture. You're setting and resetting the emotional frame through which the entire sequence will be experienced, and you're doing it simultaneously for every person in the room. A musical swell that peaks as a character makes a decision amplifies the decision's weight. The same image, on silence, demands that the weight be supplied by performance and context alone. Both choices are valid. But they're different editorial choices, with different neurological effects — and, at scale, different social ones.

The Silence Cut: Sound's Most Powerful Tool

Of all the audio tools in an editor's arsenal, the most underestimated — and frequently the most powerful — is the strategic removal of sound.

The human auditory system evolved to treat silence as information. In the natural world, sudden silence is often a warning sign: the birds stop singing when a predator approaches. Silence means something. The brain doesn't register silence as neutral; it registers it as significant, as a change of state requiring attention.

In film, the "silence cut" exploits this. You can cut the audio — drop the score, drop the ambient sound, drop everything — and achieve an effect that no amount of sound could match. The moment becomes weightless and enormous at the same time. When Saving Private Ryan's opening beach sequence momentarily drops all sound to show the extreme underwater close-up, the effect is more terrifying than anything the preceding chaos could have produced. The absence of sound doesn't mean the absence of feeling — it amplifies feeling by removing the expected audio scaffolding.

Dialogue scenes benefit enormously from this principle. A scene running wall-to-wall with score can feel overwrought. But a scene of emotional confrontation played in complete silence — where the audience supplies their own internal soundtrack — can be devastating. The brain, deprived of musical instruction on how to feel, reaches into its own emotional memory to fill the gap. The silence invites the viewer in rather than telling them what to think.

Practically: editors often overcrowd scenes with audio because silence feels risky. It can feel like something went wrong, like the room tone dropped out. The skill is learning to distinguish between accidental silence (a technical error) and intentional silence (an editorial choice). The difference, often, is in the frames before and after. If you've built genuine emotional stakes, silence will feel right. If the stakes aren't there yet, silence will feel empty. The silence cut is a test of whether your prior editing has done its work.
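If you want to audition an intentional silence quickly, it is easy to rough in with software. A minimal sketch, again assuming the moviepy 1.x API, with a placeholder file and timings:

```python
# Silence-cut sketch (moviepy 1.x assumed): drop ALL sound for a chosen span.
from moviepy.editor import VideoFileClip, concatenate_audioclips

clip = VideoFileClip("confrontation.mp4")  # placeholder scene
t1, t2 = 42.0, 47.5                        # the span to silence, in seconds

audio = clip.audio
muted = concatenate_audioclips([
    audio.subclip(0, t1),
    audio.subclip(t1, t2).volumex(0),  # true zero, not merely quieter
    audio.subclip(t2),
])

clip.set_audio(muted).write_videofile("silence_cut.mp4")
```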

Dialogue Editing and the Rhythm of Speech

There's an entire specialized craft within editing dedicated purely to dialogue: the timing of when to cut between speakers, and how the sonic rhythm of speech aligns with (or cuts against) the visual rhythm of performance.

Beginning editors tend to cut on visual cues — they cut when a speaker finishes talking, or when the action completes, or when they see a compelling reaction shot. Experienced editors cut at least as much on sonic cues. The breath before a line. The beat of silence after. The rhythm of a particular actor's speech pattern, which may have a cadence quite distinct from the meaning of the words.

Some actors speak with natural internal punctuation — beats and pauses that feel written into their delivery. Others run words together in ways that resist conventional cutting. Editing dialogue effectively means learning to hear these patterns and cut in ways that feel rhythmically natural even when they're visually unconventional. A cut in the middle of a sentence, if it lands on a rhythmic beat in the speaker's delivery, can feel perfectly smooth. A cut at the end of a complete sentence, if it lands awkwardly against the sonic envelope of the delivery, will feel slightly off.

The classic editorial move is cutting to the listener during dialogue — the reaction shot. When you cut to the listener is partly a visual question (is their reaction compelling enough to pull us away from the speaker?) and partly a sonic one (is there a natural break in the speaker's audio where the cut won't feel disruptive?). The brain's event segmentation system, which we discussed in earlier sections, identifies boundaries partly through changes in auditory information — including the natural coarticulation patterns of speech. Editors, whether they know it or not, are exploiting these speech boundaries when they make dialogue cuts feel smooth.
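Those natural breaks can even be located mechanically as a first pass. The sketch below is an illustration, not a standard editorial tool: it scans a dialogue track's short-window energy (assuming a 16-bit PCM WAV, using numpy and the standard-library wave module) and reports spans quiet enough to be breaths or beats of silence. The thresholds are illustrative; the ear makes the final call.

```python
# Find pauses in a dialogue track by RMS energy gating.
import wave
import numpy as np

WINDOW = 0.05     # analysis window, seconds
MIN_PAUSE = 0.25  # ignore gaps shorter than a natural breath

with wave.open("dialogue.wav", "rb") as f:  # assumes 16-bit PCM
    sr = f.getframerate()
    channels = f.getnchannels()
    raw = f.readframes(f.getnframes())

samples = np.frombuffer(raw, dtype=np.int16).astype(np.float64)
if channels == 2:
    samples = samples.reshape(-1, 2).mean(axis=1)  # mix to mono

win = int(sr * WINDOW)
n = len(samples) // win
rms = np.sqrt((samples[: n * win].reshape(n, win) ** 2).mean(axis=1))
quiet = rms < 0.1 * np.median(rms)  # "quiet" relative to this track itself

# Collect runs of consecutive quiet windows longer than MIN_PAUSE.
pauses, start = [], None
for i, is_quiet in enumerate(quiet):
    if is_quiet and start is None:
        start = i
    elif not is_quiet and start is not None:
        if (i - start) * WINDOW >= MIN_PAUSE:
            pauses.append((start * WINDOW, i * WINDOW))
        start = None
if start is not None and (n - start) * WINDOW >= MIN_PAUSE:
    pauses.append((start * WINDOW, n * WINDOW))

for a, b in pauses:
    print(f"candidate cut point: {a:6.2f}s - {b:6.2f}s")
```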

Sound Design as Spatial Continuity

We've established in previous sections that visual continuity depends on a set of conventions — the 180-degree rule, eyeline matching, directional continuity — that allow the brain to construct coherent screen space across cuts. What's less often discussed is how much of this work is actually being done by audio.

Ambient sound — room tone, environmental sound, the background texture of a location — does something crucial across a cut: it bridges the spatial discontinuity that the visual cut creates. When you cut from one angle to another in the same space, the audio environment (the room's acoustic signature, the ambient hum, the off-screen sounds) remains consistent. This consistency tells the brain that the space is continuous even though the image has changed. Editors who forget to match room tone between shots create an audible artifact that makes the cut feel slightly off — not because anything visual has violated continuity, but because the audio environment has failed to confirm it.
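The fix is usually a continuous room-tone bed laid under both shots, so the ambient signature never jumps at the cut. A sketch under the same moviepy 1.x assumption, with placeholder files; it assumes the room-tone recording is at least as long as the sequence.

```python
# Bridge a visual cut with one unbroken room-tone bed (moviepy 1.x assumed).
from moviepy.editor import (AudioFileClip, CompositeAudioClip, VideoFileClip,
                            concatenate_videoclips)

angle_1 = VideoFileClip("wide.mp4")
angle_2 = VideoFileClip("closeup.mp4")
picture = concatenate_videoclips([angle_1, angle_2])

# One continuous ambient recording, mixed quietly under both shots' audio,
# tells the brain the space is continuous across the picture cut.
room_tone = (AudioFileClip("room_tone.wav")
             .subclip(0, picture.duration)
             .volumex(0.3))

bed = CompositeAudioClip([picture.audio, room_tone])
picture.set_audio(bed).write_videofile("bridged.mp4")
```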

This has a direct corollary: you can use sound design to create spatial continuity that doesn't actually exist in the images. If you need to cut between two locations and want the audience to feel they're connected (a phone conversation, for instance, or parallel action), consistent or complementary audio treatment tells the brain these spaces are related. Documentary editors do this constantly — cutting between archival footage and modern interviews, using consistent music or ambient sound design to create a sense of temporal unity that the images alone couldn't sustain.

The research on event segmentation helps clarify why this works: the brain identifies event boundaries partly through changes in auditory context. Consistent audio across a visual cut signals to the brain that no event boundary has been crossed — that we're still in the same situation, even though we're looking at a different image. Sound design, used this way, isn't decorating the edit. It's doing fundamental cognitive work.

Practical Exercise: The Mute Test and the Score Swap

The best way to internalize the arguments of this section isn't to read more about them — it's to conduct two simple experiments on a scene you know well.

Exercise 1: The Mute Test

Find a scene from a film where you feel a strong emotional response. Watch it with the sound off. Notice which parts of the emotional response survive and which disappear. Then watch it with the picture off (or just look away from the screen). Notice what the audio alone conveys. You're isolating the two channels to understand their individual contributions. Most viewers are surprised to discover how much of their emotional response to a scene they were attributing to performance or cinematography was actually being generated by audio.

Exercise 2: The Score Swap

If you have access to editing software — even basic free tools — take a short scene and replace its original score with music of a radically different emotional register. A tense confrontation scored with gentle, romantic strings. A love scene scored with a dissonant, unsettling drone. A comedic moment scored with a funeral march. Keep everything else identical: same picture cut, same dialogue, same sound design. Observe how the emotional meaning of the identical images changes. This is the Kuleshov Effect in the frequency domain. It should be disturbing how completely the music rewrites the scene's meaning.
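A rough version of the swap, under the same moviepy 1.x assumption. One honest caveat: without separate dialogue and effects stems you can't surgically remove the original score, so this sketch lays the new music under the scene's full original mix, which is usually enough to feel the effect.

```python
# Score-swap sketch (moviepy 1.x assumed): new music under the original mix.
from moviepy.editor import AudioFileClip, CompositeAudioClip, VideoFileClip

scene = VideoFileClip("confrontation_scene.mp4")  # placeholder scene

# A radically different emotional register, e.g. gentle, romantic strings.
new_score = (AudioFileClip("romantic_strings.mp3")
             .subclip(0, scene.duration)
             .volumex(0.6))

swapped = scene.set_audio(CompositeAudioClip([scene.audio, new_score]))
swapped.write_videofile("score_swap.mp4")
```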

These exercises aren't just pedagogical. They're the kind of thing experienced editors do in the cutting room all the time — trying music against picture before committing to anything, keeping sonic options open long after picture is nominally locked, because they know that the audio decisions aren't refinements on top of the editorial choices. They are editorial choices, co-equal in weight with every decision about where to place a cut.


The argument of this section is simple enough to state but surprisingly radical in practice: sound is not the soundtrack to your edit. Sound is half the edit. The neurological mechanisms underlying this — the amygdala's fast pathway, the brain's susceptibility to auditory priming, the inter-subject synchrony driven by music — are not peripheral facts about cinema. They're the explanation for why audio decisions feel so consequential when you get them right and so wrongheaded when you get them wrong. You're not just choosing what people hear. You're choosing the emotional state through which everything they see will be interpreted. That's not a secondary concern. That's the cut.