Hold a pen between your teeth and cartoons are supposed to seem funnier. Seventeen labs ran the experiment and found nothing. Then the original author proposed that a video camera in the room was the reason, and the next study found he was partly right. The strange story of the Strack pen-in-mouth experiment is one of the cleanest case studies available of how a single methodological detail can determine whether a behavioral-science finding replicates.
Try this. Find a pen. Hold it sideways between your teeth so that your lips are pulled back into something resembling a smile but without your facial muscles doing the work. Now read a cartoon --- almost anything from The New Yorker will do.
Now try it again, but this time hold the pen between your lips, sucking it in like a straw, so that your face is pulled into something resembling a frown.
According to a 1988 paper that became one of the most-cited findings in social psychology, the cartoon should seem funnier when you have the pen between your teeth. The act of pulling your face into a smile-shape --- even when you have no idea that’s what you’re doing --- should change how you perceive humor. Your body is supposed to tell your brain what to feel.
For nearly thirty years, this experiment was a textbook demonstration of what philosophers and psychologists called “embodied cognition” --- the idea that your body’s state shapes your mind’s state in measurable ways. The pen-in-mouth study was clean, clever, and counterintuitive. It was the kind of result that made introductory psychology lectures feel like they were teaching something profound about how human beings worked.
Then in 2016, seventeen laboratories around the world tried to replicate the original study. Combined sample: nearly two thousand people. They found nothing.
But the story doesn’t end there. The original author, Fritz Strack, didn’t just defend his finding in the usual way (more studies, motivated reanalysis, dismissal of the replication on technical grounds). He proposed something specific and testable. He claimed the replication had failed because of a single methodological difference: the replication labs had filmed their participants with a video camera, and the original 1988 study had not. The presence of a camera, Strack argued, made participants self-conscious about their facial expressions in a way that disrupted the effect.
This was an unusually concrete defense. It was also testable. And in 2018, when a research team actually tested it, what they found was --- uncomfortably for everyone --- that Strack was at least partly right.
This article is about that detective story. Not “did the facial feedback hypothesis replicate?” --- that question turns out to be too simple to be the right question. The real question is “what does it mean when a single methodological detail can determine whether a behavioral finding shows up?” The answer is more interesting and more useful than either “the original was real” or “the original was wrong.”
What the Original Study Actually Did
The founding paper is Strack, Martin & Stepper (1988), “Inhibiting and facilitating conditions of the human smile: A nonobtrusive test of the facial feedback hypothesis,” in the Journal of Personality and Social Psychology. Study 1 had a sample of 92 undergraduates at the University of Illinois at Urbana-Champaign.
The design was elegant. Participants were told they were taking part in a study on how people with disabilities use their mouths to perform tasks. Each participant was given a pen and asked to hold it in one of three positions: between their teeth (creating a smile-like configuration), between their lips (creating a frown-like configuration), or in their non-dominant hand (control). While holding the pen, they rated a series of cartoons for funniness on a 10-point scale.
The headline result: cartoons were rated about 0.82 points funnier in the teeth condition than in the lips condition. The hand (control) condition fell in between, with the teeth condition about 0.4 points funnier than the hand condition.
The interpretation was that the physical configuration of the face, independent of any conscious smile, was producing a measurable shift in how humor was perceived. Because the cover story gave participants no reason to consciously infer “I am smiling, therefore I am happy” (they believed they were testing how tasks can be performed without the use of the hands), the effect had to be operating on a more direct, embodied pathway.
This was a small but suggestive finding. It became the touchstone of a much larger research program that came to be called embodied cognition. Studies built on the framework appeared throughout the 1990s and 2000s, each reporting a similar body-to-mind effect: holding a warm cup was said to make you judge other people as warmer, sitting in a hard chair to make you negotiate more rigidly, standing tall to make you feel more confident. Many of these subsequent claims have not survived replication either. But the pen-in-mouth study was the founding moment, and for nearly three decades it was treated as well established.
The 2016 Registered Replication Report
In 2016, the journal Perspectives on Psychological Science coordinated a Registered Replication Report (RRR) of the Strack 1988 study. Wagenmakers and seventeen co-author labs preregistered an identical protocol with input from Strack himself. Each lab ran the same experiment using the same cartoons, the same pen positions, the same rating scale, and the same exclusion criteria. The combined sample was 1,894 participants.
The result: the mean difference between the teeth and lips conditions was 0.03 points on the 10-point scale, with a 95% confidence interval from -0.11 to +0.16. Essentially zero. The confidence interval was tight enough to rule out any meaningful effect of the size originally claimed.
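As a rough back-of-envelope check on how decisively that interval excludes the original estimate, the sketch below backs out an approximate standard error from the reported 95% interval and measures the distance to the 1988 difference of 0.82 points. It assumes the interval is a symmetric normal-theory interval and it ignores the original study's own uncertainty, so it is an illustration rather than a reanalysis. The function and variable names are made up for this example.

```python
from statistics import NormalDist


def se_from_ci(lower, upper, level=0.95):
    """Approximate standard error implied by a symmetric normal-theory CI."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return (upper - lower) / (2 * z)


# Figures quoted in the article
rrr_estimate = 0.03          # pooled teeth-minus-lips difference, 2016 RRR
rrr_ci = (-0.11, 0.16)       # reported 95% confidence interval
original_difference = 0.82   # teeth-minus-lips difference, 1988 study

se = se_from_ci(*rrr_ci)
gap_in_se = (original_difference - rrr_estimate) / se

print(f"implied standard error: {se:.3f} points")
print(f"original estimate lies {gap_in_se:.1f} standard errors above the pooled estimate")
```

Under these assumptions the implied standard error is about 0.07 points, which puts the original 0.82-point difference more than ten standard errors away from the pooled replication estimate.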
Publication of the RRR was a defining moment for the replication crisis. The Strack finding was canonical. It was in every textbook. It was the cleanest demonstration available of embodied cognition. And in seventeen carefully run replications, it disappeared.
Strack’s response in the same issue was unusual. Rather than simply defending the original finding or attacking the replication methodology, he proposed a specific testable hypothesis: the replication had failed because the labs had video-recorded their participants. The original 1988 study did not use cameras. Strack argued that the presence of a camera made participants self-conscious about how their faces looked, which disrupted the spontaneous, unselfconscious facial-feedback process that the experiment was supposed to test.
This was, on the face of it, the kind of post-hoc defense that researchers offer when their findings fail to replicate. But Strack’s claim was concrete enough to be tested directly. And someone did.
The 2018 Camera Test
In 2018, Noah, Schul & Mayo published “When both the original study and its failed replication are correct: Feeling observed eliminates the facial-feedback effect,” in the Journal of Personality and Social Psychology. The design was the cleanest possible test of Strack’s hypothesis: run the pen-in-mouth experiment with a camera in some conditions and without a camera in others.
The result was striking. In the conditions without a camera --- matching the 1988 original --- the facial-feedback effect appeared. Cartoons were rated funnier in the teeth condition than the lips condition. In the conditions with a camera --- matching the 2016 RRR --- the effect did not appear.
Noah and colleagues had effectively shown that both the original 1988 study and the 2016 replication were, in some sense, correct. The 1988 study had measured a real effect under specific conditions. The 2016 replication had correctly found no effect under conditions that included a camera. The disagreement between the two studies was not “one of them was wrong” but “the conditions that produced the original effect had been changed in the replication.”
This was an uncomfortable result for everyone. It meant the field couldn’t simply file the facial-feedback hypothesis under “didn’t replicate, ignore.” It also meant the field couldn’t comfortably defend the broader pen-in-mouth methodology --- the effect was apparently so fragile that adding a video camera, a tiny methodological detail, could make it vanish entirely.
The 2019 Meta-Analysis Complication
The Noah paper might have settled the camera question, but in 2019 Coles, Larsen & Lench published a much broader meta-analysis in Psychological Bulletin: “A meta-analysis of the facial feedback literature: Effects of facial feedback on emotional experience are small and variable.” They aggregated 286 effect sizes from 138 studies across the full facial-feedback literature.
The aggregate facial-feedback effect across all studies was small (d ≈ 0.20) and highly heterogeneous. More relevant to the camera-moderator hypothesis: when Coles and colleagues tested whether the presence or absence of cameras systematically moderated effects across the full literature, they did not find clear evidence that it did. The camera moderator that worked beautifully in the focused Noah 2018 test did not, when tested across many studies with many other differences, emerge as a clean moderator.
This left the field in an awkward position. The Noah camera-moderator finding was real in their specific test. The broader literature did not show camera-presence as a clean moderator across many studies. Both could be true if camera-presence interacts with other methodological features in ways that aren’t visible when you aggregate across heterogeneous studies. But it meant the simple “Strack was right, the camera was the problem” interpretation was harder to defend than it had looked in 2018.
The Many Smiles Collaboration (2022)
The most recent and most methodologically careful effort came in 2022. Coles, Larsen, and dozens of collaborators ran the Many Smiles Collaboration, published in Nature Human Behaviour: “A multi-lab test of the facial feedback hypothesis by the Many Smiles Collaboration.” Combined sample: 3,878 participants across 19 countries.
The Many Smiles team tested three different facial-feedback paradigms: facial mimicry (asking participants to copy expressions in photographs), voluntary action (explicitly asking participants to smile or frown), and the original Strack pen-in-mouth task. They preregistered the protocol, including which moderators they expected to matter.
The results were nuanced and worth understanding precisely.
The mimicry and voluntary action paradigms produced small but reliable effects: when participants made happy faces, they reported feeling marginally happier. The effect was modest (d ≈ 0.20 to 0.30 depending on paradigm) but consistent enough to constitute support for the broader facial-feedback hypothesis.
The pen-in-mouth paradigm --- the specific Strack 1988 task --- remained inconclusive. The effect was small and not consistently distinguishable from zero in this large multi-country sample.
The Many Smiles result is now the most authoritative single piece of evidence on the question. It suggests three things. First, the broader facial-feedback hypothesis has modest support: facial muscles do appear to feed back into emotional experience in some conditions. Second, the specific pen-in-mouth paradigm is not a reliable demonstration of the effect, regardless of whether cameras are present. Third, the Strack 1988 result was probably a lucky detection of a real but small effect: the sample was small, and within the pen-in-mouth paradigm specifically the effect is not robust enough to emerge consistently.
Why the Original Looked Real
The pen-in-mouth story is unusual in the replication-crisis literature because the failure isn’t simply “small sample, lucky finding, publication bias.” It’s more nuanced.
The original was probably a real but small effect. Unlike some replication failures where the original finding appears to have been mostly noise, the facial-feedback literature aggregates to a real but small effect (Coles 2019, d ≈ 0.20). Strack 1988 detected a version of that effect. The problem is that an effect this small is hard to detect at his sample size: with roughly 30 participants per condition and a true d near 0.20, a simple two-group comparison would reach significance only about one time in eight, well short of a coin flip. Which studies end up published then depends heavily on which ones happened to get lucky.
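To put a rough number on the detection chance discussed above, here is a minimal power sketch. It makes two assumptions that are not in the original paper: a simple two-group, two-sided comparison (the 1988 analysis used a contrast across three conditions), and about 31 participants per condition (92 split three ways). It also uses a normal approximation rather than an exact t-test calculation. The function names are illustrative only.

```python
from math import ceil, sqrt
from statistics import NormalDist

_NORM = NormalDist()


def approx_power_two_groups(d, n_per_group, alpha=0.05):
    """Normal-approximation power for a two-sided, two-group comparison."""
    z_crit = _NORM.inv_cdf(1 - alpha / 2)
    expected_z = d * sqrt(n_per_group / 2)  # mean of the test statistic if d is true
    upper_tail = 1 - _NORM.cdf(z_crit - expected_z)
    lower_tail = _NORM.cdf(-z_crit - expected_z)
    return upper_tail + lower_tail


def n_per_group_for_power(d, target_power=0.80, alpha=0.05):
    """Approximate per-group sample size needed to reach the target power."""
    z_alpha = _NORM.inv_cdf(1 - alpha / 2)
    z_beta = _NORM.inv_cdf(target_power)
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)


# Assumed inputs: meta-analytic d from the article, ~92 / 3 participants per cell
power_1988 = approx_power_two_groups(d=0.20, n_per_group=31)
needed = n_per_group_for_power(d=0.20)

print(f"approximate power at n = 31 per group: {power_1988:.2f}")
print(f"approximate n per group for 80% power: {needed}")
```

Under these assumptions the power comes out at roughly 12%, and reaching the conventional 80% would take just under 400 participants per group. The exact figures would shift with a different test or allocation, but the order of magnitude is the point.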
The original effect size was inflated by small-sample noise. Strack 1988 reported a raw difference of about 0.8 points on a 10-point scale. The aggregated meta-analytic estimate across 138 studies is a standardized d ≈ 0.20, which is a different unit but one that conventionally counts as a small effect. Small samples not only have low statistical power; when they happen to detect an effect, they tend to detect it with inflated effect sizes (a phenomenon called the winner’s curse). The 1988 sample was small enough that the detected effect, if real, was probably exaggerated.
The cultural amplification was disproportionate to the evidence. The pen-in-mouth study became a textbook example partly because it was clever and partly because it lined up with a broader cultural narrative about embodied cognition. A small N=92 study with an exaggerated effect size became the iconic demonstration of a field. Even when the field has since revised down both the effect size and the certainty around the specific paradigm, the cultural memory of “pen in mouth = funnier cartoons” is much stronger than the revised academic version.
The methodology was fragile in ways no one understood at the time. Whatever the facial-feedback effect actually is, it’s apparently sensitive enough to small methodological choices (was a camera in the room?) that any single study is partly measuring those choices rather than the underlying construct. This is itself important to understand: behavioral-science findings that depend on subtle methodological conditions are not reliable platforms for general claims about human nature.
The Honest Verdict Today
The honest current picture has three layers.
Layer 1: The broader facial-feedback hypothesis has modest support. Across paradigms and across countries, facial muscle configurations do appear to feed back into emotional experience in small ways. The effect is real, modest (d ≈ 0.20), and heterogeneous.
Layer 2: The specific Strack pen-in-mouth paradigm is not a reliable demonstration of this. Even after careful methodological work, the pen-in-mouth task does not consistently produce the effect, regardless of camera conditions. The 1988 finding was probably an inflated detection of a real but smaller effect.
Layer 3: The camera-moderator story is partly true but does not rescue the original methodology. Strack’s hypothesis that cameras disrupt facial feedback was supported by the Noah 2018 study, but the broader literature doesn’t show this moderator as cleanly. The honest verdict: in some conditions, camera presence may disrupt the effect; in others, the effect may not be present regardless.
If you are an educator teaching the facial-feedback hypothesis, the responsible current framing is: “There is modest evidence that facial expressions feed back into emotional experience. The famous pen-in-mouth study has not reliably replicated, but better-designed studies suggest a small effect exists. The pen-in-mouth experiment should be taught as a historically influential paradigm rather than as a current demonstration of the phenomenon.”
If you are a consumer of pop-psychology claims about “fake-smiling your way to happiness” or “your face shapes your mood,” the responsible current framing is: there might be a small effect, but it’s modest, conditional, and not reliable enough to build behavior changes around. Do not expect to engineer your mood by aggressively smiling at yourself in the mirror.
What This Means If You’re a Strategist
Three takeaways from the pen-in-mouth detective story, all of which generalize beyond facial feedback.
1. “Did it replicate?” is sometimes the wrong question. The facial-feedback story shows that replication outcomes can depend on methodological details that were considered trivial at the time of original publication. A behavioral finding might be “real” in some conditions and “not present” in others, and which version is reported in the headline depends on which specific conditions any given study happened to use. When you are evaluating a behavioral-science claim, the binary “did it replicate?” question is sometimes less informative than “under what specific conditions does this effect appear?”
For organizational decisions, this means: if you are applying a behavioral-science finding to a real situation, the specific conditions of your application matter more than the average literature might suggest. A finding that depends on a specific moderator (camera presence, time of day, sample demographics) may not generalize to your context even if the general effect is real. This is especially important for any UX, hiring, or marketing application of social-psychology findings --- your conditions are almost never identical to the conditions of the original studies.
2. Small-sample effect-size inflation is a systematic problem in the social sciences. The original Strack study reported a raw difference of about 0.8 points, while the aggregated meta-analytic estimate is a standardized d ≈ 0.20. The two numbers are in different units, so they cannot be divided into a clean ratio, but the original estimate was almost certainly on the high side. This pattern, where the original finding has an inflated effect size relative to the true effect, is common in social psychology. Small samples have low power to detect true effects, and when they happen to detect them, they tend to overestimate the size. This is called the winner’s curse, and it means that the effect size you read about in any individual paper is probably a high estimate.
For practical decisions, this means: when a behavioral-science finding claims a large effect from a small original study, the effect may well be real but is probably smaller than reported. Apply a mental discount factor. As a rough rule of thumb, a reported d = 0.8 from N = 92 should be treated as more like d = 0.3 in expectation. This calibration is useful when deciding how much weight to put on any individual finding.
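The discounting logic can be illustrated with a small simulation of the winner's curse. This is a sketch under stated assumptions, not a model of any real literature: it assumes a true effect of d = 0.20, two groups of 31 participants, estimates that are normally distributed around the truth with standard error sqrt(2/n), and a filter that only "publishes" results significant in the predicted direction. All names and parameter values are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def simulate_winners_curse(true_d=0.20, n_per_group=31, n_studies=20000,
                           alpha=0.05, seed=0):
    """Simulate many small studies and compare all estimates with significant ones."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(2 / n_per_group)  # approximate standard error of d, equal groups

    all_estimates = []
    significant = []
    for _ in range(n_studies):
        d_hat = rng.gauss(true_d, se)   # one study's estimated effect
        all_estimates.append(d_hat)
        if d_hat / se > z_crit:         # significant in the predicted direction
            significant.append(d_hat)

    return {
        "mean_all": mean(all_estimates),
        "mean_significant": mean(significant),
        "share_significant": len(significant) / n_studies,
        "smallest_significant": z_crit * se,
    }


result = simulate_winners_curse()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

With these settings, no estimate below roughly d = 0.5 can reach significance at all, so the studies that clear the bar average around three times the true value even though the full set of estimates is unbiased. That gap is what the mental discount is meant to correct for.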
3. Fragile findings rarely make robust foundations for strategy. The facial-feedback effect is sensitive to small methodological details: camera presence, sample size, exact task instructions. Findings with this kind of fragility are poor foundations for organizational decisions, regardless of whether the underlying effect is technically real. A finding that disappears when a video camera is added to the room is not a finding you can count on to hold up when you apply it to your customers, employees, or product design.
The general principle: prefer findings that are robust across methodological variations and across populations. Robustness --- meaning the effect shows up across multiple paradigms, multiple labs, and multiple populations --- is a stronger signal than effect size in any single study. If a finding requires a specific methodology to appear, treat it as a curiosity rather than a tool.
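One common way to quantify this kind of robustness across labs is a heterogeneity statistic such as Cochran's Q and the derived I². The sketch below computes both for two sets of per-lab estimates. The numbers are entirely hypothetical, invented for illustration and not taken from any study discussed here; they are chosen so that both sets pool to the same average effect.

```python
def heterogeneity(effects, std_errors):
    """Fixed-effect pooled estimate, Cochran's Q, and I-squared for a set of studies."""
    weights = [1 / se ** 2 for se in std_errors]  # inverse-variance weights
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    # I-squared: share of total variation attributable to between-study differences
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, i_squared


# HYPOTHETICAL per-lab effect sizes (d) with equal standard errors
std_errors = [0.05] * 5
robust_labs = [0.18, 0.22, 0.20, 0.19, 0.21]     # every lab finds about the same thing
fragile_labs = [0.60, -0.05, 0.02, 0.45, -0.02]  # labs disagree sharply

robust = heterogeneity(robust_labs, std_errors)
fragile = heterogeneity(fragile_labs, std_errors)

for label, (pooled, q, i2) in (("robust", robust), ("fragile", fragile)):
    print(f"{label}: pooled d = {pooled:.2f}, Q = {q:.1f}, I^2 = {i2:.0%}")
```

Both hypothetical sets average to the same pooled effect, but only the first gives a consistent answer from lab to lab. An average alone would not reveal the difference, which is why consistency across labs carries more information than the headline effect size.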
Sources
- Strack, F., Martin, L. L., & Stepper, S. (1988). Inhibiting and facilitating conditions of the human smile: A nonobtrusive test of the facial feedback hypothesis. Journal of Personality and Social Psychology, 54(5), 768-777. DOI: 10.1037/0022-3514.54.5.768 --- original paper.
- Wagenmakers, E.-J., et al. (2016). Registered Replication Report: Strack, Martin, & Stepper (1988). Perspectives on Psychological Science, 11(6), 917-928. DOI: 10.1177/1745691616674458 --- 17-lab RRR, null.
- Noah, T., Schul, Y., & Mayo, R. (2018). When both the original study and its failed replication are correct: Feeling observed eliminates the facial-feedback effect. Journal of Personality and Social Psychology, 114(5), 657-664. DOI: 10.1037/pspa0000121 --- camera-moderator test.
- Coles, N. A., Larsen, J. T., & Lench, H. C. (2019). A meta-analysis of the facial feedback literature: Effects of facial feedback on emotional experience are small and variable. Psychological Bulletin, 145(6), 610-651. DOI: 10.1037/bul0000194 --- broader meta-analysis.
- Coles, N. A., et al. (2022). A multi-lab test of the facial feedback hypothesis by the Many Smiles Collaboration. Nature Human Behaviour, 6, 1731-1742. DOI: 10.1038/s41562-022-01458-9 --- most recent multi-country test.
Related: Other Studies in This Series
This article is part of an ongoing series on famous behavioral-science studies that did not survive replication --- or that survived in modified form. Other entries cover the Stanford Prison Experiment, power posing, the marshmallow test, ego depletion, the bystander effect, and the Mozart Effect. The full hub lives at /replication-crisis/.
FAQ
So does smiling actually make you happier? A small effect probably exists. The Many Smiles 2022 multi-country study found modest but reliable effects when participants explicitly made happy faces or copied happy expressions. The effect is on the order of d ≈ 0.20 --- small enough that you probably won’t notice it in your own life, but large enough to be detectable in well-powered experiments. The “fake-smile your way to happiness” framing in pop-psychology overstates this; the underlying effect is real but modest.
Is the pen-in-mouth experiment still taught in textbooks? Many introductory psychology textbooks still teach it, often without the replication-failure update. Some updated editions have added the Wagenmakers 2016 RRR. The 2018 Noah paper and 2022 Many Smiles result have not yet propagated into most undergraduate teaching materials.
Was Strack vindicated by the camera-moderator finding? Partly. The Noah 2018 study did empirically support Strack’s specific hypothesis that camera presence disrupts the effect. But the broader 2019 meta-analysis did not show camera presence as a clean moderator across many studies, and the 2022 Many Smiles result found the pen-in-mouth paradigm unreliable even in carefully controlled conditions. Strack’s defense was substantive and partly validated, but it doesn’t fully rescue the original methodology.
What’s the difference between facial feedback and “embodied cognition” generally? Facial feedback is a specific claim --- facial muscle configurations affect emotional experience. Embodied cognition is a broader research program claiming that many bodily states (warmth, weight, posture) affect cognitive and emotional states. The broader embodied-cognition literature has had an even tougher time with replication than the facial-feedback subliterature. Many specific embodied-cognition findings (warm coffee = warmer judgments, heavy clipboards = more important judgments, sitting in a hard chair = more rigid negotiation) have failed to replicate.
If the effect is so fragile, why does it matter at all? The interesting result here is not that facial feedback is a powerful tool for changing your mood --- it isn’t. The interesting result is what the detective story tells you about behavioral-science evidence generally. Findings can be both real and fragile. Findings can depend on methodological details that no one anticipates being important. The discipline of evaluating evidence on robustness --- not just on effect size or original publication --- is what matters, and the pen-in-mouth saga is one of the clearest case studies available.