Most popular personality assessments — MBTI, DISC, Enneagram, Insights Discovery, True Colors — have weak or no empirical foundations. The Big Five (OCEAN) is the exception. Discovered through factor analysis of trait adjectives across decades of independent research traditions, replicated across more than 50 cultures, with consistent predictive validity for job performance, mortality, divorce, and dozens of other life outcomes — the Big Five is the rare personality model that survived the replication crisis nearly intact. This is the anti-example article in a hub full of takedowns: a calibration piece that says, honestly, here is what does work, here is why, and here is how to tell the difference between assessments worth using and assessments worth refusing.
If you have been reading through this replication-crisis hub, you have watched a long parade of canonical social-psychology findings get dismantled. Power posing did not survive Carney’s own recantation. Ego depletion collapsed in the Hagger 2016 multi-lab replication. Implicit-bias measurement (the IAT) turned out to have weak test-retest reliability and near-zero predictive validity for individual behavior. Multiple Intelligences (Gardner 1983) loaded on a single g factor when actually tested. Emotional Intelligence (Goleman 1995), as a distinct construct beyond IQ and the Big Five, mostly disappeared under jangle-fallacy scrutiny. The pattern is depressing enough that a rational reader might conclude that all of personality and individual-differences research is suspect.
That conclusion would be wrong, and this article exists to explain why.
Because in the same period that produced all those replication failures, one personality-research program kept holding up. It held up across cultures with completely different language families and religious traditions. It held up across measurement instruments: self-report, observer-report, peer-rating, and even text-based and behavioral-residue assessments converge on roughly the same structure. It held up in large-scale predictive studies where the dependent variable was not a paper-and-pencil outcome but actual job performance, actual mortality, actual divorce filings. It held up in Soto’s 2019 Life Outcomes of Personality Replication Project, where 87 percent of preregistered, high-powered direct replications of trait-outcome links came back statistically significant in the predicted direction, a hit rate that would be the envy of almost any other corner of social psychology. And the magnitude of the predictive effects was not embarrassingly small. Roberts and colleagues (2007) showed that personality traits predict mortality, divorce, and occupational attainment at effect sizes comparable to those of socioeconomic status and cognitive ability, two of the most heavily studied life-outcome predictors in all of social science.
The model that holds up this well is the Big Five, sometimes called the Five-Factor Model, or OCEAN after the initials of its five dimensions: Openness to experience, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. It is the standard model of personality in academic psychology. It is also, importantly, not the model behind most of what you are likely to encounter in a corporate training catalog, a recruiter’s intake form, or a leadership coaching package. MBTI, DISC, Enneagram, Insights Discovery, Predictive Index, True Colors, StrengthsFinder: the popular instruments are mostly typological, mostly not built from empirical factor analysis, mostly weak on test-retest reliability, and mostly unable to predict outcomes any better than chance once you control for the underlying Big Five traits they accidentally pick up on.
This article walks through how the Big Five was actually discovered, with the lexical-hypothesis tradition and the questionnaire-factor tradition converging on the same five factors, and what McCrae and Terracciano’s 50-cultures study actually demonstrated. It then covers the predictive evidence: what Barrick and Mount’s 1991 meta-analysis on job performance found and how subsequent meta-analyses extended it, what Roberts et al. 2007 showed about the predictive power of personality compared with SES and cognitive ability, and what Soto 2019 found when the Life Outcomes of Personality Replication Project tested 78 trait-outcome links in preregistered direct replications. It closes with what distinguishes the Big Five empirically from MBTI/DISC/Enneagram, what the honest limits of the framework are, and what all of this means for an operator evaluating personality assessments for hiring or coaching. The goal is calibration. If you spend a hub criticizing social science, you owe readers the parts that worked.
How The Big Five Was Discovered
The Big Five did not begin as a single elegant theory proposed by a single researcher. It emerged, slowly and somewhat reluctantly, from two completely independent research traditions that turned out to be measuring the same thing.
The first tradition is the lexical hypothesis, articulated most clearly by Sir Francis Galton in the late 1800s and developed methodologically by Gordon Allport and Henry Odbert in the 1930s. The lexical hypothesis says: the most important individual differences in human behavior will, over time, become encoded in everyday language as descriptive adjectives. If a personality trait matters enough that people repeatedly need to discuss it, the language will develop a word for it. Conversely, if no one bothers to coin a word for a trait, it probably does not matter much in daily life. So if you want a comprehensive map of personality dimensions, you can start with the dictionary.
Allport and Odbert did exactly that in 1936. They went through Webster’s Unabridged Dictionary and extracted roughly 18,000 English words referring to personal characteristics. Raymond Cattell, working in the 1940s and 1950s, took Allport and Odbert’s list, condensed it through clustering and rating procedures down to a smaller set of trait variables, and then ran factor analyses on the resulting ratings. Cattell ended up with 16 factors and built the 16PF questionnaire around them. But subsequent researchers re-analyzing his data and collecting new data with similar methods kept finding that Cattell’s 16 factors were not really 16 — they kept collapsing into roughly five higher-order factors. Tupes and Christal, working for the U.S. Air Force in 1961, were the first to clearly identify the five-factor structure. Warren Norman replicated it in 1963 across multiple samples. Lewis Goldberg systematized the lexical-tradition work through the 1980s and 1990s and is largely responsible for the modern “Big Five” terminology.
The second tradition is the questionnaire-factor tradition, which started from completely different theoretical premises. Hans Eysenck in the 1940s and 1950s built personality questionnaires based on biological and learning-theory hypotheses rather than on dictionary analysis. Eysenck initially proposed two factors (Extraversion and Neuroticism), later added Psychoticism. Other questionnaire builders — including Costa and McCrae, who began their joint program of personality research at the National Institute on Aging in the 1970s — kept finding that the dimensions captured by their instruments, when factor-analyzed, converged on a similar small set of higher-order traits.
The convergence moment came in the 1980s, when researchers comparing the lexical-tradition factors and the questionnaire-tradition factors realized they were measuring approximately the same five dimensions. The lexical work yielded Openness, Conscientiousness, Extraversion, Agreeableness, and Emotional Stability (Neuroticism reversed). The questionnaire work yielded Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. Two methodological traditions, completely different starting points, decades of independent work, converging on the same factor structure. That kind of convergence is the strongest evidence the field can produce that the construct being measured is real and not an artifact of a single instrument or sample.
The instrument that became the modern reference standard is the Revised NEO Personality Inventory (NEO-PI-R) developed by Paul Costa and Robert McCrae and described in Costa, P. T., & McCrae, R. R. (1992). Revised NEO Personality Inventory (NEO-PI-R) and NEO Five-Factor Inventory (NEO-FFI) Professional Manual. Psychological Assessment Resources. The NEO-PI-R measures each of the five domains plus six facets per domain, has high internal-consistency and test-retest reliability, has been validated against observer ratings and behavioral measures, and is the instrument most cross-cultural and predictive-validity studies subsequently used.
The Big Five was not invented. It was discovered, repeatedly, by independent research groups using different methods, and the structure replicated.
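The logic behind the lexical approach can be illustrated with a small simulation. The sketch below is a hypothetical illustration, not a reconstruction of any historical analysis: it assumes five independent latent traits, three adjective ratings per trait, and a uniform loading of 0.7, all of which are invented values. It does not run a full factor analysis. It only checks the correlation pattern that factor analysis summarizes: ratings driven by the same trait correlate with each other, and ratings driven by different traits do not.

```python
import random
from itertools import combinations

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def simulate_ratings(n_people=1500, n_traits=5, items_per_trait=3,
                     loading=0.7, seed=0):
    """Return one list of ratings per adjective (hypothetical data)."""
    rng = random.Random(seed)
    unique_sd = (1 - loading ** 2) ** 0.5  # keeps each rating at unit variance
    n_items = n_traits * items_per_trait
    columns = [[] for _ in range(n_items)]
    for _ in range(n_people):
        traits = [rng.gauss(0, 1) for _ in range(n_traits)]
        for item in range(n_items):
            trait = item // items_per_trait  # each adjective reflects one trait
            score = loading * traits[trait] + unique_sd * rng.gauss(0, 1)
            columns[item].append(score)
    return columns

def mean_correlations(columns, items_per_trait=3):
    """Average correlation within trait clusters and between them."""
    within, between = [], []
    for i, j in combinations(range(len(columns)), 2):
        r = pearson(columns[i], columns[j])
        if i // items_per_trait == j // items_per_trait:
            within.append(r)
        else:
            between.append(r)
    return sum(within) / len(within), sum(between) / len(between)

columns = simulate_ratings()
within_r, between_r = mean_correlations(columns)
print(f"mean within-cluster r  = {within_r:.2f}")
print(f"mean between-cluster r = {between_r:.2f}")
```

With these assumed settings the within-cluster correlations should sit near the square of the loading and the between-cluster correlations near zero. Factor analysis is, loosely, a systematic way of finding such clusters when nobody has labeled them in advance.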
Cross-Cultural Replication: McCrae & Terracciano 2005
A finding that replicates within one culture is interesting. A finding that replicates across cultures, especially across cultures with very different languages, religions, family structures, and economic systems, is much stronger evidence that you are measuring something approximating a real human universal rather than a quirk of one population.
The standard reference for the Big Five’s cross-cultural validity is McCrae, R. R., & Terracciano, A. (2005). “Universal features of personality traits from the observer’s perspective: Data from 50 cultures.” Journal of Personality and Social Psychology, 88(3), 547–561. DOI: 10.1037/0022-3514.88.3.547.
McCrae and Terracciano coordinated a project that recruited college students in 50 cultures spanning Europe, the Americas, Africa, Asia, and Oceania. Each student was asked to identify an adult or college-aged man or woman they knew well, and to rate that target on a third-person version of the Revised NEO Personality Inventory translated into the local language. In total, the study collected ratings of 11,985 targets across the 50 cultures.
The use of observer ratings rather than self-ratings is a methodologically important choice. Self-report personality measurements can be confounded by cultural differences in self-presentation, modesty norms, and acquiescence bias. Observer ratings substantially reduce, though do not eliminate, those confounds — what is being measured is how an informant describes a known target’s typical behavior, not how the target chooses to describe themselves.
The factor analyses within each of the 50 cultures showed that the five-factor structure that was originally derived from American self-report data was clearly replicated in most cultures and recognizable in all of them. The five factors emerged from the data in the same configuration of trait-adjective loadings; the same items loaded on the same factors in approximately the same magnitudes. Some cultures showed minor variations — a few items loaded slightly differently in some non-Western samples, and the boundaries between adjacent factors were occasionally fuzzier in some samples than in others — but the overall structure held across the entire sample.
This is not the same thing as saying that every human society has identical average levels of each trait. Average trait levels do vary by culture — some samples were on average more Extraverted, others more Agreeable, others more Open. What was universal was the structure — the fact that human personality variation, when measured well, organizes itself into five major dimensions in roughly the same configuration regardless of which culture the data comes from.
The 50-cultures study has been extended in subsequent years. Schmitt, Allik, McCrae, and Benet-Martínez led an even larger 56-nation study published in 2007. Bartram coordinated a study analyzing personality data from 30,000 candidates in over 30 countries. The five-factor structure has now been replicated across an extremely wide range of languages and cultures, including non-Western and pre-industrial societies. There are interesting open debates about whether the structure is equally clean in every cultural setting (the Tsimané forager-farmers of Bolivia, in one widely discussed paper, did not yield a clean Big Five structure on standard instruments). But the broad finding — that the same factor structure recurs across most human populations when measured carefully — is one of the most robust empirical results in personality psychology.
That is much stronger evidence of construct validity than anything available for MBTI, DISC, or any of the popular typological assessments.
Predictive Validity: Barrick & Mount, Roberts, And Soto
Discovery and cross-cultural replication establish that the Big Five is a real structure. The question that matters for any practical application is whether the structure predicts anything outside the questionnaire itself. The personality literature on this point is extensive and the headline result is clear: yes, and at magnitudes comparable to the most heavily studied predictors in social science.
The foundational predictive-validity paper on job performance is Barrick, M. R., & Mount, M. K. (1991). “The Big Five personality dimensions and job performance: A meta-analysis.” Personnel Psychology, 44(1), 1–26. DOI: 10.1111/j.1744-6570.1991.tb00688.x. Barrick and Mount aggregated 117 prior studies of personality and job performance across five occupational groups (professionals, police, managers, sales, and skilled/semi-skilled workers) and three performance criteria (job proficiency, training proficiency, and personnel data such as salary growth and tenure). The headline finding was that one Big Five dimension — Conscientiousness — predicted job performance across essentially every occupational group and every performance criterion, with corrected validity coefficients in the .20 to .25 range. Other Big Five dimensions predicted performance in specific occupational contexts (Extraversion for sales and management, Openness for training proficiency) but Conscientiousness was the universal predictor.
The 1991 meta-analysis has been extended many times. Salgado’s 1997 European meta-analysis replicated the Conscientiousness finding outside the U.S. Hurtz and Donovan in 2000 used a stricter Big Five operationalization and got similar results. Mount, Barrick, and Stewart in 1998 examined personality and performance in jobs involving interpersonal interaction and found Agreeableness adds predictive power in those settings. The conscientiousness-predicts-job-performance finding is one of the most replicated results in industrial-organizational psychology.
The broader life-outcomes case is made in Roberts, B. W., Kuncel, N. R., Shiner, R., Caspi, A., & Goldberg, L. R. (2007). “The power of personality: The comparative validity of personality traits, socioeconomic status, and cognitive ability for predicting important life outcomes.” Perspectives on Psychological Science, 2(4), 313–345. DOI: 10.1111/j.1745-6916.2007.00047.x. Roberts and colleagues compiled meta-analytic effect sizes for the predictive validity of Big Five personality traits, socioeconomic status, and cognitive ability across three major life outcomes: mortality (does the person die earlier or later than expected), divorce (does the marriage dissolve), and occupational attainment (where the person ends up in the labor market by middle age).
The result that surprised some readers was that the magnitudes of the predictive effects were comparable across the three predictor families. Personality traits (particularly low Conscientiousness for mortality, and low Agreeableness and high Neuroticism for divorce) predicted important life outcomes at effect sizes comparable in magnitude to those of SES (parental income and education) and cognitive ability (IQ). This was a striking result because SES and IQ have decades of well-funded research showing them to be powerful life-outcome predictors. The Roberts paper documented that the Big Five personality dimensions, measured well, sit in the same predictive-power league as the two most heavily studied non-personality predictors in social science.
The most recent and methodologically strongest replication of the personality-predicts-outcomes literature is Soto, C. J. (2019). “How replicable are links between personality traits and consequential life outcomes? The Life Outcomes of Personality Replication Project.” Psychological Science, 30(5), 711–727. DOI: 10.1177/0956797619831612. The Life Outcomes of Personality Replication Project (LOOPR) was a preregistered, large-sample direct replication of 78 previously published Big Five trait-outcome associations. Sample sizes were large (median N = 1,504), the statistical tests were preregistered, the analytic decisions were locked in before the data came in, and the project was deliberately designed to give the original findings a fair test.
The headline result was that 87 percent of the 78 replication attempts produced statistically significant effects in the originally reported direction. The effect sizes were typically about 77 percent as large as the original effects — meaning some shrinkage, the same shrinkage you would expect from any honest replication project, but not the collapse that has hit many other social-psychology programs. For comparison, the Open Science Collaboration’s 2015 mass replication of 100 psychology findings produced significant replications in roughly 36 percent of cases, with replication effect sizes about half the originals.
The Big Five trait-outcome literature replicated at more than twice the rate observed in the Open Science Collaboration’s broader sample of psychology findings, and with substantially less effect-size shrinkage. That is the kind of replication profile you want before trusting a research program enough to use it operationally.
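The comparison between the two replication projects is simple arithmetic, and it can be made explicit. The sketch below uses only the rounded figures quoted in this section. The count of successful replications is implied by the rounded percentage rather than taken from the paper, and "about half" is treated as exactly 0.50 for illustration.

```python
# Rounded figures quoted in the text above.
LOOPR_TESTS = 78            # trait-outcome links retested by Soto (2019)
LOOPR_RATE = 0.87           # share significant in the original direction
LOOPR_EFFECT_RATIO = 0.77   # replication effect size / original effect size
OSC_RATE = 0.36             # Open Science Collaboration (2015) success rate
OSC_EFFECT_RATIO = 0.50     # "about half", treated as exactly 0.50 here

def approximate_successes(n_tests: int, rate: float) -> int:
    """Number of successes implied by a rounded success rate."""
    return round(n_tests * rate)

loopr_successes = approximate_successes(LOOPR_TESTS, LOOPR_RATE)
rate_ratio = LOOPR_RATE / OSC_RATE
shrinkage_gap = LOOPR_EFFECT_RATIO - OSC_EFFECT_RATIO

print(f"implied successful replications: about {loopr_successes} of {LOOPR_TESTS}")
print(f"success-rate ratio (LOOPR / OSC): {rate_ratio:.2f}")
print(f"difference in retained effect size: {shrinkage_gap:.2f}")
```

The ratio comes out above 2, which is the basis for the "more than twice the rate" comparison.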
What Distinguishes The Big Five From MBTI, DISC, And Enneagram
A reasonable question at this point is: if the Big Five replicates so well, why is it not the dominant instrument in corporate hiring, leadership coaching, and team-building work? The honest answer is that the corporate personality-assessment market is dominated by instruments that are easier to sell, more entertaining to take, and less likely to surface uncomfortable individual differences — and that the empirical inferiority of those instruments to the Big Five is not generally known to the buyers paying for them.
The four most commercially dominant alternatives are the Myers-Briggs Type Indicator (MBTI), DISC, the Enneagram, and Insights Discovery. Each of them differs from the Big Five in ways that matter empirically.
MBTI is the most widely sold personality assessment in the world. It was developed in the 1940s by Katharine Briggs and Isabel Briggs Myers based on their reading of Carl Jung’s Psychological Types. It sorts test-takers into one of 16 four-letter types based on dichotomous scoring across four dimensions (E/I, S/N, T/F, J/P). The empirical problems with MBTI are well documented. Test-retest reliability at the type level is poor: somewhere between 39 percent and 50 percent of takers retest into a different type within a few weeks, which is incompatible with the claim that the types reflect stable personality structure. The dichotomous scoring (you are either an Extravert or an Introvert, never in between) is empirically wrong. The underlying distributions are continuous and unimodal, not bimodal, so the cutoff classification creates artificial sharp boundaries that do not exist in the data, and a small score change near a cutoff is enough to flip a letter. Predictive validity for job performance, leadership effectiveness, team functioning, or life outcomes is consistently weak in the peer-reviewed literature. MBTI’s marketing materials cite “research” extensively, but most of that research consists of internal validation studies by the publisher and is not equivalent to the independent academic literature on the Big Five.
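The link between dichotomous scoring and type instability can be worked out directly. The sketch below is an illustration under stated assumptions, not an analysis of MBTI data: it assumes each dimension's scores at two sittings are bivariate normal, that the cutoff sits at the median, and that the four dimensions flip independently. The retest correlations fed in are hypothetical values, not published MBTI reliabilities. Under those assumptions the chance that a median-split label changes is arccos(r) / pi.

```python
import math

def flip_probability(retest_r: float) -> float:
    """Chance a median-split label changes between two sittings,
    assuming bivariate-normal scores with correlation retest_r."""
    return math.acos(retest_r) / math.pi

def type_change_probability(retest_r: float, n_dimensions: int = 4) -> float:
    """Chance that at least one of n independent dichotomies flips."""
    stays_same = (1 - flip_probability(retest_r)) ** n_dimensions
    return 1 - stays_same

# Hypothetical per-dimension retest correlations, for illustration only.
for r in (0.80, 0.90, 0.95):
    print(f"retest r = {r:.2f}: "
          f"one letter flips with p = {flip_probability(r):.3f}, "
          f"four-letter type changes with p = {type_change_probability(r):.3f}")
```

Under these assumptions, a per-dimension retest correlation of 0.90, which would count as good reliability for a continuous scale, still implies that roughly 46 percent of takers change at least one letter. The instability comes from the cutoff, not necessarily from the underlying scores.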
DISC assessments, built on William Marston’s 1928 theoretical framework and now usually labeled Dominance, Influence, Steadiness, and Conscientiousness, share many of MBTI’s empirical problems. The DISC dimensions partly overlap with two of the Big Five (Extraversion and Conscientiousness) and partly do not map cleanly onto anything well-validated. The academic literature on DISC’s predictive validity is thin and inconsistent. DISC is dominant in sales training and team workshops not because of its predictive power but because of its simplicity, the ease of generating colorful reports, and the well-developed network of certified DISC trainers selling the product.
The Enneagram has the weakest empirical foundation of any widely sold personality assessment. It claims to identify nine basic personality types with deep roots in spiritual traditions, was popularized in the 1970s through writings by Oscar Ichazo and Claudio Naranjo, and has effectively no peer-reviewed empirical literature establishing the reality of the nine types or the predictive validity of the typology. Enneagram type assignments are not stable across retests, the type categories do not emerge from factor analyses of behavioral data, and there is no convincing evidence that Enneagram type predicts any practical outcome better than chance.
Insights Discovery and similar four-color personality systems (True Colors, Hartman Color Code, etc.) are commercial rebrandings of Jungian typology, generally lacking peer-reviewed validation studies and sharing the same dichotomous-classification problems as MBTI.
The empirical distinctions are clear and matter. The Big Five was discovered by factor analysis of behavior-relevant trait language; the popular alternatives were proposed by individual theorists from non-empirical starting points. The Big Five is dimensional (every person is somewhere on a continuum on each of five dimensions); the popular alternatives are typological (every person is sorted into one of N discrete categories). The Big Five has high test-retest reliability; the popular alternatives mostly do not. The Big Five predicts job performance, mortality, divorce, and dozens of other life outcomes at meta-analytic effect sizes comparable to IQ and SES; the popular alternatives mostly do not predict anything once you control for the Big Five variance they accidentally capture.
If your organization is paying for personality assessments and the instrument is not a Big Five derivative (NEO-PI-R, IPIP-NEO, BFI-2, HEXACO, or a validated commercial Big Five such as the Hogan Personality Inventory or 16PF in its current form), the assessment is almost certainly not worth what you are paying for it.
Honest Limits
This is an anti-example article and the point is to argue that the Big Five replicates and predicts well. But intellectual honesty requires flagging the legitimate limits of the framework.
First, the predictive effect sizes are real but modest at the individual level. When the Big Five literature describes a “robust” effect, it typically means a correlation in the .15 to .30 range — meaningful and decision-relevant in the aggregate, but not strong enough to forecast individual behavior with high confidence. A high-Conscientiousness applicant is, on average, a better-performing hire than a low-Conscientiousness applicant of equivalent qualifications, but the prediction is probabilistic and individual-specific factors swamp the trait signal in any single case. The right framing is “use Big Five as one input among several,” not “use Big Five to definitively rank candidates.”
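One way to see what a correlation in this range means for a single decision is to convert it into two more intuitive quantities. The sketch below assumes trait and outcome are bivariate normal, which is a simplification. It computes the share of outcome variance explained (r squared) and the probability that the higher-scoring of two randomly chosen people also has the better outcome, which under that assumption equals 0.5 + arcsin(r) / pi.

```python
import math

def variance_explained(r: float) -> float:
    """Share of outcome variance accounted for by the trait."""
    return r ** 2

def pairwise_accuracy(r: float) -> float:
    """Probability that the higher scorer of two random people also
    has the higher outcome, assuming bivariate-normal data."""
    return 0.5 + math.asin(r) / math.pi

for r in (0.15, 0.30):
    print(f"r = {r:.2f}: variance explained = {variance_explained(r):.1%}, "
          f"pairwise accuracy = {pairwise_accuracy(r):.1%}")
```

At r = .30 the trait explains 9 percent of outcome variance, and picking the higher scorer of two candidates is right about 60 percent of the time. At r = .15 the figure is about 55 percent. That is better than a coin flip and valuable across many hires, but far from a guarantee in any one case.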
Second, the Big Five describes structure, not causes. The model tells you that human personality variation organizes itself into five dimensions and that those dimensions predict outcomes. It does not tell you why the dimensions exist, what their biological or developmental causes are, or how much of the trait variance is genetic versus environmental. Behavior-genetic studies suggest substantial heritability for all five traits (somewhere around 40-50 percent of trait variance is genetic in twin-study designs), but the causal architecture is not part of what the Big Five model itself claims.
Third, cultural variation in factor purity exists. The 50-cultures study showed the five-factor structure replicates across most cultures, but the replication is not perfect everywhere. Some non-Western samples show fuzzier boundaries between adjacent factors; the Tsimané study in Bolivia found a structure with fewer than five clean factors. The Big Five is the best-replicated personality model in cross-cultural research but not a perfect universal in every population.
Fourth, measurement quality matters more than most users realize. Short Big Five inventories (10-item BFI, 60-item NEO-FFI) are useful for research and screening but lose substantial information compared with the full 240-item NEO-PI-R or the IPIP-NEO. Self-report measures can be confounded by impression management in high-stakes contexts (hiring). Multi-method assessment (combining self-report with observer ratings or behavioral data) gives substantially better signal than self-report alone. If you are deploying a Big Five assessment for hiring, the instrument and the measurement context matter as much as the underlying construct.
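The cost of short forms can be approximated with the Spearman-Brown formula, a standard psychometric result that predicts scale reliability from the number of items. The sketch below is a rough illustration only. The single-item reliability of 0.30 is a hypothetical value, the formula assumes all items are equally good (parallel), and the items-per-domain counts simply divide each inventory's length by five domains.

```python
def spearman_brown(single_item_reliability: float, n_items: int) -> float:
    """Predicted reliability of a scale built from n parallel items."""
    r = single_item_reliability
    return n_items * r / (1 + (n_items - 1) * r)

SINGLE_ITEM_RELIABILITY = 0.30  # hypothetical, for illustration only

# Items per domain = total items / 5 domains.
items_per_domain = {
    "10-item inventory": 2,
    "60-item inventory": 12,
    "240-item inventory": 48,
}

for label, k in items_per_domain.items():
    predicted = spearman_brown(SINGLE_ITEM_RELIABILITY, k)
    print(f"{label}: {k} items per domain, predicted reliability = {predicted:.2f}")
```

Real short forms are built by selecting the best-performing items, so they usually do better than this naive projection suggests. The direction of the effect is the point: fewer items per domain means noisier domain scores, and facet-level scores are not available at all.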
Fifth, the Big Five is the best-replicated personality model but not the only credible one. The HEXACO model adds a sixth factor (Honesty-Humility) and has reasonable empirical support for that addition; some researchers prefer it. There is ongoing scholarly debate about whether the five-factor solution is the right level of granularity or whether higher-order or lower-order solutions better describe the data. The Big Five is the established consensus model and the right default choice, but the field is not closed.
These limits are real and worth acknowledging. They do not, however, undermine the basic case: the Big Five is the most empirically supported personality model available, by a wide margin, and the popular typological alternatives are not in the same evidence league.
What This Means For Hiring And Coaching Programs
For an operator making real decisions about personality assessment in a hiring or coaching program, the implications of the empirical literature are reasonably clear.
Use validated Big Five instruments. The defensible options in commercial use today include the NEO-PI-R (the academic reference standard), the IPIP-NEO (public domain, free to use), the BFI-2 (well-validated short-form), the Hogan Personality Inventory (commercial Big Five derivative widely used in industrial-organizational settings), and the HEXACO Personality Inventory (the six-factor extension). Each of these has independent peer-reviewed validation studies and supports defensible inferences about candidate or coachee personality. If you are working with an external assessment vendor, ask whether the instrument has been factor-analytically validated, whether it produces dimensional rather than typological scores, and whether independent (non-vendor) academic studies establish predictive validity. If the vendor cannot answer these questions cleanly, the assessment is probably not worth the budget.
Use Conscientiousness specifically for performance prediction. Across the meta-analytic literature, Conscientiousness is the single most reliable Big Five predictor of job performance across roles. Other traits add incremental validity in specific contexts (Extraversion for customer-facing and leadership roles, Agreeableness for team-based work, low Neuroticism for high-stress roles, Openness for innovation and training-intensive contexts). But if you can use only one trait, use Conscientiousness.
Use Big Five scores as one input among several, not as a primary screen. The predictive validities, while real, are modest. Combining a structured Big Five assessment with structured interviews, work samples, cognitive-ability measurement, and reference checks produces meaningfully better hiring outcomes than any one method alone. Pure Big Five screening risks both false positives (high-Conscientiousness applicants who fail in role for non-personality reasons) and false negatives (lower-Conscientiousness applicants who succeed due to compensating strengths).
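The gain from combining methods follows from the standard formula for the multiple correlation of two predictors. The sketch below is an illustration with invented numbers: the validities of 0.25 and 0.40 and the overlap of 0.10 are hypothetical values chosen for the example, not estimates from any study cited here.

```python
import math

def multiple_r(r1: float, r2: float, r12: float) -> float:
    """Multiple correlation of an outcome with two predictors.

    r1, r2: each predictor's correlation with the outcome.
    r12:    correlation between the two predictors.
    """
    if not -1 < r12 < 1:
        raise ValueError("r12 must be strictly between -1 and 1")
    r_squared = (r1 ** 2 + r2 ** 2 - 2 * r1 * r2 * r12) / (1 - r12 ** 2)
    return math.sqrt(r_squared)

# Hypothetical values, for illustration only.
personality_validity = 0.25   # trait score vs. job performance
interview_validity = 0.40     # second method vs. job performance
predictor_overlap = 0.10      # correlation between the two predictors

combined = multiple_r(personality_validity, interview_validity, predictor_overlap)
print(f"personality alone: {personality_validity:.2f}")
print(f"second method alone: {interview_validity:.2f}")
print(f"both combined: {combined:.2f}")
```

Because the two hypothetical predictors overlap only slightly, each contributes information the other lacks, and the combined validity exceeds either one alone. The more two methods measure the same thing, the smaller that gain becomes.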
Reject typological assessments as primary tools. If your organization is currently spending six figures annually on MBTI training, DISC certifications, or Enneagram coaching, that budget is not generating commensurate organizational value relative to what a Big Five-based alternative would generate. These programs persist in corporate settings largely for cultural and ritual reasons, not because they outperform validated alternatives. The internal political cost of removing them can be significant, but the empirical case for the switch is unambiguous.
Be explicit about what assessments can and cannot do. Personality assessment is useful for predicting average-case behavior across many decisions but is poor at predicting specific decisions in specific contexts. Position personality data appropriately in your hiring and coaching processes: useful for “this candidate is, on average, likely to behave in these ways across many work situations,” not useful for “this candidate will definitely succeed in this specific role.” Overclaiming predictive power, even for validated instruments, sets expectations the data cannot meet.
The most expensive personality-assessment mistake an organization can make is paying for an assessment that does not predict anything and then making real hiring and development decisions based on the output. The empirically supported alternative is available, well-documented, and often cheaper than the typological products it would replace.
What This Anti-Example Tells Us About Personality Research Overall
The Big Five is in this hub as a calibration example, alongside the Default Effect article and a small number of others. The point is to articulate what a robust social-science finding actually looks like, so that the contrast with the many failed findings catalogued elsewhere in the hub is sharp.
The pattern that distinguishes the Big Five from MBTI, DISC, Enneagram, Multiple Intelligences, and Emotional Intelligence is consistent. Successful programs have operational definitions that allow direct measurement — the Big Five was operationalized through factor-analyzed inventories with established reliability and validity, while MBTI relies on a self-classification logic with poor test-retest behavior. Successful programs have factor structure that emerges from data, not factor structure imposed by theory — the Big Five emerged from factor-analyzing trait adjectives and questionnaire items, while the Enneagram’s nine types were proposed by individual theorists and never derived from data. Successful programs have cross-cultural replication of the underlying structure, not just within-culture validation — the Big Five replicates across 50+ cultures, while DISC has essentially no equivalent cross-cultural validation literature. Successful programs have predictive validity for outcomes that matter, established by independent researchers and not just by the assessment vendor — the Big Five predicts job performance, mortality, divorce, and dozens of other outcomes in peer-reviewed independent studies, while typological assessments mostly do not.
Operational definition. Factor structure from data. Cross-cultural replication. Predictive validity. These are the four hallmarks of a personality construct worth taking seriously. The Big Five has all four. The popular alternatives have one or two at best. That difference is the difference between research worth using and research worth refusing.
For a CEO, founder, or HR leader evaluating which personality programs to fund and which to retire, this is the diagnostic to apply. If the assessment cannot show its working on all four dimensions, it does not belong in your decision stack. The Big Five can.
Sources
- Allport, G. W., & Odbert, H. S. (1936). “Trait-names: A psycho-lexical study.” Psychological Monographs, 47(1), i–171. DOI: 10.1037/h0093360
- Barrick, M. R., & Mount, M. K. (1991). “The Big Five personality dimensions and job performance: A meta-analysis.” Personnel Psychology, 44(1), 1–26. DOI: 10.1111/j.1744-6570.1991.tb00688.x
- Costa, P. T., & McCrae, R. R. (1992). Revised NEO Personality Inventory (NEO-PI-R) and NEO Five-Factor Inventory (NEO-FFI) Professional Manual. Psychological Assessment Resources.
- Goldberg, L. R. (1990). “An alternative ‘description of personality’: The Big-Five factor structure.” Journal of Personality and Social Psychology, 59(6), 1216–1229. DOI: 10.1037/0022-3514.59.6.1216
- John, O. P., Naumann, L. P., & Soto, C. J. (2008). “Paradigm shift to the integrative Big Five trait taxonomy: History, measurement, and conceptual issues.” In Handbook of Personality: Theory and Research (3rd ed.). Guilford Press.
- McCrae, R. R., & Terracciano, A. (2005). “Universal features of personality traits from the observer’s perspective: Data from 50 cultures.” Journal of Personality and Social Psychology, 88(3), 547–561. DOI: 10.1037/0022-3514.88.3.547
- Open Science Collaboration. (2015). “Estimating the reproducibility of psychological science.” Science, 349(6251), aac4716. DOI: 10.1126/science.aac4716
- Roberts, B. W., Kuncel, N. R., Shiner, R., Caspi, A., & Goldberg, L. R. (2007). “The power of personality: The comparative validity of personality traits, socioeconomic status, and cognitive ability for predicting important life outcomes.” Perspectives on Psychological Science, 2(4), 313–345. DOI: 10.1111/j.1745-6916.2007.00047.x
- Salgado, J. F. (1997). “The five factor model of personality and job performance in the European Community.” Journal of Applied Psychology, 82(1), 30–43. DOI: 10.1037/0021-9010.82.1.30
- Soto, C. J. (2019). “How replicable are links between personality traits and consequential life outcomes? The Life Outcomes of Personality Replication Project.” Psychological Science, 30(5), 711–727. DOI: 10.1177/0956797619831612
- Tupes, E. C., & Christal, R. E. (1961). “Recurrent personality factors based on trait ratings.” Technical Report ASD-TR-61-97, U.S. Air Force, Lackland Air Force Base, TX.
Related
- Replication Crisis Hub — index of evaluated findings, including the small set of constructs (Big Five, default effect, IQ structure) that survived methodological scrutiny.
- Howard Gardner’s Multiple Intelligences — the educational typology that promised eight intelligences and turned out to load on a single general factor when actually tested. The clearest cousin of “this looks like personality but is not validated personality.”
- Daniel Goleman’s Emotional Intelligence — the corporate-leadership construct that mostly disappears under jangle-fallacy scrutiny once you control for IQ and the Big Five.
- Grit / Duckworth’s Oversold Claims — the celebrity construct that turned out to be essentially redundant with Big Five Conscientiousness once they were jointly measured.
- The Halo Effect — the rater-bias phenomenon that helps explain why some personality assessments produce inflated within-rater correlations and why multi-source measurement matters.
- IAT Implicit Bias Measurement — another widely sold individual-differences construct, with much weaker reliability and predictive validity than the Big Five despite far more cultural prominence.
FAQ
Q: What about MBTI? It is the most popular personality assessment in the world.
Popularity is not the same as validity. MBTI has substantial test-retest unreliability (35 to 50 percent of takers retest into a different type within weeks), forces dichotomous classifications onto what are actually continuous distributions, and consistently underperforms the Big Five in predictive-validity studies for job performance, leadership effectiveness, and life outcomes. MBTI is dominant in corporate training because it is easy to brand, easy to deliver, and the resulting type assignments are flattering and non-threatening to test-takers, not because it measures personality well. The peer-reviewed academic literature on MBTI is small and unfavorable; most published “research” cited in MBTI marketing is from the publisher itself.
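A short calculation shows why forcing dichotomies onto continuous scores produces type changes on retest. The sketch below is a simplified model, not an analysis of MBTI data: it assumes each underlying score is bivariate normal across the two sittings, each cut sits at the median, and the four dichotomies are independent with the same retest correlation. The function names and the correlation values are hypothetical placeholders.

```python
import math


def same_side_probability(retest_r: float) -> float:
    """Chance that one continuous score lands on the same side of a median cut
    at both sittings, given test-retest correlation retest_r (bivariate normal)."""
    if not -1.0 <= retest_r <= 1.0:
        raise ValueError("retest_r must be between -1 and 1")
    return 0.5 + math.asin(retest_r) / math.pi


def type_change_probability(retest_r: float, n_dichotomies: int = 4) -> float:
    """Chance that at least one of n independent dichotomies flips on retest."""
    return 1.0 - same_side_probability(retest_r) ** n_dichotomies


if __name__ == "__main__":
    # Illustrative retest correlations only; not published MBTI reliabilities.
    for r in (0.7, 0.8, 0.9):
        print(f"r={r:.1f}  P(type changes)={type_change_probability(r):.2f}")
```

Under these assumptions, a continuous-score retest correlation of 0.8 already implies that roughly 60 percent of people change at least one letter. The figure is a property of the toy model, not an estimate of MBTI's actual retest rate, but it shows how scores that are respectably stable as continua can still yield unstable types once they are cut at the midpoint.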
Q: What about DISC? Is it as bad as MBTI?
DISC shares many of MBTI’s problems. The four DISC dimensions partly overlap with two of the Big Five (Extraversion and Conscientiousness), but the typological scoring, the empirical disconnect between the proposed dimensions and factor-analyzed personality structure, and the thin peer-reviewed validation literature make DISC a weaker choice than any Big Five-based alternative. DISC is widespread in sales training and team workshops mainly because it is simple to teach and the certification ecosystem is well-developed, not because it predicts behavior better than what it would replace.
Q: Is the Big Five universal across all cultures?
The five-factor structure has been replicated across 50+ cultures (McCrae & Terracciano 2005) and extended in subsequent multi-nation studies. The replication is robust in most populations including Western, East Asian, Latin American, African, and Eastern European samples. There are interesting open scholarly questions about whether the structure is equally clean in every cultural setting — a study of the Tsimané forager-farmers in Bolivia did not produce a clean five-factor solution, raising legitimate questions about generalizability to non-industrialized small-scale societies. But for any context that includes participants from large-scale literate societies, the Big Five is the best-replicated personality structure available and the right default model.
Q: Should I use Big Five assessments in hiring?
Yes, with appropriate framing and as one input among several. Validated Big Five instruments (NEO-PI-R, IPIP-NEO, Hogan Personality Inventory, BFI-2) have meta-analytic predictive validities for job performance that are real, replicated, and decision-relevant. Conscientiousness is the most consistent predictor across job families; other traits add incremental validity in specific roles. Combine Big Five data with structured interviews, work samples, cognitive-ability assessment, and reference checks — that combination outperforms any single method. Do not use Big Five as a primary screen on its own; the effect sizes are real but modest at the individual level.
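The claim that a combination of methods outperforms any single method can be illustrated with the standard two-predictor multiple-correlation formula, R² = (r1² + r2² − 2·r1·r2·r12) / (1 − r12²). The sketch below is a minimal illustration: the function name is hypothetical, and the validity and intercorrelation values are placeholders chosen for the arithmetic, not meta-analytic estimates.

```python
import math


def combined_validity(r1: float, r2: float, r12: float) -> float:
    """Multiple correlation R of an outcome with two predictors.

    r1, r2: each predictor's correlation with the outcome.
    r12:    the correlation between the two predictors.
    """
    if not -1.0 < r12 < 1.0:
        raise ValueError("r12 must be strictly between -1 and 1")
    r_squared = (r1**2 + r2**2 - 2 * r1 * r2 * r12) / (1 - r12**2)
    return math.sqrt(r_squared)


if __name__ == "__main__":
    # Placeholder inputs: a personality scale (0.25) and a second method (0.40)
    # that overlap only slightly with each other (0.10).
    print(round(combined_validity(0.25, 0.40, 0.10), 3))
```

With these placeholder inputs the combined validity comes out near 0.45, higher than either predictor alone. The gain shrinks as the predictors overlap more, which is why a personality measure adds the most alongside methods that capture something different, such as work samples or cognitive-ability tests.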
Q: What is the difference between the Big Five and the Five-Factor Model?
The terms are essentially synonymous in current usage. The “Big Five” terminology comes from the lexical tradition (Goldberg) and emphasizes the five-factor structure as discovered through factor analysis of trait adjectives. The “Five-Factor Model” terminology comes from the questionnaire tradition (Costa and McCrae) and emphasizes the five-factor structure as instantiated in inventories like the NEO-PI-R. The two traditions converged on the same five dimensions, so the labels are interchangeable in modern practice.
Q: Is the Big Five better than HEXACO?
This is an open scholarly question. HEXACO adds a sixth factor (Honesty-Humility) that captures meaningful variance in honesty, modesty, fairness, and lack of greed — variance that the Big Five Agreeableness factor only partially captures. There is reasonable empirical support for HEXACO as an alternative to the Big Five, particularly for predicting outcomes related to integrity, ethical behavior, and dark-side personality. For most general personality-assessment purposes, the Big Five remains the established consensus and either model is a defensible choice over typological alternatives.
Q: How accurate is a free 10-question Big Five quiz online?
Short Big Five inventories (such as the 10-item BFI-10 and the 10-item TIPI) are reasonably good for research screening and rough self-knowledge but lose meaningful information compared with full-length instruments. Test-retest reliabilities are lower, internal consistency is lower, and the ability to distinguish closely related facets within each domain is essentially absent. For hiring or coaching decisions, use a full-length validated instrument (the NEO-PI-R or equivalent). For casual self-exploration, a short online Big Five inventory is much better than an MBTI quiz but should not be over-interpreted.
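The reliability cost of shortening a scale can be approximated with the Spearman-Brown prophecy formula, r_new = k·r / (1 + (k − 1)·r), where k is the ratio of new length to old length. The sketch below assumes the items are interchangeable (parallel), which real short forms only approximate; the function name, the 12-item starting length, and the 0.85 starting reliability are hypothetical values for illustration.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability after changing test length by length_factor.

    reliability:   reliability of the original scale (between 0 and 1).
    length_factor: new item count divided by original item count.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


if __name__ == "__main__":
    # Hypothetical: a 12-item domain scale with reliability 0.85, cut to 2 items
    # (two items per domain is what a 10-item, five-domain inventory allows).
    print(round(spearman_brown(0.85, 2 / 12), 2))
```

Under these assumptions the two-item version lands near 0.49. Carefully built short forms select their best items and can do somewhat better than this, but the direction of the effect is why a 10-item inventory suits rough self-knowledge and not individual-level decisions.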
Q: Did the replication crisis affect Big Five research?
Yes, but the Big Five literature came through the replication crisis substantially better than most of social psychology. The Soto 2019 Life Outcomes of Personality Replication Project preregistered direct replications of 78 trait-outcome links and found 87 percent replicated significantly in the predicted direction, with effect sizes about 77 percent of original magnitudes. Compared with the 36 percent replication rate found in the Open Science Collaboration’s 2015 mass replication, the Big Five literature is unusually robust. The structural finding (factor analyses reliably yield five factors across cultures) is the most replicated result in personality psychology. Some specific trait-outcome links and some boutique applications of Big Five measurement have failed individual replications, but the core framework has held up.