On High-Range Test Construction 24: Alex Tolio
Publisher: In-Sight Publishing
Publisher Founding: March 1, 2014
Web Domain: http://www.in-sightpublishing.com
Location: Fort Langley, Township of Langley, British Columbia, Canada
Journal: In-Sight: Independent Interview-Based Journal
Journal Founding: August 2, 2012
Frequency: Three (3) Times Per Year
Review Status: Non-Peer-Reviewed
Access: Electronic/Digital & Open Access
Fees: None (Free)
Volume Numbering: 13
Issue Numbering: 1
Section: E
Theme Type: Idea
Theme Premise: “Outliers and Outsiders”
Theme Part: 32
Formal Sub-Theme: High-Range Test Construction
Individual Publication Date: November 22, 2024
Issue Publication Date: January 1, 2025
Author(s): Scott Douglas Jacobsen
Word Count: 2,534
Image Credits: Photo by Shubham Dhage on Unsplash.
International Standard Serial Number (ISSN): 2369-6885
Please see the footnotes, bibliography, and citations after the publication.
Abstract
Alex Tolio is someone interested in I.Q. tests and high-range test construction. He discussed common errors in creating valid, reliable high-range IQ tests, highlighting test constructors’ biases, insufficient abstraction, and questions overly reliant on novelty or complexity. Tolio notes that a balance between creativity and clarity is essential to prevent ambiguous items. He suggests beta-testing, revision, and removing unnecessary elements to reduce bias. He advocates for heterogeneous tests to best capture the general intelligence factor (g) and highlights the importance of correlating new tests with validated ones. Confirmatory factor analysis and careful data preparation are key for precise measurement. Feedback issues often involve claims of ambiguous items.
Keywords: bias in test creation, abstraction importance, heterogeneous tests, test validity, construct validity, confirmatory factor analysis, reducing ambiguity.
On High-Range Test Construction 24: Alex Tolio
Scott Douglas Jacobsen: What are common mistakes in trying to make high-range tests valid, reliable, and robust?
Alex Tolio: I suppose the most common mistake may be one’s own biases about what one considers valid.
While there is some objectivity in what parameters constitute a good test, there does not seem to be any set consensus on what is “absolutely needed” to measure IQs supposedly above 3-4 SD.
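For context, 3-4 SD corresponds to roughly IQ 145-160 on a 15-point scale. A quick normal-curve sketch in Python (standard textbook rarities, not data from any particular test) shows how scarce candidates at these levels are, which is part of why consensus is so hard to reach:

```python
from scipy.stats import norm

# Rarity of scores 3-4 SD above the mean on a 15-point IQ scale
# (IQ 145 and IQ 160, respectively) under a normal distribution.
for sd in (3, 4):
    p = norm.sf(sd)  # upper-tail probability beyond `sd` standard deviations
    print(f"+{sd} SD (IQ {100 + 15 * sd}): about 1 in {round(1 / p):,}")
```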
I can only say with confidence that increased abstraction (rather than speed) has to be present, but there is a fine line between what is truly abstract and what is simply convoluted.
Because of this, test creators (including myself) are likely embedding their own biases within the questions themselves, which ultimately reflects one’s own attitude toward measuring very high IQ. In other words, what one considers great items, or even a great format, is already biased.
I believe this can manifest itself in many forms, such as:
- A tendency toward creating items that are inherently polluted by, or prone to, external factors. Examples include unnecessary layers added in hopes of increasing difficulty, which, when done poorly, may promote false novelty at the cost of quality. The main flaw of such items is that a conscientious yet less capable individual may be able to untangle them given enough time. This means the difficulty of the item is shallow: since it can be sufficiently understood with enough time, it is merely a product of the layers beneath which the true solution lies.
- Implementation of too much creativity. I’ve found this counter-intuitive to creating a solid test, but I believe there needs to be a balance. In the pursuit of novelty, one faces a large risk of esotericism, which brings its own set of problems. Such items are also prone to interpretation, which ultimately hurts a test’s quality, because reliability is primarily based on the responses given by the candidates who answered an item. Since reliability is a product of this, such questions dramatically lower consistency because of the variety of responses received. I believe there are numerous tests for which this is an issue, and it is possibly a reason this statistic goes unreported.
To note, a test may be good in and of itself, but the sample to which it is administered will have a direct effect on its quality.
- Misjudging what makes a question sufficiently discriminating may lead a test author to produce a test that is too difficult.
- Some test authors are not attempting to create tests that measure IQ; the test is rather a showcase of their own ability, in which superfluous and unreasonably difficult items are present.
- Recycling one’s own questions (something I’m equally guilty of)!
- Poor balance of difficulty within a test, or difficulty that is too “graded.” If the difficulty is too calculated, or a steep, sudden increase occurs, a common “wall” may form at a certain raw score, into which people within a ballpark range of ability all fall, and the test may cease to discriminate further.
A good example is when a norm jumps almost 10-15 points per raw score; this means that the people at that raw score are not actually distributed across that range (a toy sketch of this appears after this list).
- Creating too many difficult questions does not, by itself, make for a great test as a whole.
- Possibly the repetitive use of patterns one has seen, thought of, or previously used in tests. This creates an author-specific learning effect: once such patterns have been solved, familiarity with them means the items no longer require a sufficient amount of ability.
I believe this is also a mistake I tend to make. However, I believe it is ultimately unavoidable as one proceeds through an author’s tests. One will have to exert the most effort in the first couple of tests, which may truly measure “IQ.” But as one takes multiple tests by the same author, one’s scores are prone to increase. I truly think this reflects that tests may cease to be robust in particular cases; this seems to be irrespective of the quality of the author’s work, but rather related to becoming more aware of what a certain author may ask, developing an intuition for it, perhaps eventually reverse-engineering how the author tends to think, and general familiarity. This, for me, highlights the human error present in such tests, which makes for imperfect measurement, and possibly invalid measurement after a certain period of time.
- Test questions that do not discriminate properly because of flawed construction. This ties in well with shallow questions. I believe questions that discriminate poorly are likely the ones most polluted by the external factors they require, on top of core pattern recognition.
Those factors may ultimately eliminate the quality of, and the (albeit not absolute) need for, pattern recognition.
- Poor items may result in an “artificial ceiling,” whereby the test does not bear the ability to truly measure up to the levels it reports.
- False or manipulated statistics (or the lack of statistics altogether).
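To make the norm-jump point above concrete, here is a toy Python sketch; the norm table and candidate IQs are invented for illustration and not drawn from any real test:

```python
# A toy norm with ~12-point jumps per raw score (all numbers invented).
norm_table = {30: 140, 31: 152, 32: 164}  # raw score -> assigned IQ

# Candidates whose true IQs span 141-151 all land on raw 30, so the
# test reports IQ 140 for an 11-point band of ability: no discrimination.
for true_iq in (141, 144, 147, 151):
    raw = max(r for r, iq in norm_table.items() if iq <= true_iq)
    print(f"true IQ {true_iq} -> raw {raw} -> reported IQ {norm_table[raw]}")
```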
Jacobsen: What are the core abilities measured at the higher ranges of intelligence or as one attempts to measure in the high-range of ability?
Tolio: I believe what is unitary across such tests is their increased demand for abstraction.
This means the focus shifts to depth rather than speed, which challenges the strength of one’s understanding.
Because understanding requires pattern recognition item by item, a candidate’s answer directly reflects their understanding of an item.
Assuming the answer is genuinely unique, discrimination should occur naturally, a phenomenon best seen in the (likely) common understanding of items among candidates of different ability levels.
This does not necessarily imply there truly is an alternative (or weaker) answer; it is rather a consequence of insufficient understanding, from which ambiguity may seem apparent.
However, it is also possible that a weaker “common” answer does exist, which ideally should be eliminated.
The strictness of the answer is perhaps also a prerequisite: candidates of higher ability tend to be extremely rigorous and do not rush to think they have “the” answer.
In short, the core abilities are depth of understanding (reason), pattern recognition, a higher capacity for abstraction, and often the inherent divergence that may result.
Pattern recognition of the highest form, in my opinion, is not about recognizing a pattern amid all the chaos (or noise), but about proposing structure to it, such that it is sufficiently and simply understood.
This means that unnecessary elements implemented in questions contradict this notion, and presentation ought to be clean.
If an element does not bear meaning, or is not useful in any particular way, it is probably best removed; this may also reduce the item’s likelihood of loading on external, polluting factors.
Jacobsen: How do you remove or minimize test constructor bias from tests?
Tolio: It is not possible to remove it absolutely, but certain measures may help minimize it:
- Questions should (ideally) be beta-tested before release.
- A conscious effort at elimination by the author, with rigorous revision of items.
- Creating questions that are more pure, so that cultural differences are minimized.
- Exclusion of esotericism and the inherent creativity that comes with it.
- Removal of unnecessary elements, promoting a clean presentation of an item.
- Remembering that too many clues may be as good as none.
- Very careful implementation of novelty, preferring a more universally understood form of expression.
- Eliminating the projection of one’s own subjective ideas of what sufficiently discriminates between higher-ability candidates.
Jacobsen: What should be done with homogeneous and heterogeneous tests?
Tolio: Despite constructing homogeneous tests myself, it is clear that heterogeneous tests are the better measure.
It is not possible to extract true g if the test is homogeneous; such a test is likely extracting only the most “general” factor it loads on.
I believe heterogeneous tests are the only way to extract g, because g is simply the manifestation of a unitary excellence across various cognitive facets.
I’ve come to the recent conclusion that a test’s ceiling is best raised by this variety, rather than by the implementation of extremely difficult questions in one area.
The latter is flawed, as such questions are the most challenging to make “healthy.”
- Cooijmans’s tests are a great model, but I would also encourage authors to attempt “inventions” of their own, which may take the form of the problems or tasks themselves.
Jacobsen: What tests and test constructors have you considered good?
Tolio: P. Cooijmans has the most thorough and professional work, which could additionally be used for educational purposes.
- Jouve’s work is professional and takes a rather linear approach. It is rather accurate; however, I believe it could be limiting past a certain threshold.
For that range, I think Johnathan Wai is a better fit; his work is better suited to the high range.
I am not very well acquainted with the older generation of authors, or with non-Western ones.
- Prousalis’s work is promising.
There are other authors whose work I like, but it may lack substantial statistics.
Jacobsen: When trying to develop questions capable of tapping a deeper reservoir of general cognitive ability, what is important for verbal, numeral, spatial, logical (and other) types of questions?
Tolio: While this overlaps with the previous answer, I will add:
Question difficulty may be effectively raised by:
- Careful implementation of additional rules, and of their complexity (reason);
- Increasing the abstraction level of the idea in question (horizon).
These seem to be the primary ways, but not necessarily the only ones.
In the process of tackling an item, a candidate must first generate a plethora of ideas, most of which may not be meaningful for solving the item.
However, the discrimination of such items occurs through the idea itself, and not necessarily through their being difficult to work through (reason).
It is likely that if one is not capable, one will never generate the idea needed to solve the item in question, which requires associative horizon (see P. Cooijmans).
The most important part of every item is whether discrimination occurs. While it may seem intuitive that an author should be very rigorous with their own ideas and logic, or even with the disambiguation of their own items, this perfection is not always necessary in the process of creating a good item.
I’ve come to observe that an author mainly needs to think of the best application of their idea, in combination with the best presentation of it; if done correctly, this should naturally discourage alternative solutions from appearing. In other words, disambiguation may occur as a consequence.
Jacobsen: What is efficient means by which to ballpark the general factor loading of a high-range test?
Tolio: I suppose, since “ballpark” is the word used:
Correlate the author’s test with a test or tests known to have captured g.
This means that correlations with professional tests are absolutely necessary, because they indicate construct validity.
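A minimal Python sketch of this ballpark, assuming paired scores from the same candidates on the new test and on a validated test; all values, including the validated test’s assumed g loading of 0.85, are invented for illustration:

```python
import numpy as np

# Hypothetical paired scores: the same candidates on the new test and on
# a previously validated, g-loaded test (all values invented).
new_test  = np.array([12, 18, 25, 30, 33, 37, 40, 44])
validated = np.array([125, 118, 135, 122, 141, 130, 148, 138])

r = np.corrcoef(new_test, validated)[0, 1]
print(f"Pearson r = {r:.2f}")

# Under the simplifying assumption that only g is shared between the two
# tests, r ~ g_new * g_validated, giving a crude ballpark for g_new.
g_validated = 0.85  # assumed, not measured
print(f"ballpark g loading of the new test ~ {r / g_validated:.2f}")
```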
Jacobsen: What is the most precise or comprehensive method to measure the general factor loading of a high-range test, a superset of tests, or a subset of such a superset?
Tolio: The most precise method, I believe, would be confirmatory factor analysis, for which samples are usually not apt.
This means it may misrepresent a test’s true g loading, and not necessarily toward the “better”; often the contrary, though it can go both ways.
It is also very likely that the extracted factor is not true g, in which case the test has merely extracted a “general factor.”
Correlations with validated tests help to check this.
When omega hierarchical (ωh) is estimated, the square root of ωh is thought to be the g loading of the total score.
Data may need to be prepared very carefully, but a meaningful factor analysis becomes minimally feasible with a clean sample of N = 70-100, albeit not a precise one.
Intercorrelating tests, or one’s own tests, may be flawed because of the assumption that those tests have captured g; unless previous correlations have indicated construct validity, in which case this method may be considered valid.
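As a sketch of the omega computation, assuming a bifactor solution with one general and one group factor (all loadings invented): ωh is the share of total-score variance due to the general factor, and its square root is the estimated g loading of the total score.

```python
import numpy as np

# Hypothetical standardized bifactor loadings for four subtests.
g_load   = np.array([0.70, 0.65, 0.75, 0.60])  # general-factor loadings
grp_load = np.array([0.30, 0.35, 0.25, 0.40])  # one group factor (invented)
unique   = 1 - g_load**2 - grp_load**2         # unique variances

# Total-score variance decomposes into general, group, and unique parts.
total_var = g_load.sum()**2 + grp_load.sum()**2 + unique.sum()
omega_h = g_load.sum()**2 / total_var          # variance due to g alone

print(f"omega hierarchical = {omega_h:.2f}")        # ~0.68 here
print(f"estimated g loading = {omega_h**0.5:.2f}")  # ~0.83 here
```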
Jacobsen: Have test construction and norming processes evolved in the aggregate for you?
Tolio: It started as a creative outlet and hobby for me, and I believe that lack of seriousness may be vaguely reflected; however, I attempt to provide honest statistical work and methods. It has been (and still is) a learning curve for me, in which I’ve tried to educate myself from the variety of work provided by other authors, particularly P. Cooijmans, using it as a stepping-stone to my own conclusions. There is still a lot to learn; in fact, I would claim that I am not even remotely advanced in statistics.
So let me add a disclaimer in case something said here is out of ignorance: P. Cooijmans’s norming method is the definitive best for norming a high-range test, because of the linearity issue posed by z-scores.
I have tried to propose at least one method of my own, which seems to have begun from a flawed premise.
I proposed that many high-range IQ tests do not contain sufficiently discriminating items, thus promoting an essentially “artificial ceiling.”
This means that the items may not scale beyond a certain level, and the tests are likely reporting false IQ scores above it.
I thought of attaching a “grade,” or IQ level, to each problem, estimated from its solvability, which is tied to the reported IQs of the candidates who solved the item; the idea is to extract the lowest “known” IQ that has solved a question and attach that grade to it.
The reason for this would be to ballpark the true IQ content and difficulty present within a test.
This would help promote a healthier ceiling or, at the least, estimate the true discrimination of a test.
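A minimal Python sketch of this grading idea, with invented item names and reported IQs:

```python
# Each item is "graded" at the lowest reported IQ among its solvers
# (item names and IQ values are invented for illustration).
solvers = {
    "item_1": [128, 135, 142, 160],
    "item_2": [145, 151, 168],
    "item_3": [162, 170],
}

grades = {item: min(iqs) for item, iqs in solvers.items()}
print(grades)  # {'item_1': 128, 'item_2': 145, 'item_3': 162}
print("ballpark ceiling:", max(grades.values()))  # 162
```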
Some, but not all, of the flaws:
The flawed premise is that a question absolutely requires a minimum IQ; this assumes that every question is good.
And, of course, there are the dishonest reports of IQs, or scores.
Jacobsen: What is the most common mistake people make when submitting feedback about your tests?
Tolio: I’ve received primarily positive feedback; however, there are often cases in which candidates attempt to point out a supposedly ambiguous item.
To eliminate this issue, while being entirely aware of the information this gives away, I report every single “incorrect” answer given to items.
I hope this approach is transparent enough that there should be no further disputes.
I’m fully aware, however, that such an approach (and the availability of the tests) promotes further familiarity and practice.
I hope to tackle this issue by creating more novel items.
However, it has also served as an experiment for me, and thus far it does not seem to be of extreme benefit to candidates.
That is, despite the incorrect answers being reported, the correct answers are still not “correctly” found by recurring candidates in cases where an item has been reused.
This does not necessarily prove the contrary about the approach, but its impact may be lesser than initially thought.
Footnotes
None
Citations
American Medical Association (AMA 11th Edition): Jacobsen S. On High-Range Test Construction 24: Alex Tolio. November 2024; 13(1). http://www.in-sightpublishing.com/high-range-24
American Psychological Association (APA 7th Edition): Jacobsen, S. (2024, November 22). On High-Range Test Construction 24: Alex Tolio. In-Sight Publishing, 13(1).
Brazilian National Standards (ABNT): JACOBSEN, S. On High-Range Test Construction 24: Alex Tolio. In-Sight: Independent Interview-Based Journal, Fort Langley, v. 13, n. 1, 2024.
Chicago/Turabian, Author-Date (17th Edition): Jacobsen, S. 2024. “On High-Range Test Construction 24: Alex Tolio.” In-Sight: Independent Interview-Based Journal 13, no. 1 (Winter). http://www.in-sightpublishing.com/high-range-24.
Chicago/Turabian, Notes & Bibliography (17th Edition): Jacobsen, S. “On High-Range Test Construction 24: Alex Tolio.” In-Sight: Independent Interview-Based Journal 13, no. 1 (November 2024). http://www.in-sightpublishing.com/high-range-24.
Harvard: Jacobsen, S. (2024) ‘On High-Range Test Construction 24: Alex Tolio’, In-Sight: Independent Interview-Based Journal, 13(1). http://www.in-sightpublishing.com/high-range-24.
Harvard (Australian): Jacobsen, S 2024, ‘On High-Range Test Construction 24: Alex Tolio’, In-Sight: Independent Interview-Based Journal, vol. 13, no. 1, http://www.in-sightpublishing.com/high-range-24.
Modern Language Association (MLA, 9th Edition): Jacobsen, Scott. “On High-Range Test Construction 24: Alex Tolio.” In-Sight: Independent Interview-Based Journal, vol. 13, no. 1, 2024, http://www.in-sightpublishing.com/high-range-24.
Vancouver/ICMJE: Jacobsen S. On High-Range Test Construction 24: Alex Tolio [Internet]. 2024 Nov; 13(1). Available from: http://www.in-sightpublishing.com/high-range-24.
License & Copyright
In-Sight Publishing by Scott Douglas Jacobsen is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. © Scott Douglas Jacobsen and In-Sight Publishing 2012-Present. Unauthorized use or duplication of this material without express permission from Scott Douglas Jacobsen is strictly prohibited. Excerpts and links may be used, provided that full credit is given to Scott Douglas Jacobsen and In-Sight Publishing with direction to the original content.
