
On High-Range Test Construction 2: Hindemburg Melão Jr. on the Sigma Test Extended

2024-08-01

Publisher: In-Sight Publishing

Publisher Founding: March 1, 2014

Web Domain: http://www.in-sightpublishing.com

Location: Fort Langley, Township of Langley, British Columbia, Canada

Journal: In-Sight: Independent Interview-Based Journal

Journal Founding: August 2, 2012

Frequency: Three (3) Times Per Year

Review Status: Non-Peer-Reviewed

Access: Electronic/Digital & Open Access

Fees: None (Free)

Volume Numbering: 12

Issue Numbering: 3

Section: E

Theme Type: Idea

Theme Premise: “Outliers and Outsiders”

Theme Part: 31

Formal Sub-Theme: High-Range Test Construction

Individual Publication Date: August 1, 2024

Issue Publication Date: September 1, 2024

Author(s): Scott Douglas Jacobsen

Word Count: 12,656

Image Credits: Hindemburg Melão Jr.

International Standard Serial Number (ISSN): 2369-6885

*Original article in Portuguese (Brazilian).*

*Please see the footnotes, bibliography, and citations, after the publication.*

Abstract

Hindemburg Melão Jr. is the author of solutions to scientific and mathematical problems that had remained unsolved for decades or centuries, including improvements on the works of 5 Nobel laureates; the holder of a world record for the longest announced checkmate in blindfold simultaneous chess games, registered in the Guinness Book 1998; the author of the Sigma Test Extended; and the founder of several high-IQ societies. Melão Jr. discusses: building tests; his earlier conclusions about existing tests; the origin and inspiration for making tests; some definitions and examples of meanings of words; the levels of the Sigma Test Extended; development or improvement of tests; trying to develop questions that tap into a deeper reservoir of skills; the hurdles that candidates tend to have; the process from conception to development and publication; the ideal number of test takers; tests and test builders; and what he has learned from doing this test and its variants.

Keywords: creativity in test construction, developing new tests, founding Sigma Society, importance of accurate translation, interest in high-range IQ tests, limitations of IQ measurement, potential of high-IQ individuals, recognizing intellectual potential, role of environmental factors, utilizing intelligence for societal benefit.

On High-Range Test Construction 2: Hindemburg Melão Jr. on the Sigma Test Extended

Scott Douglas Jacobsen: Okie dokie, let’s get this show on the road. Like most people in these high-range test construction fields, you are self-taught. A strong point in this is the creativity in test construction. When did this interest in building tests really arise for you?

Hindemburg Melão Jr.: First of all, I would like to thank you for the kind invitation to discuss this important subject. It is a topic that has required attention for many years, but has been neglected and even corrupted in recent years. I will comment more on this in response to a related topic.

In 1991, I made drafts of a test I called “Alpha Tests.” Some questions were interesting, but I still had no idea how to create appropriate standards. In 1997, I started accessing the Internet and in 1999 I discovered Miyaguchi’s website, where several high range IQ tests were available. In the same year I founded Sigma Society and reused some of the old questions from the Alpha Tests, along with other new questions, which gave rise to the Sigma Test.

Initially, the ST was put online in Portuguese; translation software was still very primitive, and I am not fluent in English. I tried to produce a translation using PowerTranslator 7 from Globalink, but it was very bad. Fortunately, several people became interested in the ST and offered to help translate it into other languages, starting with Petri Widsten, who spoke 9 languages fluently. He translated it into English, Finnish, French, and Italian, and before he began other translations, more people offered to revise details in the Italian and French translations and to make new translations. In total, it has been translated into 14 languages. In addition to translating, Petri offered the ST for publication in the magazine Mensalainen, from Mensa Finland, and in IQ Magazine from the International High IQ Society; then Albert Frank published the ST in ComMensal in Belgium and in Gift of Fire by Prometheus. Albert also wrote an article about the ST that was published in Glia’s Papyrus.

Jacobsen: What were the conclusions about the tests at the time and the need to develop your own?

Melão Jr.: If you don’t mind, I’d rather talk about my impressions of the current tests (which include the oldest ones). I believe my current opinion is more useful.

To begin this response, I would like to analyze two recent comments (a few hours ago and a few minutes ago) posted by Tianxi Yu, in which he touches on important points, which illustrate some of the reasons why I developed new tests, new standardization methods and a new scale.

I started to respond to Tianxi’s message, but soon I exceeded Facebook’s character limit. Furthermore, as I developed the answer, I realized that it would be quite suitable as an answer to this question. Considering that the comments are in public posts, I believe my friend Tianxi will have no objection to their being used here, especially because his opinions on this subject are very similar to mine, with few points of divergence. In any case, if he wants me to remove the screenshot, that’s fine with me.

Post 1:

Perhaps what Tianxi meant was not exactly what he said. Generalizations like “always uses” would not be representations of reality in the context in which he used them. I would almost interpret it as the opposite: in my network of contacts, hardly anyone uses a Ph.D. as “proof” (or corroboration, or evidence) of intelligence. They use it for several other reasons, including because it is an achievement after years of effort in a process of acquiring knowledge and training in the application of the scientific method and certain procedures. They use it for social and intellectual prestige in the eyes of the majority; they use it for commercial, professional, and social reasons, etc.

Anyway, I believe that the criticism Tianxi would like to make, based on the context of what he wrote, is that in general people are prouder of a Ph.D. title than of a corresponding IQ (from 125 up, depending on the field and institution) or even a higher IQ, although the title’s rarity level may be lower than the IQ’s rarity level. It would be like a person being prouder of some bronze medals in one modality than of gold medals in another, and this has a derogatory effect on the second modality. In Tianxi’s view, people should feel proud of their genius, and externalize this feeling, and I partially agree with him.

However, people in high IQ societies do not seem engaged in valuing the attributes they have prominently and promoting the recognition of these attributes in the eyes of society. As a result, they lose space to people who “advertise” academic titles that represent less, from an intellectual point of view, but are seen with more admiration and respect by society.

A long analysis would be necessary here, and it would not be possible to analyze all the ramifications. I would select the branch that leads to Andrew Wiles’ criticism of the IMO. Wiles doesn’t place much value on the IMO because its problems are comparatively simple ones that can be solved in 1-2 hours, whereas big real-world problems are much harder and more complex, often taking decades or even several generations to be resolved, like the one Wiles himself solved.

There are several points to consider. The first is that IQ is predominantly genetic, the person didn’t have to work hard to achieve it, so I don’t think there would be much reason to be proud. What could be a source of pride is the use of IQ in solving important problems. In this sense, a typical Ph.D. with 125-135 can contribute more to the common good and the expansion of knowledge than a genius with 190.

This generates discredit and marginalization of high IQ societies, which are not admired or even respected by great intellectuals, nor by the population in general. Most great intellectuals are not even interested in joining these groups. Most of the smartest people are outside of high IQ societies. This does not represent a big problem. But on the other hand, people in high-IQ societies have great potential as “problem solvers”, and there are many difficult problems to be solved in the world, yet there is no effective connection between these points, resulting in an immense waste of potential.

I don’t want to comment on Kim because I’m irritated by his recent attitudes, and I don’t want to run the risk of being unfair with excessively harsh criticism for emotional reasons, but at the same time I can’t help but make an objective and impersonal observation: the number cited for Kim’s IQ, 276, is clearly a joke. Most high range IQ tests measure intelligence reasonably well up to about 170; some go as high as 180, but not much beyond that. They may put labels of 250 on the test norm, but the score does not reflect the correct IQ for levels above 180. I have already made attempts to raise this ceiling with the creation of the ST and the STE, but I am aware that I have not succeeded in achieving complete solutions, although perhaps I was able to push the limit a little higher and improve accuracy at the higher scores.

There are truly brilliant people in high IQ societies, but they haven’t produced much, for different reasons. There are other brilliant people who effectively used their potential in some relevant contributions, such as Petri Widsten, Marco Ripà, João Antonio Locks Justi, and Andrew Beckwith. Among those who have not produced, I see some justifications that seem plausible and fair to me, and others that are lame excuses.

I see my own case as an example of a situation of difficulties and many obstacles: my parents were very poor, and I live in a backward country where people are prejudiced against intelligence, against science, against logic. I started my degree and stopped after 2 months, so I don’t even have half a semester of college. Despite all this, I improved the works of 6 Nobel laureates in Economics and 1 in Physics and made dozens of original contributions in different fields of knowledge. Objectively comparing my contributions to Economics – especially Econometrics – with those of the winners of the Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel, my work is more relevant than that of 90% of the laureates. However, my articles are in Portuguese and are read by few people. Recently, two friends drew attention to this, and it is possible that in 2025 I will receive two or more nominations for the “Nobel” in Economics; this depends, in part, on my articles being translated into English and published in indexed journals.

Of course, if I had been educated in a more stimulating environment, I could have produced much more and better, but even in a hostile and impoverished environment, this did not stop me from developing relevant innovations.

Before continuing the argument, I would like to cite one more example: Newton also faced difficulties in childhood and adolescence. According to some authors, Newton cleaned the floors and carried the chamber pots out of his colleagues’ rooms, among other similar services, in exchange for the opportunity to study, but that didn’t stop him from achieving extraordinary results. Furthermore, he was beaten by his stepfather and suffered bullying at school, and his mother abandoned him as a child to live with a farmer, among other problems. Yet he was perhaps the person who most expanded the horizons of knowledge relative to what was understood before him.

So there is a bit of unfounded whining from certain people, who could and should produce much more. There are others who cannot be blamed, because they are experts in IQ test questions, they do not produce Science because their specific talent does not encompass the kind of aptitude that real world problems require. So it wouldn’t be fair to charge them that. The specific talent for solving questions at the level of difficulty and complexity of IQ tests is like the talent for Chess, or Music, or Mathematics.

In an interview, Fischer said that he “was not a chess genius. He was a genius who played chess, but he could be a genius at any other intellectual activity he chose.” This is partially right. He was indeed a genius with multiple talents, but not equally extraordinary ones. For Chess, he was at a level of perhaps 1 in 100 billion. For other areas, such as Mathematics, Physics or Literature, perhaps at the 1 in 1 million level. Therefore he would probably achieve good results in any activity, but not at such a high level as he achieved in Chess.

The best chess players are not necessarily equipped with cognitive faculties for scientific creation at a level similar to what they have for playing chess. In the case of Mathematics, although it involves cognitive processes more similar to those of scientific production, there are still important differences that make it difficult for the majority of great pure mathematicians to achieve exceptional performance in Physics or Investments.

The typical mathematician is excessively concerned with every detail, with extreme rigor and accuracy, while the physicist is content with plausible finite inductions and reasonable evidence. This allows the great physicist to advance quickly in the analysis of very complex problems, while the great mathematician remains trying to demonstrate something in one of the initial stages of the problem and does not advance beyond that point, because for him it is very important to prove each step.

The physicist is satisfied with 99.9% accuracy (or even a little less) or with a sample of 100,000 events, while the mathematician does not accept even 99.99999999999999999999999999999999999999% accuracy, nor a googolplex of corroborating events, nor even infinitely many corroborating events (if these infinities do not represent all cases). Ernst Kummer proved Fermat’s Last Theorem for infinitely many cases, but they did not represent all cases.

Physicists can build more complex solutions, but with a greater risk of containing errors. In practice, even if there are some “errors”, if the approximations are good enough, things end up working. Ptolemy’s cosmological model, for example, worked, made reasonably accurate predictions, even though it was fundamentally wrong.

What is useful and sufficient for Physics or Astronomy may not be useful for Mathematics. And if the physicist or engineer tried to achieve the same level of rigor as mathematicians for each detail, they would spend much more time on each stage and would not be able, in the short span of a lifetime, to produce much of what exists today. Therefore, it would be a naive mistake to believe that a great mathematician would necessarily be a great physicist if he had chosen to study Physics. Certainly a great mathematician is more likely to be a great physicist than a person drawn at random, or even than a person with great ability in another area that requires talents more different from those required for physics than Mathematics. In other words, there is a strong correlation between competence in Physics and Mathematics, but this correlation becomes weaker at higher levels, where specificities become more relevant.

Therefore, Fischer’s interpretation was partially correct: he was indeed a genius with multiple talents, but not equally high ones. At this juncture, the specific ability to solve IQ test questions is not very useful for predicting or diagnosing high intellectual production capacity in the real world, whether in Science or Mathematics. Even IMO problems, which are more like mathematical creation than IQ test problems, are also not good predictors, as Andrew Wiles warned.

That’s why one of my main objectives with ST and STE was precisely to fill this gap, creating a test that tries to assess the ability to solve major real-world problems. I was pleased with the result, and the ST and successors (ST-VI, STE, STL) have attracted the attention of some prominent intellectuals, and have received much praise.

Among the people who have taken the ST, STE, and STL so far, Petri Widsten scored 212; he is the author of some innovations and patents, wrote the best doctoral thesis in Finland in the 2002-2003 biennium, for which he also received a Summa Cum Laude distinction, and placed first in some international Logic and Puzzle competitions, including this one: http://www.worldiqchallenge.com/rankings.html . Marco Ripà scored 202; he is the author of some innovations in Mathematics, and since he is still very young, he will probably make other contributions even more important than those he has made so far. Some people are taking the STE and STL but haven’t finished yet; they are likely to have high scores. Lukas Pöttrich scored above 200 on other tests, and at age 8 he scored higher on the SAT-Math than Terence Tao did at the same age; Lukas got 800, while Tao got 760, which, as far as I know, is a world record. Diego Andrés de Barros Lima Barbosa (Bronze in the World University Mathematics Championship, 1 Silver and 2 Bronze in the Iberoamerican University Mathematics Championship) and Federica Zanni (Bronze in the IMO) recently registered on the Sigma Society website and spent a long time on the STE page, and Kawan Duarte Guimarães Vieira, Davi Filipe de Melo Pereira, João Italo Marques de Lima, José Osmar de Souza Júnior, Mateus Melo and other young talents in Mathematics, Physics, Chemistry, Computer Science, etc. are taking the STL or STE.

It is very gratifying that the ST, STE, and STL are also well accepted outside high-IQ societies, being recognized as psychometric instruments differentiated by their content and standardization methodology. I feel happy and proud about this, because it leads me to assume that there is good agreement that this type of question is suitable for correctly assessing intellectual production capacity on real problems, and people with good experience in solving very difficult problems agree with that. The IMO, despite the limitations pointed out by Wiles, continues to be the best instrument for predicting great talents in Mathematics, and perhaps the STE is the best for predicting talents for Science, in addition to being the best instrument for intellectual assessment at the highest levels.

I find figure sequence tests interesting because (theoretically) they do not require knowledge, on the other hand they assess a relatively narrow and primitive skill. Chess is heavily saturated with knowledge, but for people who have just learned to move pieces, the kind of skill measured in Chess is better suited to measuring intelligence than the ability to solve series of figures, because in Chess there is much greater complexity and sophistication, in addition to not having a single answer in most cases, but rather a wide variety of answers with different levels of “quality”, bearing greater similarity to real-world problems. Even though Chess is more effective, it is still inadequate to correctly assess the intellectual level, especially at the highest levels.

People have broad sets of general skills at a basic level that are strongly correlated, but as progressively higher levels are considered, the skills branch out and capillarize in different ways, and the correlations begin to weaken. In the IQ range of 70 to 140, grades in Mathematics, Physics, Chemistry, and Writing, as well as IQ scores, generally correlate strongly with each other, between 0.6 and 0.85. But if you consider the range of 140 to 190, the correlation between these same skills becomes much weaker, closer to 0.2 to 0.3. A similar effect occurs with IQ tests that use questions that are appropriate to measure correctly in the 70 to 130 range, or the 90 to 150 range, but cease to be appropriate above 160, and even more so above 170, 180, etc.

Another of Tianxi’s criticisms that needs to be examined carefully concerns people with an IQ of 190 not posting content that he considers compatible with that intellectual level. An exhaustive analysis would take months, but I will try to focus on two points: if a person wants to post photos of cats, or collect license plates (like Sidis), or study alchemy (like Newton) or astrology (like Kepler), this does not reduce their IQ, nor does it cancel out their merits. People must have the freedom to choose their leisure and work activities. But I also understand that if a person exclusively does these things, it can be a waste of potential.

As I mentioned above, the ability to obtain high scores on IQ tests does not imply that the person also has the ability to solve large scientific or mathematical problems. In this case, it is not fair to demand results in Science or any other area. Even if the person has the capacity to produce in Science, I don’t think it’s right to demand anything from them, but it would be desirable for them to be aware of the importance that their potential represents for the common good, and adopt a compatible stance.

Some people with scores of 200+ on IQ tests do not have the attributes necessary for scientific, technological or mathematical production, including high creativity, the ability to maintain focus for years on solving a very difficult problem, the ability to see important details that go unnoticed by the majority, the ability to formulate innovative and more effective strategies for solving specific problems that no one had thought of before, etc. YoungHoon Kim is an example, with scores above 200 on some tests, but I know of no evidence that he has solved any really difficult real-world problems.

In the case of Henri Poincaré, when he worked on the 3-body problem, he thought of a completely different approach from the one other great mathematicians had been adopting. There was a huge redundancy in what the entire mathematical community did, as if 1000 mathematicians were doing almost the same thing. Then Poincaré radically changed the way of analyzing it and, in doing so, made important advances. The same goes for Poincaré’s work on the shape of the Earth, treating the problem from an unusual perspective and with surprising results, which dramatically expanded our understanding of the subject and even led to the creation of a new branch of Mathematics. The same applies to Newton, Cantor and others.

High range IQ tests generally do not include questions that adequately assess this type of ability. They just rely on the bet that the kinds of skills that work for 90 to 160 should also work at levels higher than 170, but practical experience has shown that this is not the case. The design of test questions would need to be very different, with the appropriate attributes required to measure correctly at the highest levels.

When Leonardo da Vinci tried to solve the problem of flying, he did it very differently from what everyone had been doing before him: instead of imitating birds with wings, he tried to understand the essence of the physical laws that explained the flight of birds, and understood that he didn’t need wings; he could do this with a propeller.

The results achieved by Leonardo show that some important advances do not require decades of work, but rather an insight of a few seconds, although implementation may take months, years, decades or centuries. That’s why IMO problems, when the solution depends on this type of insight, end up being more effective in predicting great mathematicians.

In the case of Leonardo’s aircraft, the idea was right, but there was no adequate technology, there were no engines with enough power, there were no sufficiently light and resistant materials. There are small flaws in his idea, such as the absence of a second propeller to compensate for the transmission of angular momentum, but he would quickly discover this if he had an engine and light materials that would allow him to test the prototype, and in the first experiments he would detect the errors, correct them and would end up flying. He would not deduce Bernoulli’s principle, nor Newtonian dynamics, but he would intuitively understand the relevant phenomena and make the thing work, even without knowing the physical concepts or the underlying mathematical formalism.

Einstein is a very interesting case. In a previous conversation with my friend Iakovos Koukas, he said he thought Einstein wouldn’t get 160+ on a modern high range IQ test. I agree, with the caveat that Einstein’s correct IQ is well above 200, perhaps around 245 on an interval scale of antilog potentials with mean 100 and standard deviation 16 (obviously the distribution is not Gaussian). This corroborates that IQ tests are not measuring correctly above a certain point. The tests do measure something above 170, but that something is not a faithful and accurate representation of intelligence.

I’ve already written a lot about this and I won’t repeat it here, but in short, clinical IQ tests use questions suitable up to 130. Some tests generate scores of 155, 183, 197 and even more than 200, but the meaning of these scores can only be interpreted as an adequate representation of intelligence up to about 130 on clinical tests and up to 160—170 on most high range IQ tests. There are two main reasons for this: the difficulty of the questions is inappropriate for higher levels and there is no construct validity at higher levels.

In the article in which I analyze errors in the WAIS – including psychometric, logical, semantic and epistemological errors – some of the most serious problems I point out concern the inadequacy of the tasks for correctly measuring up to 155 or 160. Almost all of the sub-tests are very basic; some of them could be solved by a well-trained chimpanzee. This is useful for evaluating whether an entity (person, animal, AI or ET) can quickly solve tasks with difficulty accessible to an IQ of 80 or less, but solving these tasks very quickly does not indicate an IQ of 100 or 120 or 148.

The psychometric instruments commonly used are good (accurate, reliable, effective) for measuring intellectual capacity up to a certain level. Clinical tests measure up to about 135, regardless of whether nominal ceilings go up to 225, like SB-IV. Some high range IQ tests correctly measure up to around 160 or 170, regardless of whether the nominal scores reach 250.

Some people in high IQ societies have a clear perception of this fact. Others believe (or want to believe) that an IQ of 196 on a test with sequences of figures or numbers is adequate to name one of the 8 most intelligent people alive.

Apparently there is confusion between the meanings of some words, especially the meanings of IQ and IQ test score. Here is an important clarification about the meanings of “IQ”, “intelligence” and “IQ test score”:

Intelligence is an intrinsic ability of the person, which evolves throughout life, generally increasing rapidly until about 15—18 years of age, then continues to increase more slowly until 25—30 years of age, remains almost stable for a few years, and then begins to slowly decline. In my article in which I describe the meanings of the words used in the STL report, I explain this in more detail and present some curves that represent the variation in intellectual level as a function of age.

IQ (intelligence quotient) is the result of mental age divided by chronological age multiplied by 100. If the meaning is changed, the abbreviation must also be changed, replacing the word “quotient”.
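As a minimal numeric illustration of this ratio definition (the ages below are hypothetical examples, not data from the interview):

```python
def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """Classic ratio IQ: mental age divided by chronological age, times 100."""
    return mental_age / chronological_age * 100

# A hypothetical 10-year-old performing at the level of a typical 13-year-old:
print(ratio_iq(13, 10))  # → 130.0
```

Under this definition, a 10-year-old with the mental age of a 16-year-old scores 160, which is the kind of case the numbered points below examine.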

Wechsler proposed a different meaning, but continues to use the term “quotient”. An extensive, complex and in-depth discussion would be appropriate here, but I will summarize the main points:

  1. On the one hand, as the term “IQ” has become widely known, it would be bad to change it. So let’s preserve the term “IQ”, even if it is not the quotient of a division. However, other important facts cannot be lost sight of: Binet and Simon’s initial idea turned out to be reasonably correct. If the curve of evolution of the intellectual level as a function of age is corrected, instead of using linear growth up to 16 years and stability thereafter, Binet’s idea can be rescued with relative success. There are a few more problems that need to be resolved, but fitting an appropriate curve is already an important advance. Another point that needs attention is that, in a “panoramic” view over the decades, a smooth curve offers a good representation, but in a “microscopic” view over short periods, there are seasonal oscillations in this curve, with seasonality throughout the day, the week, and the year. So although there is growth from 0 to 29 years old, a person who wakes up in the morning after 7 hours of sleep at age 11 may be more intelligent than they will be at 12 or 13 after staying awake for 20 hours straight, or with a headache, or under the influence of alcohol. Therefore there are many small fluctuations throughout the day, the week, and the year, which can sometimes be greater than the variation in average IQ from one year to the next. These short-term fluctuations pose a problem for measurements in supervised testing.
  2. A 10-year-old child with the mental age of a typical adult would have an IQ of about 160, but how do we interpret the meaning of this child’s IQ when he is a 20-year-old adult? It would not make sense to consider that it would be equivalent to a 32-year-old adult, nor would there be age values in the corrected curve for an adjustment in this case. In this context, the term “IQ” needs a reformulation, as I explain in the “Golden Book of Intelligence”.
  3. Another important point to consider is that a person who reached the intellectual level of an adult when he was 5 years old is someone who at 5 years old solved problems typical of average adults. This does not mean that this child, when he becomes an adult, will be able to solve much more difficult and more complex problems than an average adult. Generally yes, but not necessarily and not to the same extent. Children like Gauss, Pascal, Galois, von Neumann present, from early childhood, different characteristics that are not present in average adults, and the different attributes of these children are not considered in IQ tests. Children like Ainan Cawley, Adragon de Mello, Michael Kearney, showed abilities of average adults very early, but did not have the differentiated abilities of Gauss or Galois. Sidis’s case is at an intermediate level, he had very early abilities of average adults and also had differentiated abilities that are not present in an average adult, although at a level not as notable as that of von Neumann and others.
  4. The standard deviation calculated based on IQ measured in this way is about 24 for children (depends on age) and 16 for adults. The standard deviation presents significant variations from one test to another, or one sample to another, but in general it is like this. This provides a physical value for the standard deviation, rather than the almost arbitrary value suggested by Wechsler. What Wechsler did would be like measuring people’s heights, finding that there is a standard deviation of 7.23 cm, rounding to 7 cm and changing the entire scale to accommodate that. It is not a recommended procedure and has several undesirable implications. It would only make sense if there was no physical meaning to the standard deviation and the values could be freely manipulated, but that is not the case.
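The rescaling criticized in point 4 can be sketched as follows. This is a hypothetical illustration with made-up raw scores, not real norming data: a Wechsler-style deviation IQ discards the empirically observed standard deviation and maps each raw score onto a fixed mean and SD.

```python
import statistics

def deviation_iq(raw: float, norming_sample: list[float],
                 mean: float = 100.0, sd: float = 15.0) -> float:
    """Wechsler-style deviation IQ: the raw score's z-score within the
    norming sample, rescaled to a fixed mean and standard deviation."""
    z = (raw - statistics.mean(norming_sample)) / statistics.pstdev(norming_sample)
    return mean + sd * z

sample = [40, 45, 50, 55, 60]    # hypothetical raw scores from a norming sample
print(deviation_iq(50, sample))  # sample mean → 100.0
print(deviation_iq(60, sample))  # ≈ 121.2 (about 1.41 sample SDs above the mean)
```

The empirical standard deviation of this sample is about 7.07 raw points, but the reported scale forces SD = 15; this substitution of an imposed value for the measured one is the move compared above to rounding people’s heights onto a different scale.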

IQ test score is the result of an attempt to measure IQ.

Therefore, there is a person’s intrinsic IQ and there is a score that is an attempt to measure intrinsic IQ. People often interpret the score as if it were IQ itself, which is a serious mistake. I’ve even seen people say that “IQ is the variable measured by IQ tests”. It is not. IQ is an inherent attribute of the person, partially genetic, partially influenced by the environment. What the IQ test measures is a set of abilities to perform certain tasks that are assumed to be reasonable representations of intellectual level, therefore useful for estimating intrinsic IQ. These estimates will be better (more accurate, more reliable) if the questions are more suitable for the level of ability that the test intends to measure.

Considering traditional tests, scores are usually strongly correlated with true (intrinsic) IQ within a certain range, as long as the test meets certain conditions, especially construct validity for the respective IQ range. Often tests meet these conditions in a narrower range than the one they are intended to measure, resulting in skewed scores at one or both ends.

This discredits the scores, because they do not correctly predict intellectual level. When Terman selected his 1,528 children with IQs above 135 in 1926 and followed their evolution for decades, it became clear that they were in fact much more productive than the population average in cultural, financial, professional and academic success. This is because the tests Terman used discriminate correctly between those above 130 and those below 130. However, they fail within the range above 130. Two future Nobel laureates were examined by Terman and both were excluded because they scored below 130 on the tests applied. Furthermore, there is the famous case of Feynman, who had a score of 123, although he was a Putnam winner, a Nobel laureate in Physics and the author of numerous contributions to Science.

Given this scenario, in order for there to be greater credibility in the results produced by IQ tests at different levels, a broad reformulation of metrics, methods and processes is necessary.

Tianxi talks about “pride of genius”, but what exactly would that be? Pride in finding the next number or figure in a sequence? It might be a difficult sequence, and there is certainly some merit in that, but it would be better to focus on solving some of the big real-world problems. They don’t need to be BIG, but they should be problems that broaden the horizons of knowledge and generate benefits for humanity. This seems to me a fairer and more sensible reason to be proud, in addition to being a more correct indication of high intelligence. I am not mixing moral and intellectual criteria in the evaluation process. Creating new and “better” (more effective) weapons, as Archimedes and Leonardo did, is also a sign of high intelligence, but one applied to harming people. This is part of the thesis I defend. Another part of the same thesis is that it would be desirable to use intelligence for Good, but intelligence is not measured by the size of the good it generates.

I find Tianxi’s point of view interesting, though perhaps differing in small details. The profile of the person he describes in his critique is perhaps more similar to what is found in some chapters of Mensa. In the case of Mensa Brasil this is common; there really are many people who fit what Tianxi described, but I don’t see many people like that in other high IQ societies. So perhaps the criticism should be directed more precisely at a specific group. Anyway, what I consider important about this comes down to 3 items:

  1. Correct the bizarre theoretical percentiles, which are obviously wrong in cases far above 130, especially above 160.
  2. Improve standardization methods.
  3. Improve the content of the questions.

I resolved items 1 and 2 in 2003; item 3 I partially improved in 2000, continued improving until 2006, and resumed in 2022.
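
For item 1, the size of the problem is easy to see under the textbook model: assuming a normal distribution with mean 100 and standard deviation 16, each IQ maps to a theoretical one-in-N rarity, and it is these theoretical values that, in the author’s view, diverge sharply from empirical rarities above 130. A minimal sketch of the normal-model computation (the SD of 16 and the choice of sample IQs are assumptions for illustration):

```python
import math

def theoretical_rarity(iq: float, mean: float = 100.0, sd: float = 16.0) -> float:
    """One-in-N rarity implied by the normal model (upper tail)."""
    z = (iq - mean) / sd
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability
    return 1.0 / p

for iq in (130, 148, 164, 180):
    print(f"IQ {iq}: about 1 in {theoretical_rarity(iq):,.0f}")
```

These are the nominal rarities the model claims; the argument in the text is precisely that such theoretical percentiles stop matching reality far above 130.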

Post 2:

This second post mentions some friends and I prefer not to discuss this point. But generally speaking, I have observed similar problems. In our first In-Sight Journal interview, I already discussed some of these points, so I won’t repeat them here. I would just like to elaborate on some previous comments.

ST and STE solve some of the problems that were open, among which the following could be listed:

  1. Establishment of a proportion scale. This need was identified by Thurstone in the 1940s and has been the Holy Grail of Psychometrics. Until 2003, the scales were approximately interval for scores below 130 and ordinal when including scores above 130, with distortions in the scale. With my 2003 ST norm I introduced the first scale in which the antilogs of the scores lie on a potential proportion scale, preserving uniform intervals across the spectrum and with a conceptually valid meaning.
  2. Improvement of construct validity, especially at higher levels. Unfortunately, I was not able to resolve this completely, but I made relevant advances.
  3. Adjustment of the difficulty of the questions, seeking to cover all the levels that the test proposes to measure. With STE the real difficulty ceiling of the high range IQ tests rose a few points. Although there may still be, near the ceiling, distortions between nominal and real scores, these are smaller distortions than in other tests.
  4. Appropriate weighting of points depending on the difficulty of each question. This has several important effects, especially minimizing the penalty for carelessness when a person gets a very difficult question right and misses some very easy ones.
  5. Assigning fractions of points to each item, with fair weighting, to refine the score.
  6. Review of rarity levels and percentiles associated with each score, especially at the highest levels. I had already written an article about this in 2001 and revised it in 2002, but it was theoretical. In 2003 I gathered data to provide an empirical approach, quantitatively showing the size of the distortions and correcting them. I also calculated new norms for the Mega and Titan, using raw data available on Miyaguchi’s website about these tests. The Sigma Test norms were also calculated based on this new methodology, which is explained in more detail in my article https://www.sigmasociety.net/escalasqi 
  7. Determination of the “proportion of potential”, as well as the introduction of this concept, which is necessary as part of the standardization process, and also brings some new useful information for different purposes. This is also analyzed in more detail in the article cited above.
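
Items 4 and 5 above can be illustrated with a toy scoring scheme. The weights and the linear partial-credit rule below are hypothetical, not the actual Sigma Test weighting; the point is only that weighting by difficulty softens the penalty for missing easy items while still rewarding a hard solve.

```python
# Illustrative difficulty-weighted scoring: each question carries a
# weight proportional to its difficulty, and each answer earns a
# fraction of credit in [0, 1]. Weights and the linear scheme are
# made up for this sketch.

def weighted_score(answers: list[tuple[float, float]]) -> float:
    """answers: (difficulty_weight, credit in [0, 1]) per question."""
    total = sum(w for w, _ in answers)
    earned = sum(w * c for w, c in answers)
    return 100.0 * earned / total

# A careless taker misses two easy items (weight 1) but solves a very
# hard one (weight 10); a raw item count would rank them below the
# steady taker who only solved the easy and medium items.
careless = [(1, 0), (1, 0), (2, 1), (4, 1), (10, 1)]
steady   = [(1, 1), (1, 1), (2, 1), (4, 1), (10, 0)]
print(weighted_score(careless), weighted_score(steady))
```

Under this scheme the careless taker scores well above the steady one, which is the intended effect of item 4.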

In the most recent version of the STE, there were a few more small improvements, including an attempt to determine the curves of variation in intellectual level as a function of age for different IQ ranges. No data from the STE itself was used for this, but rather data on the evolution of the Chess rating as a function of age combined with results from other tests.

At the end of 2023, I started writing the “Golden Book of Intelligence”, simultaneously with other books (“Apodictic Guide” and “Project T”). In the “Golden Book of Intelligence” I present some contributions to Psychometrics, including a review of the WAIS, a review of Richard Lynn’s study on the average IQ in several countries, an exhaustive review of the meaning of “intelligence”, demystifying some models such as those of Guilford and Gardner, reviewing and improving some concepts such as “fluid” and “crystallized” intelligence, and proposing that the meaning of intelligence varies with IQ, among other topics.

Jacobsen: So, you’re the creator of the Sigma Test Extended. You intend this to be the most difficult and reliable cognitive test. What was the origin and inspiration for creating this test – the facts and feelings?

Melão Jr.: I think that in some previous answers I ended up answering this one too. 🙂

Perhaps it is worth commenting a little more on construct validity here, which is extremely important. Several subtests of the WAIS measure latent traits that are not closely related to intelligence, although they are correlated for indirect reasons. This requires a more detailed explanation, and I will use an example to make it more didactic: the “Information” subtest has almost no relation to intelligence; its questions are shallow, with simplistic answers that require no analysis. Despite this, there is a moderate or even strong correlation between intelligence and cultural level, because intelligent people generally also acquire more culture. But this correlation becomes weaker at higher levels and undermines the measurement.

It would be possible to formulate questions that required more complex knowledge, involving analysis. For example: “Why did Einstein, rather than Poincaré or Lorentz, receive credit for the Theory of Relativity?” This is the type of knowledge that would lead to a complex and dense discussion, instead of the automatic repetition of memorized information, and in this case it would be better related to intelligence. On the other hand, this example has some problems: the examiner would need to be exceptionally smart and master the topics related to each question. Another problem is that this is a very specialized question, and if the person being examined did not have much knowledge of the topic, they would not be able to give an adequate answer even if they were exceptionally intelligent, and in that respect it would be bad.

However, if the test included questions such as those in the WAIS “Information” subtest, it would be desirable for them to be questions that required in-depth and complex analysis, rather than simple repetition, while at the same time minimizing the need for specific knowledge to perform the analysis. Even so, there would remain the “problem” of requiring exceptional intelligence from the examiner. Therefore, ideally, questions should avoid specialized knowledge but require thought as part of the answer, rather than simple mnemonic retrieval.

Despite this problem in the “Information” subtest, the scores in this subtest show a moderately strong correlation with the rest of the test and with other tests. This happens because in the range from 80 to 120, generally more intelligent people are also more educated, but above 120, the cultural level progressively ceases to be a good representation for the intellectual level.

We can make an analogy with height; although the correlation between intelligence and height is weaker, the effect is easier to understand. Intelligent people are also generally taller, but it would not be appropriate to include a subtest based on the person’s height in the total score calculation, because although there is a positive correlation between height and the rest of the test, the correlation weakens at higher levels and becomes practically null above a certain point, generating more spurious noise than it contributes to measurement accuracy.

If one of the subtests were simply measuring height, a person 2.20 m tall with 135 on the rest of the test would be no smarter than someone 1.50 m tall with 138 on the rest of the test. The same problem occurs when using an “Information” subtest, which impairs measurement at higher levels.

Of course, there are some fundamental differences and this analogy is not entirely fair, because culture can provide some tools that help with problem solving, while height cannot (or at least not at the same level). But the point is that the effective weight of culture, that is, how much culture contributes to the total intellectual level, is much smaller than the weight the “Information” subtest has in the total score, resulting in distortions for IQs above a certain level instead of making the score more accurate. In other words, high scores on the WAIS would be more accurate if the “Information” subtest, which hinders more than it helps, were removed.
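
The height analogy can be checked with a small Monte Carlo simulation. The 0.3 loading, the noise scale and the cut at two standard deviations are invented parameters; the sketch only shows the mechanism described above, namely that a weak true relation plus range restriction makes the correlation shrink toward zero in the top slice.

```python
import random

random.seed(42)

# Toy model: "height" = 0.3 * g + noise. The 0.3 loading and unit
# noise are made-up parameters for illustration.
n = 50_000
g = [random.gauss(0, 1) for _ in range(n)]
height = [0.3 * gi + random.gauss(0, 1) for gi in g]

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Restrict to g > 2 SD, the analogue of "above IQ 130" in the text.
top = [(gi, hi) for gi, hi in zip(g, height) if gi > 2.0]
r_all = pearson(g, height)
r_top = pearson([t[0] for t in top], [t[1] for t in top])
print(f"r(full range) = {r_all:.2f}, r(top of range) = {r_top:.2f}")
```

The full-range correlation sits near 0.29, while in the top slice it collapses toward zero, which is why such a subtest adds noise rather than precision at high levels.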

In a practical example: a person with an IQ of 150 on the WAIS who got all the Information questions right and only 2 of the Arithmetic questions right is not as intelligent as someone who got all the Arithmetic questions right but missed 2 Information questions, or even missed all of the Information questions. There is a similar problem in the “Vocabulary” subtest, as well as different problems in other subtests.

Jacobsen: What skills and considerations, in general, seem important both for constructing test questions and for creating an effective outline for them?

Melão Jr.: There are several different skills, and the lack of some of them can be compensated by excellence in others. For example: vast knowledge of varied questions can compensate for less creativity in creating new questions, and vice versa. So there is no “closed” set of required skills.

Regarding standardization, there are good statistical tools, but cognitive models are still poor. Guilford’s opinions add nothing useful and Gardner’s opinions bring more problems than solutions. They call these opinions “theories”, without any empirical verification or attempt at falsification. In Gardner’s case, some recent studies have made it clear that the “multiple intelligences” he proposes are a fantasy. This was predictable and relatively obvious. If Gardner were right, almost every other science that uses Factor Analysis would be in trouble, since it is an important tool in Physics, Astronomy, Economics, Sociology, etc.

The people who promoted relevant advances in Psychometrics were Galton, Cattell (James McKeen Cattell, not Raymond Cattell, whose contributions were minor and unrelated to this specific topic), Pearson, Spearman and Thurstone, in addition to those who contributed to IRT models, such as Birnbaum and Lord. I could include Georg Rasch in this list and perhaps a few others. Binet’s works were also important, from a different perspective. Wechsler was a disproportionate success: he contributed half a cent’s worth, even made some things worse, and raised the suspicions I discuss in my article about the WAIS.

The contributions of Pearson, Spearman and Thurstone go beyond the field of Psychometrics and have found space in many other areas. Almost all current major scientific theories use Pearson’s linear correlation: Lemaître and Hubble discovered the recession of galaxies using correlation, and Henrietta Leavitt discovered the relationship between the period and luminosity of Cepheids using correlation, among many other discoveries. Thurstone’s contributions were even more notable and could be said to have appeared “ahead of their time”, only beginning to be more widely used much later, including in AI in recent years and decades.

Analyzing the big names in Psychometrics and the traits they have in common, we can intuit some characteristics that are useful for a good understanding of the area. In the standardization process, a good understanding of Statistics is important. When preparing questions, it is more difficult to determine which skills matter, as I mentioned in the first paragraph of this answer. But generally creativity and rigorous logical thinking avoid certain problems, as I mentioned in the case of Cooijmans’s STH in our 2022 interview.

Jacobsen: You give some definitions and examples of meanings of words used in the Sigma Test. So any interested reader can get definitions there. Technically, how long has the Sigma Test been in development leading up to the Sigma Test Extended?

Melão Jr.: The first questions still present in some Sigma tests were created in 1991, but there was no continuous work throughout that time. In 1991, I dedicated a few hours over a few days. In 1999 I dedicated about 1 week to new questions for the ST, some based on known problems and others new. The standardization process took longer, and I improved it as I received more responses: as the number of completed tests grew, I implemented tools and methods that were not possible with smaller samples, and also created some new tools and methods. In 2007 I closed ST applications.

When the STE was created, I included almost all the questions from the ST and some from the ST-VI, as well as some from the Moon Test. This process took a few weeks. The STL was a joint creation with Tamara; she prepared several questions.

Some differences between the STL and the previous tests are that many questions use video and photos, showing a real situation from different angles. One can find solutions at different levels and through different methods: just as the methods of Roemer, Bradley, Fizeau, Foucault, Froome and others measure the speed of light with very different strategies and very different levels of accuracy, the answers can be reached in different ways. Video questions also make the use of AI difficult, although it is only a matter of time before new AIs emerge.

So the first questions were formulated in 1991, but the total time dedicated to constructing the test was somewhere between 200 and 300 hours. Time spent on standardization is difficult to estimate because there have been many updates, but perhaps around 1,000 to 3,000 hours. If you count the time related to the study and creation of statistical tools, methods, etc., perhaps 10,000 to 30,000 hours, but it would not be correct to interpret all this time as applied to the test, because many of the statistical tools developed were for other purposes, especially Econometrics, risk management and genotype ranking.

Jacobsen: You separate the levels of the Sigma Test Extended into Level I (100) Average, Level II (110) Above Average, Level III (120) Superior Intelligence, Level IV (132) Gifted, Talented, High Skills, Level V (144), Level VI (156), Level VII (168), Level VIII (184), Level IX (202), Level X – EXTRA (221). If we correlate these 10 levels to real-world achievements or merit recognition, what jobs, achievements, educational achievements, etc., should we generally expect at each level of the Sigma Test Extended?

Melão Jr.: For scores below 130, it might be useful to reproduce some studies on typical IQs in different professions. Searching on Google, you can find many lists, tables and graphs of this kind.

It is important to highlight that in each profession there are quite wide ranges that overlap. We must also remember that Langan, Rosner and Grady Towers have worked in activities very incompatible with their intellectual level, just like me and my father. Therefore, factors such as networks and cultural aspects of certain countries may be more relevant than IQ in positioning a person professionally or even academically.

It is also important to remember that specific skills weakly correlated with IQ can play a central role in success and diverse achievements. Nakamura, for example, may not have an IQ above 120 or 130, but he has a very developed talent for chess and has achieved a rating that people with an IQ of 180 or 200 would normally not reach even if they trained hard for it. The same goes for different professions that may require specific skills, such as surgeon, where fine motor coordination could not be replaced by any IQ score.

Having made these reservations, we can try to make some estimates of typical achievements for each IQ range.

In this study I review typical IQs at different universities in the USA: https://www.sigmasociety.net/artigo/qi-universidade-escolas 

With an IQ above 160, depending on the area of activity and the nature of the research carried out, the possibility of winning a Nobel Prize becomes plausible. Although there are cases of Nobel Prize winners with an IQ below 140 and even below 130, what is observed is that the vast majority of people with an IQ above 160 do not win the Nobel Prize; therefore, having an IQ of 160 cannot be interpreted as a predictor of a high probability of a Nobel Prize, but it can be interpreted as “meeting a minimum requirement” for it. It’s not easy to answer this, because exams like the SAT and GRE are not appropriate for testing above 130, and most Nobel laureates have never taken an IQ test with appropriate difficulty and construct validity at their level. Studies that indicate an average IQ of around 155 for Nobel laureates in Science simply reflect the inadequacy of IQ tests for measuring at the highest levels. It would be naive to think that Nobel laureates are at the 1-in-3,000 level of intellectual rarity. The most reasonable interpretation is that they were examined with inadequate tests.

A more realistic estimate would be about 170 to 180 for the average Nobel Science winner, and perhaps 160 is an “inclusive” cut-off point.

In general, most presidents of different countries have an IQ between 120 and 155, rarely above 160 or below 120. Information has circulated on the Internet that George Bush Sr. had an IQ of 91 or 102, but he obtained a score of 1206 on the pre-1974 SAT, which would correspond to about 132, a more plausible figure for a president with the minimum attributes for the role. Netanyahu is cited as having a 180; I never got around to researching in depth the accuracy of this information or the adequacy of this score (the information may be legitimate, but the score may be based on inadequate testing). I think it’s reasonable that Netanyahu could actually have something between 160 and 180, but it’s a rare case.

Therefore 130 may already be enough to become president in most countries, which represents a serious problem. The problems a president must deal with are extremely difficult and complex, to the point that not even 190 or 200 would be enough to adequately resolve most issues. The big mistake is that heads of state are appointed based on elections. There should be a better set of criteria, based on effective ability to deal with the country’s problems. When David Ben-Gurion invited Einstein to be president of Israel, it seemed to me an extremely intelligent and appropriate invitation, although the methodology (invitation) is very dangerous; it can work if the person (or committee) making the invitation is suitable and competent.

To work at Big Techs, 150 to 160 is usually enough. Champions of the IMO and similar competitions generally have around 170 to 190; occasionally they can have much more, but they rarely have much less than 170. Around 170, in conjunction with a lot of training and a specific talent for Mathematics or Physics, can represent good chances of medals at the IMO and other intellectual olympiads. The correlation of IQ with Chess is weaker than with Mathematics, and this correlation decreases at higher levels, so it would not be possible to make many predictions about Chess achievements based on IQ.

People like Musk, Gates, Zuckerberg and Bezos generally have an IQ between 150 and 160, but very few people with an IQ between 150 and 160 reach the level of financial success they did, because that depends much more on other factors, including luck, networks, discipline, dedication, etc. In Leonard Mlodinow’s book “The Drunkard’s Walk”, the author analyzes several cases in which, in large population samples, luck can play a large role in determining success at a very high level, and he attributes great luck to Gates and others. In my opinion, in these cases luck also accounts for most of the result obtained, but talent was also fundamental. If Gates had merely been lucky, he obviously would not have developed the products or managed the various situations successfully. Factors related to personality also end up being very important. IQ is just one of the variables in determining economic success, and its weight depends on several other factors. In some cases IQ can be decisive, in others almost irrelevant.

The cases of Musk and Jobs are a little different. Musk may have an IQ of less than 160, but he appears to be very creative, at a level equivalent to about 180. Jobs scored 1440 on the GRE, which corresponds to about 148, but most likely the GRE did not correctly reflect his IQ or his creativity, which would be much higher, perhaps at a level of creativity a little below Musk’s.

For awards such as the Fields Medal, Abel Prize and Einstein Prize, the “necessary” IQ is similar to that “necessary” for the Nobel Prize, but accompanied by a set of specific aptitudes for Mathematics. This does not mean that the average IQ of the winners of these awards is similar to the average of the Nobel Prize winners in Science. As the rarity is greater and the demands are similar, I estimate that the average IQ is slightly higher among the winners of these Mathematics awards.

Jacobsen: What does this scaling suggest for the development or improvement of tests like the WAIS?

Melão Jr.: The article I wrote about WAIS points out some problems, but from a strictly technical point of view, I don’t believe it is appropriate to “fix” WAIS. Due to the number of corrections, it would be more interesting to start something from scratch. However, from a commercial point of view, as WAIS already has good acceptance, for this reason a broad review could be justified (commercially).

Jacobsen: When trying to develop questions that tap into a deeper reservoir of skills, what is important about verbal, numerical, spatial, and other types of questions?

Melão Jr.: In some cases, it may be interesting to create exclusively sequences of numbers or figures, or both. In other cases, tests with analogies and/or associations. In other cases, a diversified test with a heavy dose of randomness in that diversification may be preferable. In the introductory text of the STE I discuss some negative aspects of a test consisting exclusively of sequences of figures, or exclusively of associations, or exclusively of analogies, which results in a very high “internal consistency”; what that may mean is a narrowing of the skills measured, redundancy, and other undesirable effects.

“Internal consistency” is not even the right term: Cronbach’s Alpha measures homogeneity, which should not be interpreted as “internal consistency”. A very high Cronbach’s Alpha indicates that the test measures a very narrow and redundant range of latent traits, and this may not be very useful if the primary goal is to measure the g factor, which is a broadly applicable trait.
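
Cronbach’s alpha itself is simple to compute: alpha = k/(k−1) · (1 − sum of item variances / variance of totals). The toy data below are invented; they make the point of the paragraph above by showing that three nearly identical items yield an alpha at the maximum, which signals redundancy rather than broad measurement of g.

```python
def cronbach_alpha(items: list[list[float]]) -> float:
    """items[i][p]: score of person p on item i (population variances)."""
    k = len(items)
    n = len(items[0])

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = sum(var(it) for it in items)
    totals = [sum(items[i][p] for i in range(k)) for p in range(n)]
    return k / (k - 1) * (1 - item_vars / var(totals))

# Three items that are the same latent score shifted by a constant:
# perfectly correlated, hence maximally "homogeneous".
base = [3, 5, 2, 4, 1, 5, 2, 3]
redundant = [base, [b + 0.1 for b in base], [b - 0.1 for b in base]]
print(f"alpha = {cronbach_alpha(redundant):.3f}")
```

Here alpha comes out at 1.0 even though the three items add no breadth at all, which is exactly why a very high alpha can indicate narrowing rather than quality.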

On the other hand, it has been verified that tests consisting exclusively of sequences of figures, such as the Raven, the Cattell, or some subtests of the WAIS and DAT, present a sufficiently strong correlation with scores on more comprehensive tests to allow the scores on these figure-based tests to serve as accurate estimates of g, at least in the 75 to 125 IQ range and perhaps a slightly wider spectrum.

At levels above 140 and, especially, above 150, the use of these questions becomes increasingly inappropriate. The complexity and difficulty that can be achieved in a test based on sequences of figures are limited, and such items are also solved by exhaustive attempts rather than by brilliant and profound ideas. So what is being measured is something more akin to persistence, patience and determination than to intelligence. Some Power Test questions can be solved in very laborious, time-consuming, non-creative and non-ingenious ways. The STE also presents this problem in some questions; unfortunately, I was not able to eliminate it completely. But in the STE this ends up being a contamination of the question, not its essence: the main difficulty lies in having some creative idea, while part of the solution also requires a laborious and time-consuming process, so I consider it “tolerable”. If, however, the problem can be solved exclusively through a laborious and time-consuming process, without the creative idea, the purpose is lost. In some cases it is very difficult to prevent the solution from being laborious and time-consuming, but one should try, whenever possible, to require creativity and deep thinking in the most difficult questions.

Jacobsen: What hurdles do candidates tend to face in terms of thought processes and assumptions about time commitments on these tests, such that they get artificially low scores on high-range tests?

Melão Jr.: This is an interesting and difficult problem to solve. Perhaps there is no complete solution, because to serve people who do not have much time, it would be necessary to impose time pressure and thereby harm those interested in engaging with very difficult and time-consuming questions. Andrew Wiles criticized the IMO precisely because the time available is too short (3 hours) to propose challenges with an appropriate level of difficulty and complexity, compromising the purpose of trying to identify future great mathematicians. On the other hand, there would be many operational difficulties if the IMO competition lasted much longer. There would be the problem of lack of supervision, or the need to host people from several countries for a long time at the competition headquarters, and monitoring them continuously could raise privacy concerns, since people would need to be supervised after seeing the problem statements: if a person took 10 days to solve a problem, they would need to be monitored so as not to receive help or use prohibited means. Alternatively, supervision could be dispensed with if the questions were unsolved real-world problems.

It would be an interesting idea to hold math and science Olympiads lasting a few months, using much more difficult problems, including unsolved real-world problems, gathering sponsors, etc. But apparently the organizers of these events are satisfied with the way things are.

My focus has been on correct measurement at the highest levels, so I have not been as concerned with the problem you described in this question, but it does represent a source of distortion in scores. On the other hand, I believe that most traditional tests used in clinics already meet this requirement reasonably well, measuring with good precision and accuracy in the range of 70 to 130. I believe that the IQ range in which errors are still large, and which needs greater attention, is at the highest levels, and in these cases time does not seem to be such a demotivating factor, because the people involved are generally much more competitive and for them it is important to achieve as much as they can, reducing the risk of distortions associated with the time required for resolution.

I also read the text you sent me with the interview with AntJuan Finch and it seems to me that he is already doing excellent work in this regard, as well as Chris Cole, increasing reliability in unsupervised online tests, and encouraging more people to take the tests in a short time and at no cost. With this, I believe that an alternative to clinical tests has emerged with a comparable (or higher) level of accuracy and reliability, accessible to a greater number of people.

Jacobsen: Without spoiling the mental sport of HRTs, what was the process from conception to development and publication of the Levels I to IV STE questions? What was the process from conception to development and publication of the STE questions for levels V through VIII? What was the process from conception to development to publication of the STE questions for levels IX and X?

Melão Jr.: I will try to give an answer by grouping this question and the following two, choosing some items that I consider most interesting to be analyzed individually and making some general comments about all the items.

Some questions are trivial and there would be no way to avoid that entirely, due to the relatively low difficulty, but even among the questions for levels I to IV I tried to require the person to understand some facts, rather than just apply a formula. I couldn’t go too far into the explanation without providing some important “clue”, but I can say that some Ph.D.s in Physics, Engineering and Mathematics missed fundamental details in some questions that seem trivial.

The information that the questions are roughly ordered by difficulty is useful for knowing that some questions that seem easy actually are not, and that there are “hidden” details to be discovered. It’s not a “prank”; that’s not the objective. These hidden details are “natural” and important ones that people should consider but often don’t notice. In some ways they are similar to the Monty Hall problem, which seems simple and obvious at first glance, but when you start to dig deeper you realize that there are subtleties and complexities.

Question 22 is an interesting example that the vast majority got wrong, including astronomers and mathematicians. I even thought about moving this question to a higher level, because if you consider the number of correct answers out of the total number of respondents, it has a lower success rate than questions at higher levels. However, I decided to keep it where it is because it is not actually “more difficult”; the problem is that people underestimate its difficulty. There are people from the Giga Society who got it wrong, but I believe that if they had “respected” the difficulty more and believed that it was at a level compatible with its difficulty, they would have analyzed it more carefully and gotten it right. This comment is in a way a useful “clue”, but I don’t see a problem in providing it, because the position of this question at level V is also a clue; however, people don’t believe it is level V and this leads to error, so I see no harm in reinforcing that it really is level V and maybe a little higher.

Question 35 raised a long debate with Peter David Bentley, D.Phil. (the Oxford equivalent of a Ph.D.) and postdoctoral researcher in Physics at the University of Oxford. Petri Widsten and Albert Frank entered the debate. When a person scores above 180, they are notified of any question they got wrong, and they can argue that their answer should be accepted, and that happened in this case. It was an analysis that lasted several days. (This question was part of the ST; Peter did not take the STE.)

Question 50 has a detail that perhaps I should make more explicit, because some people have consulted the distance from the Moon to the Earth in ephemeris software, and this does not actually violate the general statement of the test, which allows using any available resource. So perhaps I should make it clearer that for this specific question the person needs to use the data available in the photo and text of the statement, which is why higher-resolution photos are available for download. When a person solves it using ephemeris software, I ask them to send it again using the photos.

Question 45 has also received responses in which the person underestimated the difficulty, and I ask them to send it again.

People generally realize that there are hidden subtleties that make the problem more difficult than it seems at first glance, but in some items most people don’t notice.

In question 48, I wanted to get an idea of whether people in high-IQ societies were aware that the percentiles in groups above 130 are wrong and that the error grows at higher levels, and whether they have an approximate idea of the magnitude of the error. Apparently the vast majority are aware that at the highest levels there are large errors.

The questions that I find most interesting are 51, 49 and 23. Among the easy questions, 19 is one of the ones that I find most interesting. When I say “interesting”, it is because they differ more from standard problems and require resolution methods that depart from traditional paths. Question 19 is not quite like that, as it is simple, but it has some interesting peculiarities for the difficulty level it is at.

Jacobsen: Pragmatically speaking, for really good statistics, what is the ideal number of test takers? You can’t say “8,000,000,000”.

Melão Jr.: The method I describe in the 2003 Sigma Test standard has a list of important advantages compared to other methods. One of these advantages is enabling more accurate standards based on fewer samples. This happens for a simple reason: in the theoretical normal distribution, rarity decreases rapidly. As measured IQ becomes higher, the addition of a few IQ points implies a large increase in the level of rarity, and test questions are not naturally adjusted to keep pace.

For scores below 140, and especially below 130, IQ scores generally grow almost linearly with raw scores, and this tracks reasonably well with the theoretical rarity corresponding to each score. But for much higher scores, a gain of 2 or 4 points in the raw score should not add even 1 point to the IQ, because that 1 IQ point would imply a very large increase in rarity. In practice, however, IQ scores continue to grow almost linearly with raw scores even for IQs above 140, 150, 180…
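The explosive growth of theoretical rarity that this argument relies on is easy to verify numerically. The sketch below (my illustration, not part of the interview) computes the one-in-N rarity of a score under an ideal normal distribution with mean 100 and sd = 16, the deviation used elsewhere in this discussion:

```python
import math

def rarity(iq, mean=100.0, sd=16.0):
    """One-in-N rarity of a score under a theoretical normal distribution."""
    z = (iq - mean) / sd
    tail = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(score > iq)
    return 1.0 / tail

# Each 16-point step (one standard deviation) multiplies the rarity enormously:
# roughly 1 in 44 at IQ 132, 1 in 741 at 148, 1 in 31,574 at 164,
# and about 1 in 3.5 million at IQ 180.
for iq in (132, 148, 164, 180):
    print(iq, round(rarity(iq)))
```

This is why a raw-score gain that maps almost linearly onto IQ points cannot track theoretical rarity at the high end: equal IQ increments correspond to wildly unequal increments in rarity.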

The real problem is not this almost linear growth, but the belief that the real distribution of scores continues to adhere to a normal distribution for scores well above 130, because this obviously does not happen. The number of people with IQs above 200 (sd = 16) is much higher than would be predicted under the hypothesis that IQs are normally distributed across the whole spectrum. When scores are standardized using Wechsler’s method, they are forced to fit a normal distribution, but this only happens within the range determined by the size of the sample used in the standardization (generally 2,000 to 3,000 people).

Between 70 and 130, the “natural” distribution of scores is very similar to a normal one, and with a “push” it is possible to force scores from 130 to 150 to be normalized as well, but in a sample of 3,000 people from a non-selected population it is not possible to push scores above 155 close to normal, and the distribution collapses. Even if it were possible to use a sample of 8 billion people and push all the scores to the predicted theoretical rarity positions, this would not help at all; it would only expand the distortion by widening the range in which the scores lose their interval properties.

Wechsler’s idea of standardizing scores was interesting and solved some problems, but it creates others. In Measurement Theory, whenever possible, it is important that the variable of interest be on a ratio scale. If it is not, it is recommended to adopt appropriate transformation methods to place it on a ratio scale. Height, for example, is naturally on a ratio scale. IQ measured as the ratio between mental and chronological age is naturally on something close to a ratio scale. But when Wechsler intervened, he distorted most of the scores in order to “fix” the problems of IQ variation with age and of the wider standard deviation for children.

One of the appropriate solutions is the one I propose in the 2003 ST norm, with an updated version from 2022 in this article: https://www.sigmasociety.net/escalasqi. It completely reformulates the standardization method, generating scores on a ratio scale (via an antilog transformation), correcting rarity levels to realistic values and allowing more accurate normalizations with smaller samples, in addition to other advantages.

We can make an analogy with height or with chess. First, height: if you try to estimate a person’s height based on rarity level, you will need gigantic samples to measure above 2.10 m, and you will still have serious distortions in the results. But if you use a ruler, a tape measure, a Leica laser rangefinder or any other length-measuring tool, you standardize the scale intervals and eliminate the need for large samples.

Chess example: to measure Carlsen’s strength at his peak (2882) with reasonable accuracy and precision based on his results against opponents rated 1000, hundreds of thousands of games between them would be necessary, because the theoretical probability favors Carlsen at an approximate ratio of 50,000:1, so with 100,000 games the expectation would be only 2 points for the player rated 1000. If that player scored 1 or 3 points, the error would be large relative to the 2 points expected, with great uncertainty in the measurement. A sufficient sample would require the player rated 1000 to score at least a few dozen points, and for that a few million games against Carlsen would be needed, making it unfeasible.

However, one could introduce players rated 1500, 2000 and 2500. The one rated 2500 would play 1,000 games against the one rated 2000 and another 1,000 games against Carlsen. The 2000 would play 1,000 against the 2500 and 1,000 against the 1500. The 1500 would play 1,000 against the 2000 and 1,000 against the 1000. This way, with a few thousand games it would be possible to achieve a more accurate and precise estimate of Carlsen’s rating, because the expected score over a 500-point interval is about 94.68% of the points for the stronger player, so the weaker player would score a few dozen points in each match.

Generalizing the same idea, instead of players rated 1000, 1500, 2000 and 2500, one could include several players with different ratings playing against each other, using something like the Swiss pairing system, so that players of similar strength are preferentially paired with each other; this would optimize the accuracy and precision of the measurement without needing a huge number of games. With players whose ratings step up in 100-point increments covering the range from 1000 to 2800, and a network of a few hundred games between them, it would be possible to make a more accurate estimate than if millions of games were played with the player rated 1000 facing Carlsen directly.
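The arithmetic behind these chess figures can be checked with the standard Elo expected-score formula. This sketch (my illustration of the well-known formula, not code from the interview) reproduces both the 94.68% expectation over a 500-point gap and the roughly 50,000:1 odds for Carlsen against a 1000-rated player:

```python
def elo_expected(r_a, r_b):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# A 500-point gap: the stronger player expects ~94.68% of the points.
print(round(elo_expected(2500, 2000), 4))  # 0.9468

# Carlsen at 2882 vs a 1000-rated player: odds of roughly 50,000:1.
p = elo_expected(2882, 1000)
print(round(p / (1 - p)))  # on the order of 50,000
```

The chain of intermediate opponents works precisely because each 500-point link keeps the expected score far enough from 0 or 1 to be measurable with a feasible number of games.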

This is only possible because the method for calculating chess ratings, adopted by Arpad Elo, uses a Rasch-type system. If you tried to evaluate the strength of players based on rarity or percentile, it would not work, and you would need a very different path with much larger samples.

For this to work with IQ tests, the standardization method needs to be the one I described in the 2003 norm, which also uses a Rasch-like model. The calculation is then essentially equivalent to treating each test item as an opponent in chess. Solving an item means “winning”. The difficulty of the items is equivalent to the strength of the opponents. And for everything to make sense, the approach I give to the problem with the concept of “potential IQ” is necessary.
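The item-as-opponent idea rests on the dichotomous Rasch model, whose textbook form is a logistic function of the gap between a person’s ability and an item’s difficulty. A minimal sketch (my illustration of the standard model, not the specific calculation used in the Sigma Test norm):

```python
import math

def rasch_p(theta, b):
    """Rasch model: probability that a person with ability theta solves
    an item of difficulty b (both expressed on the same logit scale)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# A person whose ability equals the item's difficulty has a 50% chance;
# the probability rises smoothly as ability exceeds difficulty.
print(rasch_p(0.0, 0.0))            # 0.5
print(round(rasch_p(2.0, 0.0), 3))  # 0.881
```

Under this model, a set of items with graded difficulties plays the same role as the ladder of intermediate chess opponents: each item “anchors” a region of the ability scale, so extreme abilities can be estimated without enormous samples.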

With this, one can measure at very high levels with relatively small samples. There is a more detailed description in the book “Chess, 2022 best players of all time, two new rating systems”, in which I discuss several additional details, including the problem of the draw, which in the chess Elo system is inadequately valued as “0.5”, without the necessary adjustments to preserve the consistency of the method.

The problem with the draw value arises because the Rasch model used by Elo was created for dichotomous variables, while chess is trichotomous. Arpad Elo tried some fixes but could not find a good solution, and settled for simply awarding 0.5 for a draw. There is a 2015 study by Miguel Ballicora that attempts to assign a “fair” value to the draw; it represents an advance over the Elo system, but it still incurs several other errors. In my book, I analyze this subject in detail.

Jacobsen: What tests and test builders have you found to be good?

Melão Jr.: I will try to give a generic answer, which complements some of the comments I have already made in the introductory texts for the Sigma Test Extended and the Sigma Test Light (I also recommend reading these as a complement). I see 3 main problems (I could divide the problems into 4, 5 or 6 groups, or another number, but in this case I believe that 3 allows an adequate description).

  1. Inadequate construct validity, especially at higher levels.
  2. “Naive” and inflated norms for scores above 135, with progressively greater distortion in higher scores.
  3. Inadequate difficulty.

I could also mention other problems, such as leakage of solutions, retests with fake names, etc. But I will focus on the 3 above.

Good tests, meaning those that avoid all of these problems, are rare. Furthermore, there are tests with even more serious problems, such as norms based on 1 or 2 people, or even on 0 people. In some cases it is very difficult to avoid starting standardization with 0 people, but then it would be more prudent to estimate a conservative initial norm and eventually correct it upward after collecting empirical data; however, what is most often observed is the opposite.

Therefore, good tests are those that avoid these problems and present a sufficiently large number of items at different levels of difficulty, so as to measure correctly in each IQ range while preserving construct validity at each level.

Another point to consider is that a test may be suitable for a certain IQ range, but not for a different range. WAIS is a good example. Although it has several flaws, it generates scores that are very close to correct in the range of 85 to 115, and reasonably correct in the range of 75 to 125. It still generates acceptable scores between 70 and 135. Above that, the errors are already worrying. The Power Test can measure well between 110 and 150, and still generates reasonable results up to 160.

Jacobsen: What did you learn from doing this test and its variants?

Melão Jr.: Psychometrics uses some tools that are widely used in other areas, but it also has its own tools, which are rarely used in other areas. I ended up learning some new statistical tools, in addition to developing others.

Jacobsen: Thank you for the opportunity and your time, Melão.

Melão Jr.: I thank you for the reminder and the stimulating questions!

Bibliography

None

Footnotes

None

Citations

American Medical Association (AMA 11th Edition): Jacobsen S. On High-Range Test Construction 2: Hindemburg Melão Jr. on the Sigma Test Extended. August 2024; 12(3). http://www.in-sightpublishing.com/high-range-2

American Psychological Association (APA 7th Edition): Jacobsen, S. (2024, August 1). On High-Range Test Construction 2: Hindemburg Melão Jr. on the Sigma Test Extended. In-Sight Publishing. 12(3).

Brazilian National Standards (ABNT): JACOBSEN, S. On High-Range Test Construction 2: Hindemburg Melão Jr. on the Sigma Test Extended. In-Sight: Independent Interview-Based Journal, Fort Langley, v. 12, n. 3, 2024.

Chicago/Turabian, Author-Date (17th Edition): Jacobsen, Scott. 2024. “On High-Range Test Construction 2: Hindemburg Melão Jr. on the Sigma Test Extended.” In-Sight: Independent Interview-Based Journal 12, no. 3 (Summer). http://www.in-sightpublishing.com/high-range-2.

Chicago/Turabian, Notes & Bibliography (17th Edition): Jacobsen, Scott. “On High-Range Test Construction 2: Hindemburg Melão Jr. on the Sigma Test Extended.” In-Sight: Independent Interview-Based Journal 12, no. 3 (August 2024). http://www.in-sightpublishing.com/high-range-2.

Harvard: Jacobsen, S. (2024) ‘On High-Range Test Construction 2: Hindemburg Melão Jr. on the Sigma Test Extended’, In-Sight: Independent Interview-Based Journal, 12(3). <http://www.in-sightpublishing.com/high-range-2>.

Harvard (Australian): Jacobsen, S 2024, ‘On High-Range Test Construction 2: Hindemburg Melão Jr. on the Sigma Test Extended’, In-Sight: Independent Interview-Based Journal, vol. 12, no. 3, <http://www.in-sightpublishing.com/high-range-2>.

Modern Language Association (MLA, 9th Edition): Jacobsen, Scott. “On High-Range Test Construction 2: Hindemburg Melão Jr. on the Sigma Test Extended.” In-Sight: Independent Interview-Based Journal, vol. 12, no. 3, 2024, http://www.in-sightpublishing.com/high-range-2.

Vancouver/ICMJE: Jacobsen S. On High-Range Test Construction 2: Hindemburg Melão Jr. on the Sigma Test Extended [Internet]. 2024 Aug; 12(3). Available from: http://www.in-sightpublishing.com/high-range-2.

License & Copyright

In-Sight Publishing by Scott Douglas Jacobsen is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. ©Scott Douglas Jacobsen and In-Sight Publishing 2012-Present. Unauthorized use or duplication of this material without express permission from Scott Douglas Jacobsen is strictly prohibited; excerpts and links may be used, provided that full credit is given to Scott Douglas Jacobsen and In-Sight Publishing with direction to the original content.
