
ChatGPT-4o and Six Other AI Chat Models All Fail China’s College Entrance Exam Math Paper   2024-06-23

 



Seven large language models, including OpenAI’s ChatGPT-4o, were made to ‘sit’ China’s notoriously difficult college entrance exam recently. They did relatively well in the English and Chinese language tests, but each one failed the math paper.

ChatGPT-4o, as well as open-source models developed by China’s Alibaba Group Holding, 01.AI, Zhipu AI, and Shanghai Artificial Intelligence Laboratory, and by France’s Mistral AI, were put to the test by OpenCompass, the Shanghai AI Lab’s evaluation system.

China’s tough college entrance exams are a good way of gauging LLMs’ intelligence, the Shanghai AI Lab said. The tests were all marked manually, and the examiners were not told that the papers had been completed by machines. The exams contained both objective and subjective questions, it added.

Alibaba’s Qwen 2-72B performed best, scoring 303 points out of a total of 420 across the three subjects, according to the results published by OpenCompass yesterday. It was followed by US firm OpenAI’s ChatGPT-4o with 296 and the Shanghai AI Lab’s InternLM 2.0 with 295.5. Mistral AI’s LLM came last with 185.

Each one failed the math test, however. InternLM 2.0 achieved the highest score of just 75 points out of 150. GPT-4o was second with 73. 

The examiners found that the generative AI models’ answers to subjective math questions were illogical and confused. Sometimes the final answer was correct even though the reasoning was wrong. The LLMs are able to memorize formulas well, but they have trouble explaining how they solved the problems.

This shows that LLMs have much room to improve their math skills, Lin Dahua, a scientist at the Shanghai AI Lab, told Yicai. Math involves complex reasoning, which is a key ability if LLMs are to be used in finance and other vital areas.

The AI models performed well in modern Chinese, but there was a big gap in their knowledge of classical Chinese. Qwen scored highest in the Chinese paper with 124 out of 150 points, while GPT-4o excelled in English with 109 out of 120 points.

In English, most humans who take the test lose points for not writing enough, but the AI models tended to have points deducted for exceeding the word limit.

Source: Yicai Global

 

