The Electronics and Telecommunications Research Institute (ETRI) has developed LoTa-Bench, a technology that can automatically evaluate the performance of task-planning artificial intelligence (AI) that, when a person gives a verbal command, understands the task on its own, forms a plan, and carries it out.
▲ETRI researchers testing LoTa-Bench technology, which automatically evaluates the performance of procedural generation artificial intelligence (AI).
Cutting the cost and time of evaluating language-model plan generation while keeping the evaluation objective
A Korean research team has developed the world's first technology to automatically evaluate the performance of task plans generated by large language models (LLMs), which is expected to enable fast and objective evaluation of plan-generation capability.
The Electronics and Telecommunications Research Institute (ETRI) announced on the 7th that it has developed LoTa-Bench technology that can automatically evaluate the performance of procedure-generating artificial intelligence (AI) that understands, plans, and executes work procedures on its own when a person gives verbal commands for the work.
ETRI announced that according to the ALFRED-based benchmark results, OpenAI's GPT-3 had a success rate of 21.36%, GPT-4 had a success rate of 40.38%, Meta's LLaMA 2-70B model had a success rate of 18.27%, and MosaicML's MPT-30B model had a success rate of 18.75%.
The larger the model, the better its plan-generation capability tended to be. A 20% success rate means that 20 out of 100 generated task plans succeeded.
The research team published a paper at the International Conference on Learning Representations (ICLR), one of the world's top artificial intelligence conferences, and released performance evaluation results for plan generation across a total of 33 large language models, obtained with this technology, through GitHub.
The research team said that with the development of this technology, the time and cost of evaluating the performance of robot task planning technology using large language models can be significantly reduced.
In addition, the research team expects that by releasing the software as open source, companies, schools, etc. will be able to freely use this technology, thereby accelerating the development of related technologies.
The LoTa-Bench technology developed by ETRI executes the task plan generated by a large language model from the user's command and automatically compares the resulting state against the instructed goal to determine success or failure.
As a result, evaluation time and cost are minimized, and the results are objective.
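The evaluation loop described above can be sketched as follows. This is a minimal illustration only: `Simulator`, `llm_generate_plan`, and the goal-checking call are hypothetical stand-ins, not the actual LoTa-Bench API.

```python
def evaluate(tasks, llm_generate_plan, simulator):
    """Run each LLM-generated plan in simulation and score success automatically.

    Hypothetical sketch: `simulator` is assumed to expose reset/execute/
    state_satisfies; the real benchmark's interface may differ.
    """
    successes = 0
    for task in tasks:
        # The LLM turns the natural-language instruction into a step list.
        plan = llm_generate_plan(task["instruction"])
        simulator.reset(task["scene"])
        for step in plan:
            simulator.execute(step)
        # Success = the final simulator state matches the instructed goal,
        # with no human judging involved.
        if simulator.state_satisfies(task["goal_conditions"]):
            successes += 1
    return successes / len(tasks)
```

Because success is checked against simulator state rather than by a human rater, the same task set can be re-run cheaply against any new model.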
The performance evaluation was conducted in virtual simulation environments built for research on robot and embodied-agent intelligence: AI2-THOR from the Allen Institute for AI in the United States and VirtualHome from MIT.
The evaluation used datasets containing routine household task instructions, such as "Put the cooled apple in the microwave," together with the corresponding task procedures.
Additionally, taking advantage of how easily and quickly LoTa-Bench can validate new plan-generation methods, the researchers identified two strategies that improve plan-generation performance: in-context example selection, and feedback and replanning.
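The two strategies can be sketched in simplified form. The word-overlap similarity score and the failure-feedback string below are illustrative placeholders, not the paper's exact method.

```python
def select_examples(instruction, example_pool, k=2):
    """In-context example selection: pick the k pool examples whose
    instructions are most similar to the new instruction (here, crude
    word overlap stands in for a real similarity measure)."""
    words = set(instruction.lower().split())
    scored = sorted(
        example_pool,
        key=lambda ex: len(words & set(ex["instruction"].lower().split())),
        reverse=True,
    )
    return scored[:k]

def plan_with_replanning(instruction, generate, execute_step, max_retries=3):
    """Feedback and replanning: when a step fails in simulation, feed the
    failure back to the model and ask for a revised plan."""
    feedback = None
    for _ in range(max_retries):
        plan = generate(instruction, feedback)
        for step in plan:
            if not execute_step(step):
                feedback = f"step failed: {step}"
                break  # stop executing, regenerate with feedback
        else:
            return plan  # every step succeeded
    return None  # gave up after max_retries attempts
```

Both techniques improve the plan without retraining the model: one shapes the prompt before generation, the other corrects the plan after execution feedback.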
In addition, the effect of improving plan-generation performance through fine-tuning was also confirmed.
ETRI Social Robotics Lab Principal Researcher Jang Min-soo said, "LoTa-Bench is the first step in developing AI for task-plan generation. Going forward, we plan to research and develop technologies that can predict task failures in uncertain situations, or that continuously improve their task-planning intelligence by asking people questions and receiving help. This technology is essential for realizing the era of one robot per household."
ETRI Social Robotics Lab Director Kim Jae-hong said, “ETRI is dedicated to research and development of advanced robot intelligence using the Foundation Model to realize robots that can create and execute various mission plans in the real world.”
This technology was developed as part of the 'Human-Centered Artificial Intelligence Core Source Technology Development Project' of the Ministry of Science and ICT and the Institute of Information and Communications Technology Planning and Evaluation (IITP) through the 'Development of Agent Technology that Recognizes Uncertainty and Grows While Asking Questions' project.