Introduction
In an era where artificial intelligence and robotics are reshaping industries, the promise of home robots stands out as a beacon of innovation. Imagine a robot that can assist with household chores, recognize family members, engage in meaningful conversations, and adapt to dynamic environments. This vision inspired my graduation thesis, where I designed and implemented a mobile robot system tailored to interact seamlessly within a home environment.
A few years later, I learned that my project had been used as teaching material and as the foundation for teams from the School of Automation, Southeast University, competing in RoboCup@Home China. I'm very proud that my work could contribute to the competition. It also reinforced my desire to dive deep into Human-Computer Interaction, combining cutting-edge technology with business needs to create inspiring products and services that solve customers' problems.
Below is a list of the technical abilities that the RoboCup@Home tests currently focus on:
- Navigation in dynamic environments
- Fast and easy calibration and setup: The ultimate goal is to have a robot up and running out of the box
- Object recognition
- Object manipulation
- Detection and Recognition of Humans
- Natural human-robot interaction
- Speech recognition
- Gesture recognition
- Robot applications: RoboCup@Home aims for applications of robots in daily life.
This project aims to develop highly relevant service and assistive robot technology for future personal domestic applications. The focus lies on, but is not limited to, the following domains: Human-Robot Interaction and Cooperation, Navigation and Mapping in home environments, Computer Vision and Object Recognition under natural light conditions, Object Manipulation, Adaptive Behaviors, Standardization, and System Integration.
Core Functions
Scenario: Welcoming Friends at Home with the Robot Assistant
It’s a quiet afternoon at home, and you’ve invited a couple of friends over to relax and catch up. Just as you hear the doorbell ring, your robot assistant springs to life, ready to help create a seamless and welcoming experience.
Step 1: Greeting the Guests
As your first friend steps into the house, the robot assistant moves forward to greet them. Its camera focuses on their face, and it says warmly:
“Hello, welcome! May I know your name?”
Your friend replies, “I’m Alex.” The robot repeats their name to confirm and then adds:
“It’s great to meet you, Alex. May I ask, what’s your favorite drink?”
Alex responds with a smile, “I love Cola.” The robot confirms their choice and saves their details, associating the name and drink preference with their face.
When your second friend arrives, the robot repeats the same process, efficiently recognizing and recording their name and drink preference as well.
If one of your friends is a frequent visitor, the robot recognizes them instantly and greets them with a personalized message like:
“Welcome back, Alex! Would you like the usual Cola today?”
Step 2: Guiding the Guests to the Living Room
Once both friends are inside, the robot gestures and says:
“Please follow me, and I’ll guide you to your seats.”
It smoothly navigates through the house, leading your friends to the living room. It identifies the unoccupied seats and gestures politely:
“Here are two comfortable seats for you. Please, have a seat.”
Step 3: Personalized Drink Delivery
After your friends are seated, the robot turns to you for a moment and confirms:
“Would you like me to prepare the drinks for Alex and Jamie?”
You nod, and the robot replies, “Understood. Please relax while I handle it.”
The robot moves to the kitchen, identifies the correct drinks, and carefully retrieves them. It returns to the living room and delivers the drinks to each guest by name:
“Alex, here’s your Cola. Jamie, here’s your favorite soda. Enjoy!”
Even if the robot can’t physically pick up the drinks, it uses a clear gesture or light indicator to direct your guests to their beverages, ensuring a thoughtful touch.
Step 4: Creating a Relaxed Atmosphere
As the conversation flows, the robot subtly monitors the interaction, staying on standby in case anyone needs assistance. Its graphical user interface, displayed on a nearby screen, shows a friendly message such as:
“Let me know if you need anything else!”
If your friends move around, the robot adjusts and remains responsive, making sure they feel at home.
Product Function and Modules
The product’s core functionalities include the following five aspects:
- Visitor Identification Based on Visual Recognition: Assuming there are two visitors, each arrives and pauses in front of the robot upon entry. The robot’s camera detects and captures their faces, automatically recognizes them, and saves the visitors’ facial images.
- Voice-Based Greeting Function: After detecting the visitor’s face, the robot provides a voice greeting and asks for the visitor’s name. It confirms the name by repetition and records it. The robot then asks for the visitor’s favorite beverage, confirms it, and records the information. Both the name and the preferred beverage are linked to the facial image, with the image file named after the visitor’s name.
- Seating Guidance Function: The robot directs the guest to an unoccupied seat. It issues a voice prompt, “Please follow me,” and leads the guest to the available spot, arranging for them to sit.
- Beverage Delivery Function: After the guest is seated, the robot moves to the beverage station, picks up the guest’s preferred drink (represented by an empty bottle if necessary), and delivers it to the correct guest. (If the grasping function is not feasible, the robot can instead point to the beverage.)
- Graphical User Interface Control: A graphical user interface is implemented to control the entire process.
Additional Potential Enhancements:
- Adding object recognition to enable functionalities like object grasping and delivery.
To meet the basic requirements, the product requires at least five functional modules: a camera detection module, a microphone acquisition module, a path planning module, a trajectory optimization module, and an arm control module. The correspondence between functions and modules is shown in Table 1.
Table 1: Functions and Modules
| Function | Camera Detection Module | Microphone Acquisition Module | Path Planning Module | Trajectory Optimization Module | Arm Control Module |
| --- | --- | --- | --- | --- | --- |
| Visitor identification based on face recognition | ✓ | | | | |
| Voice-based greeting function | ✓ | | | | |
| Seating guidance function | ✓ | ✓ | ✓ | ✓ | |
| Beverage delivery and identification | ✓ | ✓ | ✓ | | |
| Graphical interface control | ✓ | ✓ | ✓ | ✓ | ✓ |
Design and Methodology
System Overview
The system is built on a Turtlebot platform, enhanced with a HOKUYO URG-04LX-UG01 scanning laser rangefinder and an RGB-D camera for advanced functionality. Facial detection and recognition use MTCNN and FaceNet models, trained and fine-tuned on an NVIDIA RTX 2080 Ti GPU or on Google Colab.
Development Environment
- OS: Ubuntu 16.04
- Frameworks: ROS for Turtlebot development, CUDA 9.0, and TensorFlow 1.7 for model training.
Final Deployment
The fully developed system was successfully deployed on the MORO dual-arm robot, integrating navigation, facial recognition, and robotic control. Since the MORO platform was still at an early stage at the time, its grasping tasks were replaced by pointing to the drinks; the actual grasping was demonstrated separately with a KUKA LBR iiwa robot arm and an RGB-D camera.
Visitor identification
The visitor identification task module primarily uses an RGB-D camera to actively capture images for face detection and recognition. The goal is to identify whether a detected face is new or already stored in the system database and transition seamlessly to the inquiry task once a face is recognized.
Process Flow
Initialization:
- The robot begins in a stationary position at the door.
- The camera thread is activated to continuously capture frames.
Face Detection:
- Each frame is analyzed for face detection with a time limit of 3 seconds per detection attempt.
- If no face is detected within 3 seconds, the robot performs small rotational movements to expand its field of view and continues detecting.
Recognition and Database Check:
- If a face is detected:
  - The system marks this recognition task as successful.
  - It checks the database for the detected face:
    - If the face is new, the calculated facial feature data is extracted and stored in the database.
Failure Handling:
- If no face is detected after 3 seconds of rotation, the system confirms whether any face was found during the current session.
  - If no face was found, all threads are stopped, and the robot pauses operations for 3 seconds before restarting the recognition task.
  - If a face was found, the recognition task concludes, and the system transitions to the inquiry task.
(The chart is generated by ChatGPT.)
Face recognition itself will be introduced in a separate blog post as part of the face recognition learning path.
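To make the detection loop above concrete, here is a minimal Python sketch of the timeout-and-rotate logic. The helpers `detect_faces()` (a stand-in for the MTCNN detector) and `rotate_in_place()` (a small base rotation command) are hypothetical placeholders rather than code from the project, and the real system ran the camera capture in its own thread.

```python
import time
import cv2  # OpenCV, used here only for camera capture

DETECT_WINDOW = 3.0  # seconds allowed per detection attempt, as in the flow above

def detect_faces(frame):
    """Hypothetical wrapper around the MTCNN detector; returns a list of face crops."""
    return []  # placeholder

def rotate_in_place(angle_deg):
    """Hypothetical command for a small rotational movement of the robot base."""
    pass  # placeholder

def try_detect(cap, timeout=DETECT_WINDOW):
    """Grab frames until a face is found or the timeout expires."""
    start = time.time()
    while time.time() - start < timeout:
        ok, frame = cap.read()
        if ok:
            faces = detect_faces(frame)
            if faces:
                return faces[0]
    return None

def visitor_identification(camera_index=0):
    cap = cv2.VideoCapture(camera_index)
    try:
        while True:
            face = try_detect(cap)
            if face is None:
                # No face within 3 seconds: rotate slightly to widen the field of view.
                rotate_in_place(15)
                face = try_detect(cap)
            if face is not None:
                return face  # hand the face over to the inquiry task
            # Still nothing: pause for 3 seconds, then restart the recognition task.
            time.sleep(3)
    finally:
        cap.release()
```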
Voice-based greeting function
The voice-based greeting function is realized with iFlytek voice transcription, which converts the spoken replies into text that is stored in the database together with the face information and other personal details. (I had not studied NLP at the time, so I turned to open-source tools and third-party services for help.)
The inquiry task module leverages a microphone array, an RGB-D camera, and a speech output module to interact with household members and collect their preferences. For example, the robot might ask, “What’s your favorite drink?” Upon receiving a response, it converts the speech to text, detects keywords (e.g., Sprite, Coke, milk tea), and stores the data for future use.
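As a small illustration of the keyword-detection step, the sketch below scans a transcription for known drink names. The keyword table and function name are my own placeholders; the real system's vocabulary and matching rules may differ.

```python
# Minimal keyword spotting over the transcribed text.
DRINK_KEYWORDS = {
    "可乐": "Cola", "雪碧": "Sprite", "奶茶": "milk tea",
    "cola": "Cola", "sprite": "Sprite", "milk tea": "milk tea",
}

def extract_drink(transcript):
    """Return the first known drink mentioned in the transcript, or None."""
    text = transcript.lower()
    for keyword, drink in DRINK_KEYWORDS.items():
        if keyword.lower() in text:
            return drink
    return None

print(extract_drink("我是猪八戒,我喜欢喝可乐"))  # -> Cola
```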
Process Flow
- Module Initialization:
  - Start the RGB-D camera thread and microphone array thread.
  - Rotate the camera to search for faces.
- Face Detection:
  - Once a single face is detected, the camera aligns to keep the face centered.
- Interactive Questioning:
  - The robot asks: “What’s your name?”, then processes and saves the name.
  - It then asks: “What’s your favorite drink?”, transcribes the response, and extracts key preferences.
- Data Storage:
  - The system links the detected face with the name and preferences.
  - It packages the data into a structured record in the database.
- Repeating the Process:
  - After one member’s information is collected, the system checks for other undetected faces.
  - If faces remain, the camera rotates to locate another person and repeats the inquiry process.
- Task Completion:
  - When all faces have been detected and information collected, the module concludes the inquiry task and transitions to the guidance task.
(The chart is generated by ChatGPT.)
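The Data Storage step above could be backed by something as simple as a local SQLite table that links the visitor's name, preferred drink, and the face image file named after them. The schema below is purely illustrative; the thesis does not specify the exact database layout used in the project.

```python
import sqlite3

def init_db(path="members.db"):
    """Create the (illustrative) members table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS members (
               name        TEXT PRIMARY KEY,
               drink       TEXT,
               face_image  TEXT   -- path to the image named after the visitor
           )"""
    )
    conn.commit()
    return conn

def save_member(conn, name, drink, face_image_path):
    """Insert a new visitor, or overwrite the record if the name already exists."""
    conn.execute(
        "INSERT OR REPLACE INTO members (name, drink, face_image) VALUES (?, ?, ?)",
        (name, drink, face_image_path),
    )
    conn.commit()

conn = init_db()
save_member(conn, "Alex", "Cola", "faces/Alex.jpg")
```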
Guidance function
The primary functionalities required for this task are marker search and recognition, motion planning, and obstacle avoidance. Each function works in harmony to ensure the robot can execute tasks efficiently and safely. Markers are used on the Turtlebot platform; on the MORO robot they can be replaced by pre-programmed waypoints that guide the robot to predefined places.
You can visit the blog posts regarding motion planning and obstacle avoidance to learn the specific algorithms. TODO: Motion planning blog links.
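The thesis does not specify which marker type was used for marker search and recognition. As one common option, the sketch below detects ArUco markers with OpenCV's `aruco` module (opencv-contrib-python, pre-4.7 API) and returns their ids; treat it as an illustration rather than the project's actual marker pipeline.

```python
import cv2
import cv2.aruco as aruco  # requires opencv-contrib-python

def find_marker_ids(frame):
    """Detect ArUco markers in a camera frame and return the list of visible ids."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    dictionary = aruco.Dictionary_get(aruco.DICT_4X4_50)
    parameters = aruco.DetectorParameters_create()
    corners, ids, _rejected = aruco.detectMarkers(gray, dictionary, parameters=parameters)
    return [] if ids is None else ids.flatten().tolist()

cap = cv2.VideoCapture(0)
ok, frame = cap.read()
if ok:
    print("Visible marker ids:", find_marker_ids(frame))
cap.release()
```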
Process flow
- The robot initializes the task by scanning for the first marker to determine its starting point.
- Upon detecting a marker:
  - The marker’s information is decoded to determine the next action.
  - The robot plans its route to the specified target area.
- During navigation:
  - Obstacles are detected and avoided using real-time LiDAR data.
  - The robot updates its path dynamically to ensure smooth movement.
- If the robot reaches the final marker, it concludes the task or transitions to a new task as per the marker’s instructions.
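On the ROS stack used here (Ubuntu 16.04, ROS Kinetic), the "plan a route to the target area" step is typically handled by sending a goal to the `move_base` action server. The snippet below is a generic example of that pattern; the node name and coordinates are placeholders, not values taken from the project.

```python
#!/usr/bin/env python
# Send a single navigation goal to move_base and wait for the result.
import rospy
import actionlib
from move_base_msgs.msg import MoveBaseAction, MoveBaseGoal

def go_to(x, y):
    client = actionlib.SimpleActionClient('move_base', MoveBaseAction)
    client.wait_for_server()

    goal = MoveBaseGoal()
    goal.target_pose.header.frame_id = 'map'
    goal.target_pose.header.stamp = rospy.Time.now()
    goal.target_pose.pose.position.x = x
    goal.target_pose.pose.position.y = y
    goal.target_pose.pose.orientation.w = 1.0  # face along the map x-axis

    client.send_goal(goal)
    client.wait_for_result()
    return client.get_state()

if __name__ == '__main__':
    rospy.init_node('guide_to_seat')   # hypothetical node name
    go_to(1.5, 0.8)                    # hypothetical coordinates of an unoccupied seat
```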
Grasp Function
Result
Face Recognition
Face Recognition with LFW Dataset and MTCNN
MTCNN Training on LFW Dataset
- The LFW dataset was used to train MTCNN for face detection.
- The LFW data, along with a custom-built face dataset, was divided into 80% for training and 20% for testing and validation.
- The final detection accuracy was 96.4%.
FaceNet Training with Processed Images
- All images in the LFW dataset were resized to 160×160 pixels using MTCNN preprocessing.
- The processed images were then fed into the FaceNet model for training.
- The trained FaceNet model achieved a recognition accuracy of 98.5%.
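For the database check, a FaceNet-style pipeline typically compares a new embedding against the stored ones using L2 distance and a tuned threshold. The sketch below shows that comparison; the threshold value and the random vectors standing in for real embeddings are placeholders, not numbers from the project.

```python
import numpy as np

MATCH_THRESHOLD = 1.0  # illustrative; the real threshold is tuned on a validation set

def identify(embedding, database):
    """Return the best-matching member name, or None if the face appears to be new.

    `database` maps a member name to a stored FaceNet embedding (e.g., a 128-d vector).
    """
    best_name, best_dist = None, float("inf")
    for name, stored in database.items():
        dist = np.linalg.norm(embedding - stored)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist < MATCH_THRESHOLD else None

# Toy usage with random vectors standing in for real embeddings.
db = {"Alex": np.random.rand(128)}
print(identify(np.random.rand(128), db))
```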
Challenges in Recognition Accuracy
- Misrecognitions occurred in two primary scenarios:
a. When an image contained two or more faces, the target face might not be selected.
b. Poor lighting, occlusions, or unfavorable angles reduced the confidence score of the target face compared to others.
Example of MTCNN Misrecognition
- In one example, the male face had a confidence score of 0.78, while the female face had a score of 0.85.
- The target was the male face, but due to occlusion, MTCNN detected and cropped the female face instead.
- As a result, the male face was excluded.
Conclusion:
MTCNN and FaceNet demonstrate high accuracy for face detection and recognition. However, challenges like multiple faces, occlusions, and lighting variations can lead to misrecognitions, indicating areas for further improvement.
Voice Greeting Function
Inquiry Module
The inquiry module primarily utilizes the iFlytek speech-to-text API. This API allows for the transcription of a local `.wav` audio file into text. For example, a local file named `input.wav` can be transcribed into text such as “我是猪八戒,我喜欢喝可乐” (translated: “I am Zhu Bajie, and I like drinking cola”). An example of this process is shown in the following figures.
The two images above illustrate the process of recording audio for family members and saving it locally as the `input.wav` audio file. The workflow is as follows:

- Audio Recording: Family members’ voices are recorded and saved as a `.wav` file (e.g., `input.wav`).
- Speech Transcription: The audio file is transcribed into text using the iFlytek API.
- Keyword Extraction: Keywords, such as names and preferences, are extracted from the transcribed text and saved locally in the `mine.txt` file.
- Image Naming: Collected facial images are linked to the corresponding names from the transcribed data and renamed accordingly.
This integrated approach ensures that each family member’s face and preferences are organized systematically for further use.
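The thesis does not state which library recorded the audio. As one possibility, the sketch below records a five-second, 16 kHz mono clip with PyAudio and writes it to `input.wav`, the file that is later sent to the iFlytek transcription API; the sample rate and duration are assumptions.

```python
import wave
import pyaudio

RATE, CHUNK, SECONDS = 16000, 1024, 5  # assumed 16 kHz mono, 5-second recording window

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                input=True, frames_per_buffer=CHUNK)
frames = [stream.read(CHUNK) for _ in range(int(RATE / CHUNK * SECONDS))]
stream.stop_stream()
stream.close()

# Write the captured frames to input.wav for transcription.
wf = wave.open("input.wav", "wb")
wf.setnchannels(1)
wf.setsampwidth(p.get_sample_size(pyaudio.paInt16))
wf.setframerate(RATE)
wf.writeframes(b"".join(frames))
wf.close()

p.terminate()
```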
The whole process
To evaluate the overall effectiveness of integrating all components, the following images illustrate the testing process for the entire task workflow:
Face Detection and Recognition:
As shown in Figure (a), when the target person enters the robot’s field of view, the robot captures images from the video feed via its camera. The system then performs face detection and recognition, extracting the facial information of the person and storing it in a local database. Once the facial information is obtained, the system issues a command to change the task state and transitions to the voice interaction module.

Voice Interaction:
In Figure (b), the microphone prompts the target individual with a question: “Hello, what is your name?” The person responds, and the system attempts to recognize the spoken name. If the name is not accurately recognized, the system either repeats the question or confirms the name by requiring the same name to be detected twice consecutively; if there is a mismatch, it continues to ask until two consistent results are obtained (a minimal sketch of this confirmation loop appears below). After obtaining the name, the system proceeds to inquire about the person’s favorite beverage. This process is similar to the name inquiry, and the results are displayed in Figure (c).

Saving Information to the Database:
Once all information is collected, the system saves it in the database. As shown in Figure (f), the database includes details such as the member’s ID, name, age, preferences, and facial data. The collected facial images are named according to the person’s name and saved locally.

Proposing Personalized Tasks:
After completing the inquiries, the robot addresses the person by name and makes a proposal, such as: “Wang Dapeng, would you like a cup of milk tea?” This question is formatted as: “Target name, would you like a cup of [favorite beverage]?” An example of this interaction is shown in Figure (d).

Re-encountering a Recognized Member:
When a previously recognized member reappears in the robot’s field of view, for example after registering “Wang Dapeng” and later encountering another person, “Liu Tao,” the system logs the new person’s information as new. However, when “Wang Dapeng” appears again, the system recognizes them as an existing member and generates an “old” message to indicate prior recognition. The system then repeats the personalized interaction, as it did during the first encounter, by asking: “Wang Dapeng, would you like a cup of milk tea?” The result of this process is shown in Figure (e).
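The “ask until two consecutive results agree” confirmation described above can be written as a small retry loop. The callable `ask_and_transcribe` is a hypothetical stand-in for playing the question through the speaker, recording the reply, and transcribing it with iFlytek; it is not a function from the project’s code.

```python
def confirm_by_repetition(ask_and_transcribe, question, max_attempts=5):
    """Repeat `question` until two consecutive transcriptions match, then return it."""
    previous = None
    for _ in range(max_attempts):
        answer = ask_and_transcribe(question)
        if answer and answer == previous:
            return answer  # two consistent results in a row
        previous = answer
    return None  # give up after too many mismatches

# Toy usage: a fake transcriber that always hears the same name.
print(confirm_by_repetition(lambda q: "Wang Dapeng", "Hello, what is your name?"))
```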
Improvements
As described above, this project presents the design of a simple home service robot system that performs several tasks. However, home service robots still face many limitations, which can be summarized as follows:
- Communication Delays: The communication delay between the face recognition and voice recognition modules is relatively long. In the future, the adoption of 5G transmission could significantly improve the speed of data transfer, especially when dealing with large datasets.
- Misidentification in Multi-Face Scenarios: When multiple faces appear in the robot’s field of view, the system may misidentify the target face. Future improvements in algorithms, such as the adoption of 3D face recognition, could greatly enhance the accuracy of identification.
- Limited System Functionality: The system designed in this study is relatively simple and performs only basic, single-purpose tasks. It is not yet suitable for complex home environments. In the future, home service robots will likely become smarter and more versatile, capable of completing a wide range of tasks quickly and efficiently.