With Daisykit – Everyone can build AI projects!

Imagine you are a software engineer or a DIY hobbyist with great ideas for AI-powered projects. However, building them is difficult when you know little or nothing about AI. The complexity of deep learning models is a barrier for anyone wishing to integrate AI services into their projects. Understanding this problem, we are designing and building an AI toolkit named Daisykit, focused on ease of deployment and accessibility for everyone.

After two months of development, we want to share our design, progress, and instructions for integrating Daisykit into AI tasks easily. Watch our demo video below to get a sense of Daisykit’s abilities.

https://youtu.be/zKP8sgGoFMc

Object detection, face detection with landmarks, human pose detection, background matting, fitness analyzers, and more: popular deep learning models are gradually being integrated into our toolkit, creating a magic wand that helps you add AI to your system without knowing much about its internal architecture. Daisykit will be a community-driven toolkit, which means all features will be voted on, developed, and served back to the community to ensure that as many people as possible benefit from our framework.

In this blog post, we introduce the detailed design of our system in part 1, some how-to-use Python examples in part 2, and deployment on mobile devices in part 3. If you want to try the toolkit immediately without going deep into the design, just skip the first part and jump into coding from the second part.

1. Design

Before settling on the idea of Daisykit, we surveyed the deep learning inference frameworks currently available. Besides hybrid frameworks for both training and inference such as TensorFlow, PyTorch, MXNet, and PaddlePaddle, there are frameworks tailor-made for inference, such as TensorRT, Intel OpenVINO, TFLite, CoreML, TFJS, and NCNN, which infer faster with fewer resources. Hybrid training-and-inference frameworks often have a large distribution size and many dependencies, so they may not be suitable for deployment in many cases. By taking advantage of hardware-specific instructions and optimizations, inference-focused engines often give better performance. However, real-world AI systems are usually media processing pipelines in which the AI models are only small steps. Pipeline inference frameworks such as NVIDIA DeepStream, NNStreamer, and Google Mediapipe came into play to solve the optimization problem for these complex systems, where AI models and other processing steps can be organized to run in parallel with an optimal amount of resources. Daisykit is our effort to build such a pipeline framework for deep learning, with good performance across a wide range of hardware and easy-to-use interfaces.

Development plan

Our development plan is inspired largely by Google Mediapipe. We use C++ to develop a core SDK, which contains media processing algorithms, model inference, and graph APIs for concurrency. Currently, our framework supports models with the NCNN and OpenCV DNN engines; we will add other engines in the future to maximize compatibility and reduce the effort of model conversion. On top of the Daisykit core SDK, we develop wrappers and sample applications for different platforms such as desktop computers, embedded systems, mobile devices, and web browsers. C++ and Python wrappers and Android examples are now available in our repositories. Although we are focusing on model deployment and system architecture in the current phase of the project, we plan to provide training code, tutorials, and a mechanism for model distribution and monitoring in the future.

Development plan for Daisykit

Graph-based design for concurrent flow

Many concurrent frameworks take advantage of a graph architecture to build processing pipelines. ROS is a popular robotics framework in which each node is a processing unit that communicates with others via inter-process communication (IPC). In the world of media processing, GStreamer constructs a processing graph of media components with different operations. NNStreamer and NVIDIA DeepStream build on GStreamer by providing plugins that run AI operations. DeepStream has been very successful on NVIDIA hardware; however, it is not open source and cannot be ported to non-NVIDIA hardware. NNStreamer and DeepStream are also limited by what can be expressed as a GStreamer plugin. OpenCV 4.0 has a Graph API as well, but it is limited to image processing applications. Google Mediapipe learned from the other frameworks on the market to architect a system flexible enough to handle multiple types of data while maintaining high performance and multi-platform support. Although pre-trained models from Google often have high accuracy and excellent inference speed, the Mediapipe framework only supports Google engines like TFLite. Daisykit is inspired by the Mediapipe architecture; however, we want to build an open framework that integrates different inference engines and pre-trained models from various sources. That will maximize the ease of deployment and the number of models people can use in their projects.

The connection between 2 processing nodes

Above is an illustration of the connection between two processing nodes, where each node handles one processing task. For example, in the facial landmark detection flow, we consider face detection and landmark regression as two separate nodes. Connecting the face detection node before the landmark regression node indicates that the result of face detection is used as the input for facial landmark regression. Each Node in the graph has multiple input and output connections, in_connections_ and out_connections_ respectively, and data processing is handled by the worker_thread_ defined in the Node. Each element of in_connections_ and out_connections_ is a Connection instance, which keeps a transmission queue between the two nodes. Each transmission queue holds Packet instances, which are wrappers around the data being processed, for example images or results from the previous step. The transmission is lightweight because a Packet only keeps a pointer to the data, not the data itself. The connection between two nodes is controlled by a TransmissionProfile, which defines the maximum queue size, packet-dropping behavior, and other transmission policies. You can find the interfaces for these concepts and their source code in the Daisykit library (headers, source). The implementation of each graph is currently available in C++.

You can find an example of the face detector graph here. We first create the separate nodes, connect them with Graph::Connect(), activate the processing threads with Node::Activate(), and then feed data into the graph.

// Create processing nodes
std::shared_ptr<nodes::PacketDistributorNode> packet_distributor_node =
    std::make_shared<nodes::PacketDistributorNode>("packet_distributor",
                                                   NodeType::kAsyncNode);
std::shared_ptr<nodes::FaceDetectorNode> face_detector_node =
    std::make_shared<nodes::FaceDetectorNode>(
        "face_detector",
        "models/face_detection/yolo_fastest_with_mask/"
        "yolo-fastest-opt.param",
        "models/face_detection/yolo_fastest_with_mask/"
        "yolo-fastest-opt.bin",
        NodeType::kAsyncNode);
std::shared_ptr<nodes::FacialLandmarkDetectorNode>
    facial_landmark_detector_node =
        std::make_shared<nodes::FacialLandmarkDetectorNode>(
            "facial_landmark_detector",
            "models/facial_landmark/pfld-sim.param",
            "models/facial_landmark/pfld-sim.bin", NodeType::kAsyncNode);
std::shared_ptr<nodes::FaceVisualizerNode> face_visualizer_node =
    std::make_shared<nodes::FaceVisualizerNode>("face_visualizer",
                                                NodeType::kAsyncNode, true);

// Create connections between nodes
Graph::Connect(nullptr, "", packet_distributor_node.get(), "input",
               TransmissionProfile(2, true), true);

Graph::Connect(packet_distributor_node.get(), "output",
               face_detector_node.get(), "input",
               TransmissionProfile(2, true), true);

Graph::Connect(packet_distributor_node.get(), "output",
               facial_landmark_detector_node.get(), "image",
               TransmissionProfile(2, true), true);
Graph::Connect(face_detector_node.get(), "output",
               facial_landmark_detector_node.get(), "faces",
               TransmissionProfile(2, true), true);

Graph::Connect(packet_distributor_node.get(), "output",
               face_visualizer_node.get(), "image",
               TransmissionProfile(2, true), true);
Graph::Connect(facial_landmark_detector_node.get(), "output",
               face_visualizer_node.get(), "faces",
               TransmissionProfile(2, true), true);

// Nodes must be initialized before use.
// Activate() also starts the worker threads of asynchronous nodes.
packet_distributor_node->Activate();
face_detector_node->Activate();
facial_landmark_detector_node->Activate();
face_visualizer_node->Activate();

// Feed camera frames into the graph as packets
VideoCapture cap(0);
while (1) {
    Mat frame;
    cap >> frame;
    cv::cvtColor(frame, frame, cv::COLOR_BGR2RGB);
    std::shared_ptr<Packet> in_packet = Packet::MakePacket<cv::Mat>(frame);
    packet_distributor_node->Input("input", in_packet);
}

The connection between nodes in the face detection graph

In our experiments, the multithreaded graph gives a better frame rate than sequential processing. However, the face detection graph may not be constructed optimally yet. In the next phase of improvements, we want to learn from Mediapipe and add a loopback connection to FacialLandmarkDetectorNode so that landmark results can be used to track faces, reducing the dependency on the speed of FaceDetectorNode.

The concurrency architecture is handled by the internal code of Daisykit. Note that these graph APIs are still experimental and only available in C++ for now. In the future, we will refine them into easier-to-use interfaces, support constructing graphs from configuration files, and add other language wrappers. We would really like your comments and contributions on the concurrency design and implementation of Daisykit.

2. Deploy AI in your systems with just a few lines of code

Python is a beginner-friendly language and is widely used in DIY and software projects, which is why we chose Python as the first language to focus on. This section introduces some examples of how Daisykit can be applied to AI tasks.

Currently, Daisykit supports six models from different sources. See the details about each model and the full list of supported models here.

For each model, we have inference code, and some models have links to training source code. Please don’t worry about the limited number of models for now: we are adding more and more models so that you only have to select and run them, and we will also provide training code and tutorials for as many models as possible.

Install Daisykit for Python

We currently have prebuilt Python packages (CPU only) for Linux x86_64 and Windows x86_64. For other environments or GPU support, you need to install OpenCV C++ and Vulkan, and build the Python package from source.

Install on Ubuntu:

sudo apt install pybind11-dev libopencv-dev libvulkan-dev # Dependencies
pip3 install --upgrade pip # Ensure pip is updated
pip3 install daisykit

Install on Windows:

pip3 install daisykit

For ease of environment setup, we prepared a Google Colab notebook with all the demo applications. However, due to the limitations of the Colab environment, the demos there run on images only. Link to the Colab notebook: https://colab.research.google.com/drive/1LFg3xcoFr3wxuJmn3c4LEJiW2G7oP7F5?usp=sharing.

Face detection flow: Detect faces + landmark

The face detector flow in Daisykit contains a face detection model based on YOLO Fastest and a facial landmark detection model based on PFLD. In addition, to encourage makers to join the fight against COVID-19, we selected a face detection model that can also recognize whether people are wearing face masks.

 

Let’s look at the source code below to understand how to run this model with your webcam. First, we initialize the flow with a config dictionary containing information about the models used in the flow. The get_asset_file() function automatically downloads the models and weights from https://github.com/Daisykit-AI/daisykit-assets, so you don’t have to download them manually. The download only happens the first time you use the models; after that, you can run this code offline. You can also download all the files yourself and put the file paths into the config instead of get_asset_file(<assets address>), as in the short sketch below. If you don’t need facial landmark output, set with_landmark to False.
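Here is a minimal sketch of that offline variant; the local paths are hypothetical placeholders, and every other field stays the same as in the full example:

offline_config = {
    "face_detection_model": {
        "model": "local_models/yolo-fastest-opt.param",    # hypothetical local path
        "weights": "local_models/yolo-fastest-opt.bin",    # hypothetical local path
        "input_width": 320,
        "input_height": 320,
        "score_threshold": 0.7,
        "iou_threshold": 0.5,
        "use_gpu": False
    },
    "with_landmark": True,
    "facial_landmark_model": {
        "model": "local_models/pfld-sim.param",            # hypothetical local path
        "weights": "local_models/pfld-sim.bin",            # hypothetical local path
        "input_width": 112,
        "input_height": 112,
        "use_gpu": False
    }
}

And here is the full webcam example: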

import cv2
import json
from daisykit.utils import get_asset_file, to_py_type
import daisykit

config = {
    "face_detection_model": {
        "model": get_asset_file("models/face_detection/yolo_fastest_with_mask/yolo-fastest-opt.param"),
        "weights": get_asset_file("models/face_detection/yolo_fastest_with_mask/yolo-fastest-opt.bin"),
        "input_width": 320,
        "input_height": 320,
        "score_threshold": 0.7,
        "iou_threshold": 0.5,
        "use_gpu": False
    },
    "with_landmark": True,
    "facial_landmark_model": {
        "model": get_asset_file("models/facial_landmark/pfld-sim.param"),
        "weights": get_asset_file("models/facial_landmark/pfld-sim.bin"),
        "input_width": 112,
        "input_height": 112,
        "use_gpu": False
    }
}

face_detector_flow = daisykit.FaceDetectorFlow(json.dumps(config))

# Open video stream from webcam
vid = cv2.VideoCapture(0)

while True:

    # Capture the video frame
    ret, frame = vid.read()

    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

    faces = face_detector_flow.Process(frame)
    # for face in faces:
    #     print([face.x, face.y, face.w, face.h,
    #           face.confidence, face.wearing_mask_prob])
    face_detector_flow.DrawResult(frame, faces)

    frame = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)

    # Convert faces to Python list of dict
    faces = to_py_type(faces)

    # Display the resulting frame
    cv2.imshow('frame', frame)

    # Press 'q' to quit
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# After the loop, release the capture object
vid.release()
# Destroy all the windows
cv2.destroyAllWindows()

All flows in Daisykit are initialized with a configuration string, so we use json.dumps() to convert the config dictionary into a string. For example, in the face detector flow:

face_detector_flow = daisykit.FaceDetectorFlow(json.dumps(config))

Run the flow to get the detected faces:

faces = face_detector_flow.Process(frame)

You can use the DrawResult() method to visualize the result, or write a drawing function yourself. This AI flow can be used in DIY projects such as a smart COVID-19 camera or Snapchat-like camera decorators.
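If you prefer to draw the results yourself, here is a minimal sketch of a custom visualizer. It assumes x and y are the top-left corner of the box, as suggested by the commented-out print in the example above; the colors and the 0.5 threshold are arbitrary choices, not part of the Daisykit API.

import cv2

def draw_faces(frame, faces):
    # Draw one rectangle per detected face: green if a mask is likely, red otherwise.
    for face in faces:
        x, y, w, h = int(face.x), int(face.y), int(face.w), int(face.h)
        color = (0, 255, 0) if face.wearing_mask_prob > 0.5 else (255, 0, 0)  # frame is RGB here
        cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
        cv2.putText(frame, "{:.2f}".format(face.confidence), (x, y - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 1)
    return frame

# Use it in place of face_detector_flow.DrawResult(frame, faces):
# draw_faces(frame, faces)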

Human pose detection flow

The human pose detector module contains an SSD-MobileNetV2 body detector and a ported Google MoveNet model for human keypoints. This module can be applied in fitness applications and AR games.

 

Source code:

import cv2
import json
from daisykit.utils import get_asset_file, to_py_type
from daisykit import HumanPoseMoveNetFlow

config = {
    "person_detection_model": {
        "model": get_asset_file("models/human_detection/ssd_mobilenetv2.param"),
        "weights": get_asset_file("models/human_detection/ssd_mobilenetv2.bin"),
        "input_width": 320,
        "input_height": 320,
        "use_gpu": False
    },
    "human_pose_model": {
        "model": get_asset_file("models/human_pose_detection/movenet/lightning.param"),
        "weights": get_asset_file("models/human_pose_detection/movenet/lightning.bin"),
        "input_width": 192,
        "input_height": 192,
        "use_gpu": False
    }
}

human_pose_flow = HumanPoseMoveNetFlow(json.dumps(config))

# Open video stream from webcam
vid = cv2.VideoCapture(0)

while True:

    # Capture the video frame
    ret, frame = vid.read()

    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

    poses = human_pose_flow.Process(frame)
    human_pose_flow.DrawResult(frame, poses)

    frame = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)

    # Convert poses to Python list of dict
    poses = to_py_type(poses)

    # Display the result frame
    cv2.imshow('frame', frame)

    # Press 'q' to exit
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

Background matting flow

Background matting uses a single segmentation model to generate a human body mask, which indicates which pixels belong to the person and which belong to the background. This output can be used for background replacement (as in the Google Meet app). The segmentation model was taken from this implementation by nihui, the author of the NCNN framework, who also has a webpage with a live demo running in web browsers.

https://github.com/nihui/ncnn-webassembly-portrait-segmentation.

Source code:

import cv2
import json
from daisykit.utils import get_asset_file
from daisykit import BackgroundMattingFlow

config = {
    "background_matting_model": {
        "model": get_asset_file("models/background_matting/erd/erdnet.param"),
        "weights": get_asset_file("models/background_matting/erd/erdnet.bin"),
        "input_width": 256,
        "input_height": 256,
        "use_gpu": False
    }
}

# Load background
default_bg_file = get_asset_file("images/background.jpg")
background = cv2.imread(default_bg_file)
background = cv2.cvtColor(background, cv2.COLOR_BGR2RGB)

background_matting_flow = BackgroundMattingFlow(json.dumps(config), background)

# Open video stream from webcam
vid = cv2.VideoCapture(0)

while True:

    # Capture the video frame
    ret, frame = vid.read()

    image = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

    mask = background_matting_flow.Process(image)
    background_matting_flow.DrawResult(image, mask)

    image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)

    # Display the original and the result frames
    cv2.imshow('frame', frame)
    cv2.imshow('result', image)

    # Press 'q' to exit
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

In the source code, we use get_asset_file("images/background.jpg") to download the default background. You can use another image as the background by replacing it with a path to an image file.
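If you want to do the compositing yourself instead of calling DrawResult(), the rough sketch below shows one way to blend the frame with a new background using the mask returned by Process(). It assumes the mask is a single-channel map where larger values mean "person"; adjust the normalization if the actual range differs.

import cv2
import numpy as np

def replace_background(frame_rgb, mask, background_rgb):
    # Resize mask and background to the frame size, then blend per pixel.
    h, w = frame_rgb.shape[:2]
    m = cv2.resize(mask.astype(np.float32), (w, h))
    if m.max() > 1.0:          # normalize 0-255 masks to the 0-1 range
        m = m / 255.0
    m = m[..., None]           # shape (h, w, 1) so it broadcasts over RGB channels
    bg = cv2.resize(background_rgb, (w, h)).astype(np.float32)
    out = frame_rgb.astype(np.float32) * m + bg * (1.0 - m)
    return out.astype(np.uint8)

# Example usage inside the loop above:
# composited = replace_background(image, mask, background)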

Hand pose detection flow

The hand pose detection flow comprises two models: a hand detection model based on YOLOX and a 3D hand pose estimation model released by Google this November. Thanks to FeiGeChuanShu for the early model conversion.

 

This hand pose flow can be used in AR games, hand gesture control, and many cool DIY projects.

Source code:

import cv2
import json
from daisykit.utils import get_asset_file, to_py_type
from daisykit import HandPoseDetectorFlow

config = {
    "hand_detection_model": {
        "model": get_asset_file("models/hand_pose/yolox_hand_swish.param"),
        "weights": get_asset_file("models/hand_pose/yolox_hand_swish.bin"),
        "input_width": 256,
        "input_height": 256,
        "score_threshold": 0.45,
        "iou_threshold": 0.65,
        "use_gpu": False
    },
    "hand_pose_model": {
        "model": get_asset_file("models/hand_pose/hand_lite-op.param"),
        "weights": get_asset_file("models/hand_pose/hand_lite-op.bin"),
        "input_size": 224,
        "use_gpu": False
    }
}

flow = HandPoseDetectorFlow(json.dumps(config))

# Open video stream from webcam
vid = cv2.VideoCapture(0)

while True:

    # Capture the video frame
    ret, frame = vid.read()

    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

    poses = flow.Process(frame)
    flow.DrawResult(frame, poses)

    frame = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)

    # Convert poses to Python list of dict
    poses = to_py_type(poses)

    # Display the result frame
    cv2.imshow('frame', frame)

    # Press 'q' to exit
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

In the source code above, the input_width and input_height of hand_detection_model can be adjusted to trade speed for accuracy, as in the short sketch below.
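For example, a quick sketch of lowering the detector resolution before creating the flow; it reuses the config dictionary from the example above, and the value 192 is only illustrative:

config["hand_detection_model"]["input_width"] = 192   # default in the example: 256
config["hand_detection_model"]["input_height"] = 192
flow = HandPoseDetectorFlow(json.dumps(config))       # smaller input: faster, possibly less accurate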

Barcode detection

Barcodes are used in a wide range of robotics and software applications, which is why we integrated a barcode reader into Daisykit. The core algorithms of the barcode reader come from the Zxing-CPP project. This barcode processor can read QR codes and barcodes in many different formats.

Source code:

import cv2
import json
from daisykit.utils import get_asset_file
from daisykit import BarcodeScannerFlow

config = {
    "try_harder": True,
    "try_rotate": True
}

barcode_scanner_flow = BarcodeScannerFlow(json.dumps(config))

# Open video stream from webcam
vid = cv2.VideoCapture(0)

while True:

    # Capture the video frame
    ret, frame = vid.read()

    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

    result = barcode_scanner_flow.Process(frame, draw=True)

    frame = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)

    # Display the result frame
    cv2.imshow('frame', frame)

    # Press 'q' to exit
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

Object detection

A general-purpose object detector based on YOLOX is also integrated into Daisykit. The models are trained on the COCO dataset using the official YOLOX repository. You can retrain the model on your own dataset and convert it to NCNN format, which can then be integrated into Daisykit easily.

Source code:

import cv2
import json
from daisykit.utils import get_asset_file, to_py_type
from daisykit import ObjectDetectorFlow

config = {
    "object_detection_model": {
        "model": get_asset_file("models/object_detection/yolox-tiny.param"),
        "weights": get_asset_file("models/object_detection/yolox-tiny.bin"),
        "input_width": 416,
        "input_height": 416,
        "score_threshold": 0.5,
        "iou_threshold": 0.8,
        "use_gpu": False,
        "class_names": [
            "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat", "traffic light",
            "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse", "sheep", "cow",
            "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie", "suitcase", "frisbee",
            "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove", "skateboard", "surfboard",
            "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon", "bowl", "banana", "apple",
            "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut", "cake", "chair", "couch",
            "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse", "remote", "keyboard", "cell phone",
            "microwave", "oven", "toaster", "sink", "refrigerator", "book", "clock", "vase", "scissors", "teddy bear",
            "hair drier", "toothbrush"
        ]
    }
}

flow = ObjectDetectorFlow(json.dumps(config))

# Open video stream from webcam
vid = cv2.VideoCapture(0)

while True:

    # Capture the video frame
    ret, frame = vid.read()

    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

    detections = flow.Process(frame)
    flow.DrawResult(frame, detections)

    frame = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)

    # Convert detections to Python list of dict
    detections = to_py_type(detections)

    # Display the result frame
    cv2.imshow('frame', frame)

    # Press 'q' to exit
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

If you have any difficulty running the examples above, we have prepared a Colab notebook so you can try them without setting up a local environment.

Access our Colab notebook at: https://colab.research.google.com/drive/1LFg3xcoFr3wxuJmn3c4LEJiW2G7oP7F5#scrollTo=2BLn9OfaQQtM
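If you want a quick test without a webcam (for example inside Colab), you can run a flow on a still image instead of a video stream. Below is a small sketch using the barcode scanner flow; the image paths are placeholders.

import cv2
import json
from daisykit import BarcodeScannerFlow

flow = BarcodeScannerFlow(json.dumps({"try_harder": True, "try_rotate": True}))

image = cv2.imread("my_image.jpg")              # placeholder path to your test image
rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)    # Daisykit flows expect RGB frames
result = flow.Process(rgb, draw=True)           # draw=True draws the result onto the frame
cv2.imwrite("result.jpg", cv2.cvtColor(rgb, cv2.COLOR_RGB2BGR))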

3. Deployment for mobile phone

Besides Python, the Daisykit team has also developed examples for mobile phones to cover as many use cases as possible. Access our Android repository here. We provide detailed instructions for setting up and running the project with Android Studio in the README file, in three steps:

  • Clone the repository with all submodules
  • Download the prebuilt OpenCV and NCNN libraries and put them in the right locations
  • Open the project with Android Studio, then build and run the examples

There are also six demo flows in the Android example. We integrated all of them into a single mobile app for convenience.

https://www.youtube.com/watch?v=8iP1pU0JxHE

4. Conclusion

Although Daisykit is still in an active design and development phase, we can already see some promising results. We will keep learning from other frameworks like NVIDIA DeepStream and Google Mediapipe to gradually improve Daisykit in terms of model quality and inference speed. In the future, we also plan to provide training source code, tutorials, and a module for model distribution and monitoring. We understand that there are still many challenges ahead, but we will try our best to deliver an AI framework for everyone and help people build their own AI projects with ease.

We hope that you have found something useful here for your next projects. We are here to hear from you and support you in applying Daisykit to your next great ideas. You can also dive into the source code and architecture by exploring the Daisykit repositories.
