Why Do You Need to Detect AI-Generated Components in Code?


Developers are increasingly using AI (Artificial Intelligence) in their code development processes. According to a recent CNBC article, Microsoft says that 20 to 30 percent of the software in its repositories today is developed using AI, Google says more than 25 percent of its code is developed using AI, and Meta claims that by next year (2026), half of its code will be developed using AI.

Embedded systems development is following the same trajectory, with developers utilizing AI-enabled commercial toolkits and open-source ML (machine learning) APIs, such as TensorFlow and scikit-learn.

AI is now being embedded in chipsets, sensors, and other OT devices to speed up and localize resource-heavy functions like sequencing, occupant monitoring, inspection, updates, and device integration, giving birth to a new technical term: Edge AI.
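
To make the Edge AI idea concrete, here is a minimal sketch of preparing a small model for a resource-constrained edge device using TensorFlow's TFLite converter (one of the open-source ML APIs mentioned above). The tiny model architecture, quantization choice, and file name are illustrative assumptions, not any particular vendor's workflow.

```python
# Minimal sketch: shrink a small Keras model for an edge device with
# TensorFlow Lite. The model and file names are illustrative only.
import tensorflow as tf

# A deliberately tiny classifier standing in for a real edge model.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(4, activation="softmax"),
])

# Convert to a compact TFLite flatbuffer with default optimizations
# (weight quantization), which shrinks the memory and storage footprint.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("edge_model.tflite", "wb") as f:
    f.write(tflite_model)
```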

For example, SiMa.ai is a six-year-old Silicon Valley company that delivers silicon, software, and platform solutions for the edge AI market. SiMa.ai provides a developer toolkit called Palette™, which aids developers in building AI applications for multiple embedded edge verticals, including automotive safety applications, autonomous robotic systems, medical systems, government defense systems, and other IoT devices.

The simplicity of Natural Language Processing (NLP) through LLMs (large language models) and Generative AI helps development teams jumpstart software development by making it faster, easier, and more automated. However, code development through natural language prompts (often called vibe coding) also makes code attribution and attestation more difficult, especially when it comes to porting critical origin, version, and other application data to Software Bills of Materials (SBOMs).
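
For context on what porting origin data into an SBOM can look like, here is a small sketch that emits a CycloneDX-style component record with a custom property flagging suspected AI-assisted origin. The property names, component details, and values are illustrative assumptions, not an official CycloneDX or CodeSentry convention.

```python
# Sketch: a CycloneDX-style SBOM component record carrying origin metadata,
# including a custom property flagging suspected AI-assisted code. The
# property names and values here are made up for illustration.
import json

component = {
    "type": "library",
    "name": "pcie-driver-baseline",          # illustrative component name
    "version": "0.3.1",
    "supplier": {"name": "Example Embedded Team"},
    "properties": [
        # CycloneDX allows custom name/value properties; these names are
        # illustrative, not a standardized taxonomy.
        {"name": "example:code-origin", "value": "ai-assisted"},
        {"name": "example:generation-tool", "value": "LLM (unattributed)"},
    ],
}

sbom = {
    "bomFormat": "CycloneDX",
    "specVersion": "1.5",
    "version": 1,
    "components": [component],
}

print(json.dumps(sbom, indent=2))
```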

In this interview, Gopal Hegde, senior vice president of engineering and operations at SiMa.ai, explains how AI enhances development of critical embedded systems and why AI-developed components are difficult to identify using traditional scanning methods.

Q: How are product developers using AI in embedded systems?

Gopal: Embedded platforms typically have limited memory, compute capacity, and storage capacity, which makes embedded development challenging. To jump-start embedded development, developers can leverage large language models to get baseline software; for example, they could ask for a display driver or a PCIe driver. They can then use this as a baseline and develop further. They can use tools built into their development systems, such as Visual C++ Copilot.

The beauty of the LLM is that it goes far beyond giving you open-source code to develop with. Say you have a specific embedded system and storage integration: you ask the LLM to give you code that fits your system requirements. So, it works better than pulling something together out of open source to build a piece of code, because it actually generates custom code for your application. It saves you time by giving you a framework versus having to do this from scratch.
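
As an illustration of that workflow, here is a hedged sketch of requesting baseline driver code from an LLM with the system constraints spelled out in the prompt. It assumes the OpenAI Python SDK and an illustrative model name; any code-capable LLM endpoint would follow the same pattern, and the constraints shown are made up.

```python
# Sketch: ask an LLM for baseline embedded code constrained to the target
# system. Assumes the OpenAI Python SDK with OPENAI_API_KEY set in the
# environment; the model name and system constraints are illustrative.
from openai import OpenAI

client = OpenAI()

constraints = (
    "Target: ARM Cortex-M7, 512 KB RAM, bare-metal, no dynamic allocation. "
    "Bus: PCIe Gen3 x1 endpoint."
)

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[
        {"role": "system", "content": "You write production-quality embedded C."},
        {"role": "user", "content": constraints + " Generate a baseline PCIe "
         "driver skeleton with init, read, and write entry points, and mark "
         "board-specific register addresses with TODO comments."},
    ],
)

# The returned text is only a starting point; it still needs review,
# testing, and attribution before it ships.
print(response.choices[0].message.content)
```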

This even goes for chip design. If you have a hundred functions to test your chip, you can use AI to give you five hundred more functional tests. So instead of developers manually writing the tests, the LLM can create new tests, which covers a lot more ground in a much shorter timeframe.
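
A rough sketch of that test-expansion idea follows. It reuses the same assumed LLM client, feeds a handful of existing functional tests as context, and asks for additional variants; the file names, model choice, and prompt wording are all illustrative.

```python
# Sketch: expand a seed set of functional tests by prompting an LLM for
# additional variants. File names, model, and prompt are illustrative.
from openai import OpenAI

client = OpenAI()

# Existing hand-written tests used as examples for the LLM to extend.
with open("seed_tests.sv", "r") as f:
    seed_tests = f.read()

prompt = (
    "Below are existing functional tests for a memory-controller block. "
    "Propose 20 additional tests covering boundary addresses, back-to-back "
    "transactions, and error-injection cases, in the same format.\n\n"
    + seed_tests
)

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative
    messages=[{"role": "user", "content": prompt}],
)

# Generated tests still need human review before they enter the test plan.
with open("generated_tests.sv", "w") as f:
    f.write(response.choices[0].message.content)
```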

Q: So how is AI used in SiMa.ai platforms?

Gopal: We use AI for development in various ways. Our developers use AI to create a starting point for their development. We also use it to improve our documentation. For example, we can feed the PDF of our user manual to an LLM using Retrieval-Augmented Generation (RAG), which augments the knowledge of the LLM. So, now, instead of reading through all that documentation, people can ask questions of the LLM, like, “Tell me how to build an object detection application.”

With that information fed into the LLM, you can also ask our LLM platform how to modify your application and make it more efficient, change the resolution of your cameras, and so on. This will then give you updated application code that customers can build on.
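
To illustrate the RAG pattern described above, here is a minimal sketch that ranks documentation passages with a TF-IDF index from scikit-learn and prepends the best matches to the user's question. The chunk text is invented, and ask_llm() is a hypothetical stand-in for whatever LLM endpoint is actually used.

```python
# Minimal RAG sketch: retrieve the most relevant manual passages and feed
# them to an LLM alongside the question. The chunks are invented, and
# ask_llm() is a hypothetical placeholder for the real LLM call.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Pretend these were extracted from the user-manual PDF and split into chunks.
chunks = [
    "To build an object detection application, start from the detection template.",
    "Camera resolution is configured in the application pipeline settings.",
    "Models can be quantized before deployment to reduce memory use.",
]

vectorizer = TfidfVectorizer()
chunk_matrix = vectorizer.fit_transform(chunks)

def ask_llm(prompt: str) -> str:
    # Hypothetical stand-in for the real LLM endpoint (hosted or local).
    raise NotImplementedError("wire this to your LLM provider")

def answer(question: str, top_k: int = 2) -> str:
    # Rank chunks by similarity to the question and keep the best matches.
    scores = cosine_similarity(vectorizer.transform([question]), chunk_matrix)[0]
    best = scores.argsort()[::-1][:top_k]
    context = "\n".join(chunks[i] for i in best)
    prompt = f"Answer using only this documentation:\n{context}\n\nQuestion: {question}"
    return ask_llm(prompt)

# Example: answer("Tell me how to build an object detection application.")
```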

Q: Many reports indicate that LLM-generated code is not ready for prime time, that it’s buggy, and that each time a developer prompts an LLM to build a program, it provides a different outcome.

Gopal: Yes, LLM-generated code often has bugs, so AI is still not intelligent enough to create code without testing and fine-tuning, but it does make a good starting point to develop new functionality, significantly speeding up the development process.  

Q: Is this why you need to identify AI-generated code in OT applications? And if so, how can Binary Composition Analysis make these identifications?

Gopal: Yes, you certainly want to identify AI-generated code in your software and firmware. This is important, as customers expect to hold someone accountable and responsible for the software delivered to them. Say you write code for a PCU that goes into an automobile. If that is machine-generated code, the engineer still needs to attribute ownership and competency of that embedded code, just like with any other code.

Q: Why is it more difficult to get that attribution with AI-generated code?

Gopal: AI-generated code is structured differently and has a distinct signature. When human-developed code goes to GitHub, it has a commit history: who changed what, when, and how it got changed. AI-generated code doesn’t have that commit history, is less documented, and is a lot more structured. That gives it away. AI doesn’t think like humans. We are messy and not very consistent, so how we develop our code and documentation is different than how AI does this. AI is more like clockwork, more repetitive.

Q: Can you provide some examples of specific signatures of machine-generated code in OT applications?

Gopal: Here are a few of the indicators I pulled from GitHub:

  • Unnaturally perfect syntax, such as consistent indentation and no deviations in style across the program
  • Lack of commit history and documentation, such as fewer commits, often large chunks of code committed at once with vague commit messages like “Updated files”
  • Repetitive patterns and a lack of creativity in variable and function names
  • Overly generic comments and naming conventions
  • Missing author and repository activity 
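
As a rough illustration, a few of the indicators above can be approximated with simple repository heuristics; the sketch below checks for vague commit messages and unusually uniform indentation. The phrase list and thresholds are arbitrary assumptions, and this is not how a commercial binary composition analysis product works.

```python
# Rough heuristic sketch for a couple of the indicators above: vague commit
# messages and unusually uniform indentation. Thresholds and phrase lists
# are arbitrary; real detection tools use far richer signals.
from collections import Counter

VAGUE_MESSAGES = {"updated files", "update", "changes", "fix", "misc"}

def vague_commit_ratio(commit_messages: list[str]) -> float:
    """Fraction of commits whose message is a generic placeholder."""
    if not commit_messages:
        return 0.0
    vague = sum(m.strip().lower() in VAGUE_MESSAGES for m in commit_messages)
    return vague / len(commit_messages)

def indentation_uniformity(source: str) -> float:
    """Share of indented lines that use the single most common indent width."""
    widths = [len(line) - len(line.lstrip(" "))
              for line in source.splitlines() if line.startswith(" ")]
    if not widths:
        return 0.0
    most_common_count = Counter(widths).most_common(1)[0][1]
    return most_common_count / len(widths)

def looks_machine_generated(commit_messages: list[str], source: str) -> bool:
    # Arbitrary example thresholds: mostly vague commits plus near-perfect
    # indentation consistency flag the component for human review.
    return (vague_commit_ratio(commit_messages) > 0.6
            and indentation_uniformity(source) > 0.95)
```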

The smartest AI engineers in our company are often those fresh out of college. They have no fear of using AI. They can use an LLM and understand how it arrived at an answer, which means they know what they’re doing, can improve productivity, and can still stand by their software.

Q: Why is accountability so important?

Gopal: It all hinges on the platform, the prompt engineering, and how you ask the LLM to develop the code. If the prompt is poorly written, it can be impossible to tell what the resulting code component is even doing; for example, the developer may be telling part of the prompt what to do while issuing the command, versus actually integrating that functionality into the UI.

The question is how to hold the people behind the application accountable. Getting accountability in AI code is very important to knowing you can trust that system. If you look at a self-driving car, you must have explainable AI. With financial transactions, you need explainability of how those transactions are conducted. In fact, you need the same, if not more, accountability and attestation with AI-developed code as you would if the system were 100 percent human-developed.

Resources: Learn how CodeSentry Binary Composition Analysis can detect AI components and port that information into a Software Bill of Materials (SBOM).
