Introspector Project

Thank you for visiting. The project is alive, and I am working on it actively. – jmd, 30 May 2006

There is a huge amount of data that has been collected by the various versions of the Introspector GCC data extraction tools.

Currently I am working on reducing the footprint of this archive.

You can find the latest news by checking out the Introspector Blog.

You can access more archive material on this project web server, which is organized by year/month/day or by project.

The years archived are:

You can download the latest releases and archive material at the SourceForge Project Page.

There is some older material on the inactive project Mailing List

You can chat with me when I am online at the IRC server irc://

Project Goals

The Introspector aims to modify, augment, analyse and enable the GNU programming toolchain and surrounding tools [gcc, binutils, make, bash, autotools, perl, sed, awk, m4, python, emacs] to serve as a unified, abstract and easy-to-use data source.

The users of this data source can connect to and communicate with these tools in a standard, neutral manner, reducing the accidental costs of programming. By using semantic web markup and inference rules, we can treat the GNU toolchain as one massive semantic web discovery and usage service.

The use of RDF, the Resource Description Framework, as the underlying representation is of essential importance.

RDF is the foundation of the next generation of the internet, the Semantic Web, in which plain hypertext links are replaced by meaningful references to typed resources.
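To make the representation concrete, here is a minimal sketch of what RDF statements about a single C function might look like when serialized as N-Triples. The namespace URI and property names are hypothetical illustrations, not the project's actual vocabulary:

```python
# Minimal sketch: emitting RDF triples (N-Triples syntax) that describe
# a C function the way an Introspector-style extractor might.
# The namespace URI and property names are hypothetical placeholders.

NS = "http://introspector.example.org/gcc#"  # hypothetical namespace

def triple(subject, predicate, obj):
    """Format one triple as an N-Triples line; non-URI objects become literals."""
    o = f'"{obj}"' if not obj.startswith("http") else f"<{obj}>"
    return f"<{subject}> <{predicate}> {o} ."

func = NS + "function/main"
lines = [
    triple(func, NS + "name", "main"),
    triple(func, NS + "returnType", NS + "type/int"),
    triple(func, NS + "definedInFile", "hello.c"),
]
for line in lines:
    print(line)
```

Each artifact the compiler knows about (functions, types, files) becomes a node, and every fact about it becomes one such triple.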

The goal is to create a super-large and extremely dense web of information about the outside world, extracted automatically from computer programs. Vast clusters of computing power will be needed to process all that information. In the end, this web could be used to create executable programs, because it should contain all the information that the compiler itself uses.

This data will have to be processable, sortable, searchable, and usable from any programming language, or directly by humans.

There is a need to unify the mass of software data and services that are available. Via this RDF toolchain we push aside the problem of understanding the software and concentrate on connecting enough information that the meaning can be inferred.

The goal is to connect this huge structured software information base to the huge body of unstructured natural language documents and artifacts, like bug reports and feature requests, found on the web.

The open source community has produced so many documents about its software that I feel we are reaching a critical mass of information, such that we can give meaning to web documents from the source code information.

The data collectors extract logical statements about programming artifacts into RDF, and web crawlers extract keywords and new words from web documents.

Natural language parse graphs can be extracted from open source natural language tools; these will also be stored in RDF and made available.

By applying Introspector data collection procedures to a given program, we will be able to identify the source of each byte written to an output file, and find the data structures and functions used to create it. The user will then be able to isolate and intercept the use of that data structure for data collection purposes. This requires a refactoring and pattern generation toolkit.
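The byte-level provenance idea can be sketched with a hypothetical wrapper that records which function produced each span of an output stream; the class and function names here are illustrative assumptions, not part of the actual toolkit:

```python
# Sketch of byte-level provenance: a hypothetical writer that remembers
# which function produced each span of the output file.

class ProvenanceWriter:
    """Wraps writes and records (start, end, producer) for each one."""
    def __init__(self):
        self.data = bytearray()
        self.spans = []          # (start, end, producing_function)

    def write(self, payload: bytes, producer: str):
        start = len(self.data)
        self.data += payload
        self.spans.append((start, len(self.data), producer))

    def source_of(self, offset: int) -> str:
        """Answer the question: which function wrote the byte at this offset?"""
        for start, end, producer in self.spans:
            if start <= offset < end:
                return producer
        raise IndexError(offset)

out = ProvenanceWriter()
out.write(b"\x7fELF", "write_elf_header")     # 4 bytes from one function
out.write(b"\x00\x01", "write_section_table") # 2 bytes from another
print(out.source_of(0))   # -> write_elf_header
print(out.source_of(5))   # -> write_section_table
```

Given such span records, intercepting the use of the underlying data structure reduces to looking up the producing function and instrumenting it.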

We want to be able to reference all significant functions and data of the Introspector data sources with special URIs.

Using standard vocabularies such as OWL, we can declare the structure of those functions and data.
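One way such special URIs could be minted is sketched below; the path layout (project/file/kind/name) and the domain are illustrative assumptions, not the project's actual scheme:

```python
# Hypothetical URI scheme for addressing program artifacts.
# The domain and path layout are assumptions for illustration.

from urllib.parse import quote

def artifact_uri(project, source_file, kind, name):
    """Build a stable, percent-encoded URI for a function or data structure."""
    parts = [quote(p, safe="") for p in (project, source_file, kind, name)]
    return "http://introspector.example.org/" + "/".join(parts)

uri = artifact_uri("gcc", "tree.c", "function", "build_decl")
print(uri)
```

Because every artifact gets a stable URI, OWL declarations and extracted triples can all point at the same node.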

Like a telephone switchboard that connects many parties who wish to talk to each other, the Introspector allows multiple consumers and producers of data about software to transfer information to each other quickly and painlessly. This is accomplished via streams of RDF data, and streams of raw data described by RDF meta-information. In fact, a document can be seen as a special and very large resource object.

This data about software, or meta-data, is read and written in and out of existing software tools via a standardized plug-in interface. It will be possible to automatically generate a new data extractor for another version of the software, given the source code and an existing version.

The Introspector interfaces into GCC provide all types of information about the software; the goal is to also mark up and handle software project management.

Each producer has its own flavour of data, and its own format for the data it stores about your software; the Introspector allows each piece of software to speak its native language.

The Introspector plug-ins act analogously to network cards on an Ethernet, allowing broadcast of the meta-data to the network of consumers who have subscribed for event notification.
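The broadcast model amounts to a publish/subscribe bus: producers emit metadata events and every subscribed consumer receives them, like frames on a shared medium. A minimal sketch, with hypothetical consumer names:

```python
# Sketch of the broadcast model: producers publish metadata events and
# every subscribed consumer receives them. Names are illustrative.

class MetaDataBus:
    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def broadcast(self, event):
        # Like an Ethernet frame, the event reaches every subscriber.
        for callback in self.subscribers:
            callback(event)

received = []
bus = MetaDataBus()
bus.subscribe(lambda e: received.append(("indexer", e)))
bus.subscribe(lambda e: received.append(("doc-tool", e)))
bus.broadcast({"kind": "function-declared", "name": "main"})
print(received)
```

The point of the analogy is that producers need not know who is listening; consumers opt in by subscribing.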

The internal tree.h data structures of the compiler are translated into OWL (Web Ontology Language) structures that describe the structure of the compiler's AST data. This tree describes your program abstractly to the compiler backend.

Further, the processing and parsing of this meta-data is driven by meta-programs, written in a special form of RDF, that give the Introspector engine instructions on how to efficiently traverse and extract relevant data, similar to XSLT and GraphPath.
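The instruction-driven traversal can be sketched as follows: a tiny "meta-program" (here just a list of predicates to follow, standing in for the RDF rule form) steers the engine through a graph. The graph contents and predicate names are illustrative:

```python
# Sketch of instruction-driven traversal: a list of predicates (standing
# in for the RDF meta-program form) steers the engine through the graph,
# in the spirit of XSLT templates or GraphPath expressions.

graph = {  # subject -> list of (predicate, object); contents illustrative
    "fn:main":   [("calls", "fn:helper"), ("declaredIn", "file:main.c")],
    "fn:helper": [("declaredIn", "file:util.c")],
}

def traverse(start, instructions):
    """Follow each predicate in turn, collecting the nodes reached."""
    frontier = [start]
    for predicate in instructions:
        frontier = [obj for node in frontier
                        for (p, obj) in graph.get(node, [])
                        if p == predicate]
    return frontier

# "From main, follow 'calls', then 'declaredIn'."
print(traverse("fn:main", ["calls", "declaredIn"]))  # -> ['file:util.c']
```

Because the traversal plan is data, not code, the engine itself stays generic and the plan can be shipped, stored, and reasoned about like any other RDF.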

This frees the introspector from containing any knowledge about copyrighted material. Currently this is being prototyped in Perl instead of creating a new language.

The introspector seeks to populate the semantic web with data about software and most importantly about free software because the sources are publicly available.

This first layer of meta-data describing the resources of free software will be the key to making intelligent programs (agents) that can reason about the software itself.

Examples of such agents are tools that can find relevant emails for lines of source code, because having a detailed and documented semantic model of the domain of free software is the key to processing the natural language texts about it.

This meta-data is to include all data collected about your software by the compiler, the make and build system, the Savannah/SourceForge project management and Debian packaging systems, the CVS changes, and the Mailman mailing list software.

The Introspector's scope was originally just the GCC C compiler, but has now expanded to include the extraction of meta-data from other compilers and interpreters, such as Perl, Bison, m4, Bash, C#, Java, C++, Fortran, Objective-C, Lisp, and Scheme. Various patches to the target systems will be officially submitted and unofficially maintained to allow the Introspector to extract data in a standard form.


The Introspector uses the excellent Redland RDF Application Framework API for parsing/serializing, storing/retrieving and querying/traversing the RDF data.

The Berkeley DB storage mechanism provides an efficient indexing system, and SWIG (the Simplified Wrapper and Interface Generator) provides native language interfaces for all major programming systems.

The Introspector GCC interface uses Redland to create repositories of data that can be processed by any tool needed.
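Redland itself exposes its store and query operations through the librdf C API; the stdlib sketch below is only a stand-in showing the same store-and-match pattern that a tool consuming an Introspector repository would rely on. The triples are illustrative:

```python
# Stand-in for the Redland store/query pattern: add triples, then match
# against a pattern where None acts as a wildcard. Not the librdf API,
# just the same shape of interaction, with illustrative data.

class TripleStore:
    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    def match(self, s=None, p=None, o=None):
        """Return triples matching the pattern; None matches anything."""
        return [t for t in self.triples
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)]

store = TripleStore()
store.add("fn:main", "hasParameter", "param:argc")
store.add("fn:main", "hasParameter", "param:argv")
store.add("fn:main", "returns", "type:int")
print(sorted(store.match(s="fn:main", p="hasParameter")))
```

Any tool that can issue such pattern matches can consume the repository, regardless of which compiler produced it.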

The current 4.* version implements a miniature RDF writer that creates a directory structure for each

Experiments have been made with compiling this graph data into arrays, so-called "ICE Cubes", that can be traversed even more quickly than the Redland database. I have also been able to compress those arrays into smarter vectors of data that fit nicely into the cache pages of a modern PC.
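The idea behind such frozen arrays can be sketched with a compressed-sparse-row layout: the graph's edges are packed into flat parallel arrays, so traversal becomes index arithmetic over dense, cache-friendly vectors instead of database lookups. The layout shown is a generic CSR sketch, not the actual ICE Cube format:

```python
# Sketch of the "ICE Cube" idea as a generic CSR (compressed sparse row)
# layout: edges packed into flat arrays so neighbor lookup is pure index
# arithmetic. This is an illustrative layout, not the real format.

from array import array

# Graph: node 0 -> 1, 2 ; node 1 -> 2 ; node 2 -> (no edges)
row_start = array("I", [0, 2, 3, 3])   # offsets into `targets`, per node
targets   = array("I", [1, 2, 2])      # edge targets, packed densely

def neighbors(node):
    """Slice out the edge list for one node: two loads and a slice."""
    return list(targets[row_start[node]:row_start[node + 1]])

print(neighbors(0))   # [1, 2]
print(neighbors(2))   # []
```

Because both arrays are contiguous, a traversal touches memory sequentially, which is what lets the compressed vectors sit in cache.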


The software is free software in the spirit of the GNU Manifesto and is revolutionary in the freedoms that it intends to grant its users. It is designed to be independent of the sources and consumers of the metadata that it provides, and to avoid creating dependencies on any one provider. The focus of the Introspector is enabling the users of the compiler, and tool builders, to access the data they need. Licensing is GPL for the core, with LGPL for the user-facing modules. The licensing of the ontologies and tree-traversing tools that may or may not be derived from the source code of the producers of the metadata is an open question.