The lack of an automated way to convert a codebase into documentation creates problems of time, accuracy, and code comprehension. Developers, especially on fast-moving teams, often neglect documentation, and that neglect accumulates into severe technical debt. Because writing technical documentation is hard and existing tools are limited or expensive, there is a clear need for comprehensive, automatic documentation generation.
For any inquiries or feedback, you can reach out to the Team Leader at [email protected].
- As our current project does not involve storing data in databases, we can omit the creation of a UML diagram for now.
Our prototype offers a seamless solution to transform a full codebase into comprehensive developer documentation in just one step. By uploading a zip file containing the codebase, you can let the magic happen. The resulting documentation includes function explanations, API specs, table schemas, and dependencies, all in Markdown format.
To power our documentation generation, we leverage the capabilities of GPT-3.5. This advanced language model enables us to produce accurate and contextually relevant documentation for the given codebase.
- Codebase Traversal: The process begins by traversing the codebase tree to access its contents.
- Code Embeddings with CodeBERT: To extract meaningful information from the code, we employ Microsoft's CodeBERT for code embeddings. However, CodeBERT cannot handle large code files effectively.
- Handling Large Code Files: To overcome CodeBERT's limitations on large code files, we devised our own algorithm that tokenizes in a window-like manner. By specifying a window size and an overlap "region," we preserve essential context and generate an embedding for the file by averaging the embeddings produced for each window (see the sketch after this list).
- Maintaining Context with Agglomerative Clustering: To preserve context across the codebase, we use Agglomerative Clustering. This technique groups "similar" code files with shared semantic meaning and features, improving the quality of the generated documentation.
- Efficient Documentation Generation: After clustering, we concatenate the code files belonging to the same cluster. The concatenated code is then sent to GPT-3.5 using efficient prompt engineering. The generated documentation provides comprehensive insight into the codebase.
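A minimal sketch of the traversal and windowed-embedding steps, assuming the Hugging Face `transformers` checkpoint `microsoft/codebert-base`; the window and overlap sizes here are illustrative, not the exact values used in the prototype:

```python
import os
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative values; the prototype's actual window/overlap sizes may differ.
WINDOW_SIZE = 512   # CodeBERT's maximum input length in tokens
OVERLAP = 64        # tokens shared between consecutive windows to keep context

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

def embed_file(source: str) -> torch.Tensor:
    """Embed a (possibly large) code file by averaging window embeddings."""
    tokens = tokenizer.encode(source, add_special_tokens=False) or [tokenizer.pad_token_id]
    stride = WINDOW_SIZE - OVERLAP
    windows = [tokens[i:i + WINDOW_SIZE] for i in range(0, len(tokens), stride)]
    embeddings = []
    for window in windows:
        inputs = torch.tensor([window])
        with torch.no_grad():
            output = model(inputs)
        # Mean-pool the last hidden state to get one vector per window.
        embeddings.append(output.last_hidden_state.mean(dim=1).squeeze(0))
    return torch.stack(embeddings).mean(dim=0)  # average across windows

def embed_codebase(root: str) -> dict[str, torch.Tensor]:
    """Walk the codebase tree and embed every readable source file."""
    vectors = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as f:
                    vectors[path] = embed_file(f.read())
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
    return vectors
```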
Our prototype streamlines the documentation process by converting a full codebase into developer documentation in a single step. Leveraging GPT-3.5, we produce accurate and contextually relevant documentation, addressing the challenges of manual documentation processes. The resulting documentation enhances code comprehension, reduces technical debt, and improves code maintainability for software development teams.
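To illustrate the final step, here is a hedged sketch of how a cluster's concatenated code might be sent to GPT-3.5 through the OpenAI chat API; the prompt wording is an assumption, not the prototype's exact prompt:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical prompt; the prototype's actual prompt engineering may differ.
SYSTEM_PROMPT = (
    "You are a senior engineer writing developer documentation. "
    "Given source files, produce Markdown covering function explanations, "
    "API specs, table schemas, and dependencies."
)

def document_cluster(concatenated_code: str) -> str:
    """Ask GPT-3.5 to document one cluster of related code files."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": concatenated_code},
        ],
        temperature=0.2,  # keep the output factual and consistent
    )
    return response.choices[0].message.content
```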
List of technologies used to build the prototype:
- Frontend: Next.js
- Backend: FastAPI
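For context on how the pieces connect, a minimal sketch of a FastAPI endpoint that accepts the zip upload described above; the route name and response shape are hypothetical, not the prototype's actual API:

```python
import tempfile
import zipfile
from pathlib import Path

from fastapi import FastAPI, UploadFile

app = FastAPI()

@app.post("/generate-docs")  # hypothetical route name
async def generate_docs(codebase: UploadFile):
    """Accept a zipped codebase and return generated Markdown docs."""
    with tempfile.TemporaryDirectory() as workdir:
        archive = Path(workdir) / "codebase.zip"
        archive.write_bytes(await codebase.read())
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(workdir)
        # Downstream: embed files, cluster them, and prompt GPT-3.5
        # (see the pipeline sketches elsewhere in this README).
        return {"markdown": "...generated documentation..."}
```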
To clone and run the prototype for testing and analysis, follow the instructions below:
- Set up a Python virtual environment:

  ```bash
  python -m venv venv
  ```

- Activate the virtual environment:
  - For Windows:

    ```bash
    venv\Scripts\activate
    ```

  - For macOS and Linux:

    ```bash
    source venv/bin/activate
    ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Run the backend server:

  ```bash
  uvicorn main:app --reload
  ```

- Install Node.js dependencies:

  ```bash
  npm install
  ```

- Run the development server for the frontend:

  ```bash
  npm run dev
  ```
- Documentation of ComicifyAI:
  - Input repo: https://github.com/ayush4345/Comicify.ai
  - Output Docs:
- Documentation of Cluboard:
  - Input repo: https://github.com/mittal-parth/Cluboard/
  - Output Docs:
- Handling Large Code Files: We faced a challenge with CodeBERT's inability to process large code files. To overcome this, we devised an algorithm that tokenizes in a window-like manner, maintaining context through a specified window size and overlap region. We then averaged the embeddings produced for each window to formulate our own embeddings for large files, addressing the issue of context preservation.
- Agglomerative Clustering for Context Maintenance: To keep context across the codebase, we used Agglomerative Clustering, which groups "similar" code files that share semantic meaning and features. Concatenating the code files within each cluster, we sent them to GPT-3.5 with efficient prompt engineering to generate comprehensive documentation (a clustering sketch follows this list).
- Persistence and Perseverance: Despite difficulties with the clustering functionality, we persevered and kept trying different approaches until it worked. Our persistence paid off: the successful implementation of clustering significantly improved the prototype's performance.
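A minimal sketch of the clustering step, assuming scikit-learn's `AgglomerativeClustering` over the file embeddings produced earlier; the distance threshold is illustrative and would need tuning:

```python
import numpy as np
import torch
from sklearn.cluster import AgglomerativeClustering

def cluster_codebase(vectors: dict[str, torch.Tensor]) -> dict[int, list[str]]:
    """Group semantically similar code files via agglomerative clustering."""
    paths = list(vectors)
    matrix = np.stack([vectors[p].numpy() for p in paths])
    clustering = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=1.0,  # illustrative; tune per codebase
        metric="cosine",
        linkage="average",
    ).fit(matrix)
    clusters: dict[int, list[str]] = {}
    for path, label in zip(paths, clustering.labels_):
        clusters.setdefault(int(label), []).append(path)
    return clusters

# Each cluster's files are then concatenated and sent to GPT-3.5, e.g.:
# docs = [document_cluster("\n\n".join(open(p).read() for p in files))
#         for files in cluster_codebase(embed_codebase("repo/")).values()]
```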