ComputationalReflection/PythonSourceCodeAnalysis

PythonSourceCodeAnalysis

Python Source Code Analysis is a program designed to extract syntactic information from Python programs.

This information is stored in a relational database, where it can be analyzed with data mining algorithms.

Purpose

This program converts the graph-like information obtained with the ast module from the Python Standard Library (PSL) into n-dimensional vectors stored in a relational database. This process creates a dataset with 16 homogeneous tables. The conversion allows classic data mining algorithms to work with the dataset. These algorithms can extract information such as the most and least used syntactic elements, outlier syntactic patterns, and association rules.
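The conversion described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the program's actual implementation: the table name `nodes`, its four columns, and the helper names `node_rows` and `store` are hypothetical, and a single table stands in for the 16 tables of the real dataset. Only `ast` and `sqlite3` from the standard library are used.

```python
import ast
import sqlite3

# Hypothetical input program; any Python source string works here.
SOURCE = """
def add(a, b):
    return a + b

for i in range(3):
    print(add(i, 1))
"""


def node_rows(source, program_id):
    """Flatten the AST of `source` into one fixed-width row per node."""
    tree = ast.parse(source)
    rows = []
    for node in ast.walk(tree):
        rows.append((
            program_id,                                   # which program the node belongs to
            type(node).__name__,                          # syntactic element, e.g. "FunctionDef"
            getattr(node, "lineno", None),                # line number, when the node has one
            sum(1 for _ in ast.iter_child_nodes(node)),   # number of direct children
        ))
    return rows


def store(rows):
    """Store the rows in an in-memory SQLite table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE nodes ("
        "program_id INTEGER, node_type TEXT, lineno INTEGER, n_children INTEGER)"
    )
    conn.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


conn = store(node_rows(SOURCE, program_id=1))

# Most and least used syntactic elements, as a plain SQL aggregate.
query = (
    "SELECT node_type, COUNT(*) AS uses FROM nodes "
    "GROUP BY node_type ORDER BY uses DESC, node_type"
)
for node_type, uses in conn.execute(query):
    print(node_type, uses)
```

Once every node is a row with the same columns, questions such as "which syntactic elements are used most" reduce to ordinary queries, which is what makes the dataset usable by classic data mining tools.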

In addition to the syntactic information, the program allows the Python files passed as arguments to be flagged as Expert or Beginner. With this expertise level linked to the data mining results, we can classify new programs as Expert or Beginner according to the presence or absence of the syntactic patterns identified as Expert patterns or Beginner patterns.

This type of information is valuable for improving Python programming. We can use it to improve how Python is taught or to improve the tools offered by IDEs.

Dataset generation

The dataset used for the outlier analysis contains more than 13 million database entities. These 13 million entities come from:

  • Student projects from the first-year course Introduction to Programming of the Software Engineering degree at the University of Oviedo
  • Expert projects obtained with the GitHub API. The repositories used are listed in the file github_repos.

Notebooks

The outlier analysis of the previously mentioned dataset is collected in the notebooks directory. Each notebook covers the syntactic construct it is named after. For each syntactic construct there are two additional files, one for the beginners and one for the experts. Each notebook contains a complete analysis of every attribute of the corresponding table. The information is displayed with graphs.

Example

As an example, suppose that there is a directory named "python_projects". This directory must contain a structure of subdirectories with Python files.

The program can receive up to 3 arguments, two of them optional:

  • Directory of the Python programs you want to process
  • Expertise level of the programs you want to process (optional, "BEGINNER" by default)
  • Directory and name of any Python subdirectory you want to process as a single program, ignoring the default program detection (optional, no value by default)
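The argument list above could be modeled with `argparse` roughly as follows. This is only a sketch under assumptions: the README does not show how the program actually parses its arguments, so the names `projects_dir`, `expertise`, and `single_program_dir`, the positional ordering, and the function `build_parser` are all hypothetical.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the three arguments described above."""
    parser = argparse.ArgumentParser(
        description="Sketch of the command-line interface (names are assumptions)"
    )
    # Required: directory containing the Python programs to process.
    parser.add_argument("projects_dir")
    # Optional: expertise level, "BEGINNER" when omitted.
    parser.add_argument(
        "expertise", nargs="?", default="BEGINNER", choices=["BEGINNER", "EXPERT"]
    )
    # Optional: subdirectory to treat as a single program, no value by default.
    parser.add_argument("single_program_dir", nargs="?", default=None)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is self-contained.
args = build_parser().parse_args(["./python_projects", "EXPERT"])
print(args.projects_dir, args.expertise, args.single_program_dir)
```

With only the directory given, `expertise` falls back to "BEGINNER" and `single_program_dir` stays `None`, which matches the defaults listed above.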

[Image: first example call]

In this first call, we process the ./python_projects directory, flag its programs as EXPERT, and follow the default program detection system.

[Image: second example call]

In this call, we process the subdirectory ./python_projects/program_1, flag it as an EXPERT program, and ignore the default program detection system.
