PythonSourceCodeAnalysis

Python Source Code Analysis is a program designed to extract syntactic information from Python programs.

This information is stored in a relational database, where it can be analyzed with data mining algorithms.

Purpose

This program converts the graph-like information obtained with the ast module from the Python Standard Library (PSL) into n-dimensional vectors stored in a relational database. This process creates a dataset with 16 homogeneous tables. The conversion allows classic data mining algorithms to work with the dataset. These algorithms can obtain information such as the most and least used syntactic elements, outlier syntactic patterns, and association rules.
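The idea of turning an AST into fixed-length rows can be sketched as follows. This is a minimal illustration, not the program's actual schema: the table name nodes, its four columns, and the helper node_rows are assumptions made for the example, and the real dataset uses 16 tables whose layout is not described here. The sketch uses only ast and sqlite3 from the standard library.

```python
import ast
import sqlite3

# Small sample program to analyze (illustrative only).
SOURCE = '''
def area(w, h):
    return w * h

total = sum(area(n, 2) for n in range(3))
'''

def node_rows(source, expertise):
    """Yield one fixed-length tuple per AST node (hypothetical layout)."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        yield (
            type(node).__name__,                         # syntactic element
            getattr(node, "lineno", None),               # line, if the node has one
            sum(1 for _ in ast.iter_child_nodes(node)),  # number of direct children
            expertise,                                   # expertise flag of the file
        )

# An in-memory database stands in for the relational database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nodes ("
    "node_type TEXT, line INTEGER, n_children INTEGER, expertise TEXT)"
)
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?, ?)", node_rows(SOURCE, "BEGINNER")
)
conn.commit()

# Once the tree is flattened into rows, "most used syntactic elements"
# becomes an ordinary aggregate query.
most_used = conn.execute(
    "SELECT node_type, COUNT(*) AS uses FROM nodes "
    "GROUP BY node_type ORDER BY uses DESC, node_type LIMIT 3"
).fetchall()
print(most_used)
```

Because every row has the same columns, the table can be handed directly to algorithms that expect tabular input, which a tree structure does not allow.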

In addition to the syntactic information, the program allows the Python files passed as an argument to be flagged as Expert or Beginner. With this expertise level linked to the data mining results, we can classify new programs as Expert or Beginner according to whether they contain the syntactic patterns identified as Expert patterns or Beginner patterns.
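Classification by pattern presence could look roughly like the sketch below. The two pattern sets are placeholder values chosen only to make the example run; they are not findings from the dataset, and the function classify and its tie-breaking rule are likewise assumptions rather than the project's method.

```python
import ast

# Placeholder pattern sets (hypothetical, NOT results from the dataset).
EXPERT_PATTERNS = {"ListComp", "Lambda", "With"}
BEGINNER_PATTERNS = {"While", "Global"}

def classify(source):
    """Label a program by which kind of pattern it contains more of."""
    # Collect the names of all AST node types present in the program.
    present = {type(node).__name__ for node in ast.walk(ast.parse(source))}
    expert_hits = len(present & EXPERT_PATTERNS)
    beginner_hits = len(present & BEGINNER_PATTERNS)
    # Ties fall back to BEGINNER, an arbitrary choice for this sketch.
    return "EXPERT" if expert_hits > beginner_hits else "BEGINNER"

print(classify("squares = [x * x for x in range(5)]"))
print(classify("i = 0\nwhile i < 3:\n    i += 1"))
```

In practice the pattern sets would come from the data mining results rather than being written by hand.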

This type of information is valuable for improving Python programming. It can be used to improve how Python is taught or to improve the tools offered by IDEs.

Dataset generation

The dataset used for the outlier analysis contains more than 13 million database entities. These entities come from:

Student projects from the first-year subject Introduction to Programming of the University of Oviedo degree in Software Engineering
Expert projects obtained with the GitHub API. The repositories used are listed in the file github_repos.

Notebooks

The outlier analysis of the dataset described above is collected in the notebooks directory. Each notebook covers the syntactic construction it is named after. For each syntactic construction there are two additional files, one for the beginners and one for the experts. Each notebook contains a complete analysis of every attribute of the corresponding table, and the information is displayed with graphs.

Example

As an example, suppose there is a directory named "python_projects". This directory must contain a structure of subdirectories with Python files.

The program can receive up to 3 arguments, two of them optional:

Directory of the Python programs you want to process
Expertise level of the programs you want to process (Optional, "BEGINNER" as default)
Directory and name of any Python subdirectory you want to process as a single program, ignoring the program's default detection (Optional, no value by default)

In this first call, we are processing the ./python_projects directory, flagged as EXPERT programs and following the default program detection system.

In this call, we are processing the subdirectory ./python_projects/program_1, flagged as an EXPERT program and ignoring the default program detection system.
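The three arguments described above could be parsed along the following lines. This is a hypothetical sketch using argparse: the argument names projects_dir, expertise and single_program_dir, the function build_parser, and the argument lists passed to it are all assumptions for illustration, since the program's real entry point and parsing code are not shown here.

```python
import argparse

def build_parser():
    """Build a parser for one required and two optional positional arguments."""
    parser = argparse.ArgumentParser(
        description="Hypothetical front end for the analysis program."
    )
    # Required: directory containing the Python programs to process.
    parser.add_argument("projects_dir")
    # Optional: expertise level, "BEGINNER" when omitted.
    parser.add_argument(
        "expertise", nargs="?", default="BEGINNER",
        choices=["BEGINNER", "EXPERT"],
    )
    # Optional: subdirectory to treat as a single program (no default value).
    parser.add_argument("single_program_dir", nargs="?", default=None)
    return parser

# Whole directory, flagged as EXPERT, default program detection.
args = build_parser().parse_args(["./python_projects", "EXPERT"])
print(args.projects_dir, args.expertise, args.single_program_dir)

# One subdirectory processed as a single EXPERT program.
single = build_parser().parse_args(
    ["./python_projects", "EXPERT", "./python_projects/program_1"]
)
print(single.single_program_dir)
```

Positional arguments with nargs="?" keep the order given in the list above while still letting the second and third be omitted.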
