This course is part of the Scientific Data Analytics and Modelling Programme. Its aim is for students to gain practical skills in accessing large databases and datasets, handling data stored in different formats, exploring and distilling these data, and presenting and visualizing the information gathered. During the course students will encounter databases from multiple disciplines. Completing the projects gives students experience in this field, which will be a firm foundation for later courses on theoretical data mining and advanced computing laboratories. The course introduces state-of-the-art tools and methods for data exploration and visualization. The field evolves rapidly, and a year from now one might not use the same tool for the same task, but certain notions, languages and packages remain standard for longer.
There is a useful Python tutorial that provides broad background knowledge and will come in handy.
During the course all sample code will be shown in Jupyter notebooks, which can be accessed on the Kooplex Edu platform.
Each session starts with an hour-long introduction to the current topic, with examples. After that everyone can work on the worksheets, and the lecturers will be available to help with any related problems and questions.
Neptun code: dsexplorf17vm
Instructor: Dávid Visontai
Semester: spring
Type: Lecture + Practice
Credit points: 4
Prerequisites: programming in Python, R or MATLAB
The course is held in the North Building in computer lab 5.56 on Wednesdays 14:00 - 16:30.
1, 12.02.2020. Introduction to Kooplex Edu, Jupyter Notebooks and USGS water discharge statistics
2, 19.02.2020. Maps, shapes, coordinates and Following John Snow
3, 26.02.2020. SQL queries on an NBA database
4, 04.03.2020. Interactive Visualization
5, 11.03.2020. Hierarchical dataformats and standards, storing data
6, 18.03.2020. ****
7, 25.03.2020. REST services
8, 01.04.2020. Network exploration - This lecture will be given by Dániel Ábel, who is a developer at Maven7.
9, 15.04.2020. Data extraction from images
10, 22.04.2020. Natural Language Processing on tweets - This lecture will be given by Eszter Bokányi, whose field of interest is how social phenomena can be captured by using various digital fingerprints of individuals.
11, 29.04.2020. 3D Visualization
12, 05.05.2020. nCoV-2019
13, 12.05.2020. Consultation
- Datatypes, images, timeseries, tables, graphs, textual data
- Standards of file- and dataformats (csv, hdf5, netcdf)
- Raw and processed data, metadata, cleansing of data
- Accessing data locally and through the web (APIs, HTTP protocol)
- Accessing scientific databases, using relational databases (SQL)
- Transforming, sorting and combining data (pandas)
- Basics of timeseries analysis
- Handling datasets with geolocation (shapely, folium, geoviews)
- Basics of image processing (opencv)
- Dimension reduction, clustering
- Processing textual data, logs (natural language processing)
- Infographics, visualisation (html, css, javascript libraries)
- Interactive data exploration tools (ipywidgets, bokeh, holoviews)
- Developing open source software, reproducible research (OSF)
- USGS water discharge statistics - HTML
- Following John Snow - HTML
- SQL queries on an NBA database - HTML
- Interactive Visualization - HTML or Hosted App
- REST API - REST service/API
- Network exploration - HTML
- Data extraction from images - HTML
- Natural Language Processing on tweets - HTML
HTML: e.g. converted from a jupyter notebook
Hosted App: an application that is hosted by a server (plotly, bokeh, etc.)
REST service/API: instructions are on the worksheet
The Kooplex Edu platform is where the notebooks will be handed out. It is available to all students with a valid Neptun or caesar account. Once you start your notebook server you will find a folder with the course material. The notebooks will be available in this GitHub repository as well. We will explain how to use the portal in the first lecture. The Kooplex Edu platform is accessible externally as well. If there is any problem with the portal, you can run a notebook server locally on any other computer and upload your work later.
There will be an assignment for each of the 9 topics, which must be completed individually. The submission deadlines are shown next to each topic, and all related information will be in the topic's folder. These are not strict deadlines; however, we advise students to keep to them in order to be able to complete all tasks.
The minimum requirement for this course is to submit all assignments with at least one completed task. In every worksheet or assignment, the first few tasks follow the exercises explained in the given tutorial files.
The result should look like a report, which will be generated from the worksheets as explained in the lecture. All figures should have labels and a title, each exercise should end with a descriptive conclusion, and explanatory comments should be inserted into the code. These are must-have features of any work intended for presentation.
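One common way to turn a worksheet into an HTML report is the `jupyter nbconvert` command-line tool; the procedure shown in the lecture may differ. The following is a minimal sketch under that assumption. The helper names and the file name `worksheet.ipynb` are hypothetical, and the conversion step requires Jupyter with nbconvert to be installed.

```python
import subprocess
from typing import List


def nbconvert_command(notebook: str) -> List[str]:
    """Build the command line that converts a notebook to HTML."""
    return ["jupyter", "nbconvert", "--to", "html", notebook]


def convert_to_html(notebook: str) -> None:
    """Run the conversion; raises CalledProcessError if nbconvert fails."""
    subprocess.run(nbconvert_command(notebook), check=True)


if __name__ == "__main__":
    # "worksheet.ipynb" is a placeholder; replace it with your own notebook.
    convert_to_html("worksheet.ipynb")
```

Run the notebook from top to bottom before converting, so that all figures and outputs appear in the report.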
Each assignment will be graded after submission and can earn a maximum of 20 points: 10 for the completed tasks and 10 for the quality of the submitted report (look, clarity and comments).
The final grades will be given according to the following point system:
0 - 60: failed
61 - 79: 2
80 - 101: 3
102 - 124: 4
125 - : 5
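The point ranges above can be expressed as a small lookup, which may help when checking where a running total stands. This is only an illustrative sketch: the function name `final_grade` is hypothetical, and the thresholds are copied from the list above.

```python
from bisect import bisect_right

# Lowest point total needed for grades 2, 3, 4 and 5, taken from the list above.
THRESHOLDS = [61, 80, 102, 125]
GRADES = ["failed", "2", "3", "4", "5"]


def final_grade(total_points: int) -> str:
    """Map a total point score to the final grade."""
    if total_points < 0:
        raise ValueError("total_points must be non-negative")
    # bisect_right counts how many thresholds the score has reached.
    return GRADES[bisect_right(THRESHOLDS, total_points)]


if __name__ == "__main__":
    for points in (60, 61, 101, 124, 125):
        print(points, final_grade(points))
```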
In the /home/course/Datasets directory you will find datasets that you can work with.
- Python tutorial: (translated from the BSc course "numerical methods in physics I" by Eszter Bokányi, work in progress)
- SQL tutorial:
- RESTful service:
- Networkx:
- Wes McKinney: Python for Data Analysis (O’Reilly 2013)
- Joel Grus: Data Science from Scratch (O’Reilly 2015)