Big Three
Where we started:
- Hypothesis to validate: the relationship between GDP growth and social and economic indicators.
- Python as the programming language.
- "Our World in Data" as the database.
The objective of the code: to take the data from the database files, perform statistical computations on them, and produce useful numerical and visual results from which to draw conclusions.
Development: As we were not experts in Python, we started learning the fundamentals and basic functions through some videos: one recorded internally at Capgemini and the others mainly from Pluralsight courses, such as Python for Data Analysts, Pandas Fundamentals, and Finding Relationships in Data. At the same time, we began programming the initial functions to open the files and get an overview of the data. Through this inspection of the data, we observed that:
- Data was not normalized.
- Depending on the indicator, its corresponding file was ordered in a different way.
Bearing in mind that these two observations were a problem for us, we realized that the files (CSVs) available on the Our World in Data webpage were extracted from other sources, mainly FAO. So we decided to extract the data directly from there, gaining the benefit of normalized and unified data. We had 68 files that could be related to GDP growth. Each one contained different categories such as country, year, units, value, and others (the latter considered not relevant for our study). Therefore, we joined all the files and organized the data by matching country and year. What we got was a table containing the value of each indicator for each year and region. It is important to note that the indicator files were not all the same size; some registered older historical values or more in-depth data by region than others.
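As a minimal sketch of this joining step (the directory layout and the column names "Area", "Year" and "Value" are assumptions on our part), the idea in pandas could look like this:

```python
import glob
import pandas as pd

# Sketch only: file layout and column names are assumptions.
merged = None
for path in sorted(glob.glob("indicators/*.csv")):
    name = path.split("/")[-1].replace(".csv", "")      # indicator name taken from the file name
    df = pd.read_csv(path)[["Area", "Year", "Value"]]
    df = df.rename(columns={"Value": name})             # one column per indicator
    merged = df if merged is None else merged.merge(df, on=["Area", "Year"], how="outer")

# Result: one row per (country, year), one column per indicator.
merged.to_csv("joined_indicators.csv", index=False)
```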
Picking up from last week's work on joining the data, we are now working on the definition of the data structure, which is extremely important as it will be the basis of our statistical study. The original data from the database had a structure that we have been changing and reorganizing because we identified that it was not uniform in terms of category naming, number, and content.
Although our data structure was clear, the amount of data to be integrated forced us to change our strategy. During these days we have adopted a Spark-based processing methodology. After watching the tutorials, we loaded the data through Spark. Next, we developed a loop that tells us which indicators do not follow the standard structure, formed by the columns 'Element', 'Item', 'Area', 'Year', 'Unit' and 'Value'. Indicators that do not contain these columns have to be modified manually; a minimal sketch of this check is shown below.
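This is a sketch only, assuming one CSV file per indicator in a local directory (the path is a placeholder):

```python
import glob
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-check").getOrCreate()

# Standard structure every indicator file should follow.
STANDARD_COLUMNS = {"Element", "Item", "Area", "Year", "Unit", "Value"}

non_standard = []
for path in glob.glob("indicators/*.csv"):          # hypothetical directory layout
    df = spark.read.option("header", True).csv(path)
    if not STANDARD_COLUMNS.issubset(df.columns):
        non_standard.append(path)                   # these files need manual adjustment

print("Indicators that do not follow the standard structure:", non_standard)
```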
Then, we read the data of each indicator from the main directory and created the structure defined previously. All the data is organized around 'Area' and 'Year' so that every value is integrated in the same way.
Finally, all the standard indicators are joined into a single CSV file, which is then transformed into parquet format. Moreover, we will create the corresponding partitions to take advantage of Spark's parallelism. During this week we have only run small tests of the code, so we will explain this process in detail in the next report; a preliminary sketch of the conversion follows.
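A hedged sketch of the CSV-to-parquet conversion, partitioning by the 'Area' and 'Year' columns described above (file paths are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("to-parquet").getOrCreate()

# Read the joined CSV and persist it as a parquet dataset partitioned by
# 'Area' and 'Year', so Spark can parallelize the work per country and year.
joined = (spark.read
          .option("header", True)
          .option("inferSchema", True)
          .csv("output/joined_indicators.csv"))

(joined.write
 .mode("overwrite")
 .partitionBy("Area", "Year")
 .parquet("output/joined_indicators.parquet"))
```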
Another issue we faced has been the oversaturation of data: so many indicators, in so many countries, with historical records all together. We decided that, for a more accurate study, we should narrow the focus. With that aim, we selected the countries to study and grouped them by geographical location and economic similarity. The confirmed groups are:
- USA + China (2)
- Europe (10): Germany, France, Sweden, UK, Spain, Croatia, Poland, Greece, Austria, Netherlands
- Persian Gulf (7): Iraq, Qatar, UAE, Saudi Arabia, Azerbaijan, Yemen, Oman
- North Africa (6): Algeria, Egypt, Libya, Israel, Turkey, Morocco
- Africa (8): Senegal, South Africa, Liberia, Mozambique, Cameroon, Nigeria, Ghana
- East Asia (7): Bangladesh, India, Vietnam, Thailand, Indonesia, Philippines, South Korea
- LATAM (10): Mexico, Brazil, Argentina, Peru, Venezuela, Colombia, Chile, Panama, Costa Rica
In this way, there is a total of 50 countries to be studied according to the different indicators, which will be defined in the next steps.
The last challenge we faced is that the data downloaded from FAOSTAT did not contain information about all the variables of interest. At the beginning of the project, we defined a set of indicators to develop the initial hypotheses: Literacy, Financing education, Migration, Mobility (global and local, people and commerce), International trading, Fertility rate, Healthcare, Employment in agriculture, Renewable energy, Mortality, Outside investment, Land use, Contamination, Alcoholism, Tech adoption, Loneliness and social connections, Working hours, Trust, Optimism and pessimism, Gender inequality, Violence and peacekeeping, and Global education. To have data on all these variables, we decided to work with a new database from The World Bank page.
Once we had selected the final database from The World Bank page, we filtered the indicators and the variables to be measured. The final structure we are going to work with is this one:
Then, with this information stored, we normalized all the CSVs, including the principal one: the GDP of the 50 selected countries. The normalization process consists of studying whether the data follow a normal distribution or whether they have outliers that should be omitted or studied in a specific way. The steps we followed are listed below (a minimal sketch follows the list):
1. Rename the indicators to simpler names. For example, the indicator "Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)" has been renamed Alcoholism.
2. Divide the whole database into groups according to the country code.
3. Substitute the null values with the mean of each indicator for each country.
4. Detect outliers and remove them, so that in the following step we can affirm that our data follows a normal distribution.
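A minimal sketch of these steps in pandas, assuming a long-format table with 'Country Code', 'Indicator', 'Year' and 'Value' columns (the file name and the 1.5 x IQR outlier rule are our assumptions; IQR is the rule we adopt later in the project):

```python
import pandas as pd

df = pd.read_csv("world_bank_indicators.csv")

# Step 1: rename verbose indicator names to short labels.
df["Indicator"] = df["Indicator"].replace({
    "Total alcohol consumption per capita (liters of pure alcohol, "
    "projected estimates, 15+ years of age)": "Alcoholism",
})

# Steps 2-3: group by country code and fill nulls with each indicator's mean per country.
df["Value"] = (df.groupby(["Country Code", "Indicator"])["Value"]
                 .transform(lambda s: s.fillna(s.mean())))

# Step 4: drop values outside 1.5 * IQR per country and indicator.
def without_outliers(group):
    q1, q3 = group["Value"].quantile([0.25, 0.75])
    iqr = q3 - q1
    return group[group["Value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

df = (df.groupby(["Country Code", "Indicator"], group_keys=False)
        .apply(without_outliers))
```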
At the end of the week, we started developing a script to check whether our data is normally distributed and to compute, for each country, the correlation with every variable.
During this week we focused on the first demo, in which we presented the work done during these three weeks to our client. At this point the previous steps, Data Extraction, Data Integration and Data Normalization, have been finalized successfully. In the Data Normalization process we introduced some new steps. The first is related to scaling the data to visualize it better in the future. The second consists of replacing the null values with values obtained by linear interpolation between the previous and following points, plus backward filling. With this first sprint finalized, we were able to create the final Data Flow representation with the Draw.io program:
In this Data Flow we have implemented the Multi-Hop Architecture to structure the Data. We have learned about this model from this link: https://databricks.com/blog/2019/08/14/productionizing-machine-learning-with-delta-lake.html.
Then, we programmed the first code related to the next step: the correlation study. What we have done consists of drawing a series of graphs that let us see the relationships between the variables, and between each variable and the GDP of each country. The graphs used are scatter plots and histograms, as well as a correlation matrix. The results give a first approximation to guide the next steps.
Finally, for the next steps, the continuation of the correlation code has been created. This computation shows whether each indicator follows a normal distribution. If an indicator does not follow a normal distribution, it will be removed; for those that do, the correlation will be computed taking into account linear, quadratic and logarithmic relationships. A sketch of the normality filter is shown below.
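A small sketch of what this normality filter could look like, using scipy's Shapiro-Wilk test as one common choice (the input layout, file name and 0.05 threshold are assumptions):

```python
import pandas as pd
from scipy import stats

# Assumed layout: one column per indicator, one row per year, for a given country.
df = pd.read_csv("country_indicators.csv")

normal_indicators = []
for column in df.columns:
    series = df[column].dropna()
    if len(series) < 3:
        continue                     # the Shapiro-Wilk test needs at least 3 observations
    _, p_value = stats.shapiro(series)
    if p_value > 0.05:               # we fail to reject normality
        normal_indicators.append(column)

print("Indicators kept for the correlation study:", normal_indicators)
```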
After the first demo we prioritized improving a set of things according to our client's requests. One of the improvements consisted of developing an initial matrix of our hypotheses about the relationship between the GDP of each country and the indicators:
In addition, we added some assumptions, as well as an explanation of the libraries used, to our notebook.
After that we continued with the correlation study code. From this loop we obtain which variables have a high correlation (>0.8) with the GDP of each country and what type of correlation it is. We are going to work with four types: linear, quadratic, cubic and logarithmic. A sketch of this classification follows.
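One way this classification could be sketched (the function, the transformations applied to the indicator, and the handling of non-positive values before the logarithm are our assumptions): apply each transformation, compute Pearson's r against GDP, and keep the strongest fit above the 0.8 threshold.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

TRANSFORMS = {
    "linear": lambda x: x,
    "quadratic": lambda x: x ** 2,
    "cubic": lambda x: x ** 3,
    "logarithmic": lambda x: np.log(x.clip(lower=1e-9)),  # guard against log(0)
}

def correlation_type(indicator: pd.Series, gdp: pd.Series, threshold: float = 0.8):
    """Return (type, r) for the strongest correlation above the threshold, else None."""
    best = None
    for name, transform in TRANSFORMS.items():
        x, y = transform(indicator).align(gdp, join="inner")
        mask = x.notna() & y.notna()
        if mask.sum() < 3:
            continue
        r, _ = pearsonr(x[mask], y[mask])
        if abs(r) > threshold and (best is None or abs(r) > abs(best[1])):
            best = (name, r)
    return best
```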
Then, based on these computations, we have been able to compare the actual results with the initial hypotheses.
The current and next steps consist of reporting a broad correlation study. The structure we are going to follow is:
- Check the hypotheses previously defined.
- Identify the type of correlation.
- Explain what this correlation implies.
- Put each relationship in context according to the country's culture and situation.
As we mentioned last week, our main goal during this week has been to extract conclusions based on the numerical results. In this way we have matched the quantitative results with the qualitative ones. Moreover, we have programmed a set of graphs to show the results in a more visual way.
Then, as we had our second demo on Friday, we prepared everything for it. The first step was to update our Data Flow with the latest processes implemented:
The conclusions have been reported in a text document, where we explain in detail the relevant correlation results, paying attention to the economic and social context. The sources of information are also provided in each case.
As there was so much information to analyze and research, we distributed the different blocks (of regions) among the three of us. We predefined a way of reporting the conclusions, so unifying them later was a little easier. Besides, in terms of working on GitHub, we split the work into branches and later did the corresponding merging and reorganization.
We also completed the Excel file (the one with the initial hypotheses) and dumped the real results there, so we could compare them.
Apart from this, we have been working on the visualization of the results. Our first stage has been using Python widgets to create a world map and other graphs. However, we are considering other tools for the next sprint.
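Purely as an illustration (the report does not name a plotting library; plotly is just one option, and the country codes and correlation values below are dummy placeholders), a world-map view could be sketched like this:

```python
import pandas as pd
import plotly.express as px

# Dummy placeholder values; in practice these would come from the correlation results.
results = pd.DataFrame({
    "iso_code": ["USA", "CHN", "DEU", "BRA"],
    "correlation": [0.85, 0.91, 0.78, 0.83],
})

fig = px.choropleth(results,
                    locations="iso_code",        # ISO-3 country codes
                    color="correlation",
                    color_continuous_scale="Viridis",
                    title="Correlation with GDP by country")
fig.show()
```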
At this point, we prepared a presentation for the second demo, which is available here.
After the second demo we noticed that we were not meeting the client's expectations. Thus, during this week we have focused on redirecting the work done during the last weeks, trying to align it with the project scope.
The first thing we have done is rework the integration process, taking into account new items:
- Now, we have not filtered the indicator list according to our own assumptions. Some variables showed the same results in different units, so we selected the most reliable unit; for example, for economy-related indicators we decided to work with those expressed in $.
- Once this filter was applied, we created a new classification for our indicators. Every indicator has been assigned an Area:
-A&D (alcoholism and drugs)
-Agriculture
-Demography
-Economy
-Education
-Exports
-Environment
-Employment
-Equality
-Health
-Imports
-Industry
-Internet
-Logistic
-Migration
-Military
-Mortality
-Principal
-R&D (research and development)
-Social
-Taxes
-Time efficiency
-Weather
And with a Category:
-Primary: the most representative ones.
-Secondary: complementary to the primary ones.
This classification will allow us to better extract conclusions about relationships.
We have changed some normalization steps too:
- Removing the indicators of a country that have more than 308 missing values (20% of the records), because that absence of data makes the source unreliable. We have reduced the minimum number used in this filter.
- As before, we then apply the IQR rule to delete outliers.
- The remaining missing data is replaced by the result of linear interpolation, backward filling and forward filling.
- Another new step consists of scaling values relative to the first value (year 0) of each indicator. In this way we obtain, for each year, the percentage by which the value has changed with respect to the initial year. A sketch of this reworked normalization follows the list.
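A compact sketch of these steps for one country, assuming a wide table with one column per indicator indexed by year (the file name and the 20% threshold expressed as a fraction are our assumptions):

```python
import pandas as pd

df = pd.read_csv("country_data.csv", index_col="Year")

# Drop indicators with more than 20% missing values (the report uses 308 values as the cutoff).
df = df.loc[:, df.isna().mean() <= 0.20]

# Mask values outside 1.5 * IQR per indicator (they become NaN and are filled below).
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
df = df.where(df.ge(q1 - 1.5 * iqr) & df.le(q3 + 1.5 * iqr))

# Fill the gaps: linear interpolation, then backward and forward filling.
df = df.interpolate(method="linear").bfill().ffill()

# Scale every indicator to its first (year-0) value, expressed as a percentage.
df = df / df.iloc[0] * 100
```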
After this new normalization process, we worked on the correlation study step. First of all, we read up on new methods of computing correlation. We learned about the Spearman correlation, which is very similar to Pearson (the tool we had been using): https://support.minitab.com/en-us/minitab-express/1/help-and-how-to/modeling-statistics/regression/supporting-topics/basics/a-comparison-of-the-pearson-and-spearman-correlation-methods/ . In this way we are trying to find a pattern that allows us to firmly affirm that a correlation exists. If both methodologies give the same result, that is, both values greater than 0.75 in absolute value, with the same sign (positive or negative) and the same type of correlation (linear, cubic, quadratic or logarithmic), we can start developing hypotheses and validate the correlation.
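A minimal sketch of this cross-check between the two methods (the function name is ours, and the inputs are assumed to be NaN-free numeric sequences of equal length):

```python
from scipy.stats import pearsonr, spearmanr

def correlation_agrees(x, y, threshold=0.75):
    """True when Pearson and Spearman both exceed the threshold in absolute
    value and have the same sign, which is our cross-validation criterion."""
    r_pearson, _ = pearsonr(x, y)
    r_spearman, _ = spearmanr(x, y)
    return (abs(r_pearson) > threshold
            and abs(r_spearman) > threshold
            and (r_pearson > 0) == (r_spearman > 0))
```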
Then, the rolling correlation will be computed for these indicators. This tool allows us to study the trend of the high correlation values and extract clearer conclusions: https://www.geeksforgeeks.org/how-to-calculate-rolling-correlation-in-python/
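In pandas this could look like the sketch below (the series are dummy placeholders and the 5-year window is an assumption):

```python
import numpy as np
import pandas as pd

# Hypothetical yearly series for one country; in practice these come from the
# normalized dataset described above.
years = pd.RangeIndex(1990, 2020, name="Year")
rng = np.random.default_rng(0)
gdp = pd.Series(rng.normal(size=len(years)).cumsum(), index=years)
indicator = pd.Series(rng.normal(size=len(years)).cumsum(), index=years)

# Rolling correlation over a 5-year window.
rolling_corr = indicator.rolling(window=5).corr(gdp)
print(rolling_corr.tail())
```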
On the other hand, another type of correlation will be extracted in parallel. This one consists of studying the relationship between the values of a variable in one time frame and the GDP values of a different time frame. We consider this an important aspect to study because some indicators could have a direct influence on future years rather than on the same years in which they occur.
During this week we have worked on the changes explained the week before. Firstly, all correlation computations have been completed successfully, so we can extract conclusions about which are the most important indicators, the correlation type, how many times they appear (in which countries of the total country list) and which is the appropriate time period for the correlation (it can be a direct correlation or one displaced in time). To study the displacements in time we have worked with a time range based on the Fibonacci series, using lags of 1, 2, 3, 5, 8, 13 and 21 years. A sketch of this lagged computation is shown below.
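One way to sketch the displaced (lagged) correlation with the Fibonacci offsets (the function and the direction of the shift, with the indicator leading GDP, are our assumptions):

```python
import pandas as pd
from scipy.stats import pearsonr

FIBONACCI_LAGS = [1, 2, 3, 5, 8, 13, 21]  # years of displacement studied

def lagged_correlations(indicator: pd.Series, gdp: pd.Series):
    """Correlate the indicator with GDP shifted forward by each Fibonacci lag."""
    results = {}
    for lag in FIBONACCI_LAGS:
        shifted_gdp = gdp.shift(-lag)            # GDP `lag` years later
        aligned = pd.concat([indicator, shifted_gdp], axis=1).dropna()
        if len(aligned) < 3:
            continue
        r, _ = pearsonr(aligned.iloc[:, 0], aligned.iloc[:, 1])
        results[lag] = r
    return results
```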
We assume that a correlation found in the primary indicators could be caused by randomness; however, if this correlation also appears in the secondary indicators for at least 80% of the countries that appear in the primaries (Pareto's rule), we can suppose that there is no randomness affecting the group. Furthermore, this condition has to hold for 80% of the secondary indicators to rule out chance.
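A small sketch of the country-overlap part of this rule (the function and the representation of countries as plain sets are our assumptions):

```python
def backed_by_secondaries(primary_countries, secondary_countries, threshold=0.8):
    """True when the secondary indicator shows the correlation in at least 80% of
    the countries in which the primary indicator shows it (Pareto's rule)."""
    primary = set(primary_countries)
    if not primary:
        return False
    overlap = len(primary & set(secondary_countries))
    return overlap / len(primary) >= threshold
```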
After that, we focused on creating widgets to show the results. As we had a presentation on Friday, we have only drawn a small set of them to show some results.
Finally, we have updated our Data Flow implementing all new changes:
During the last week we have worked on the final details to show our client the best version of our work. The actions taken have been:
- Update the final Data Flow version.
- Write the complete markdown documentation, including explanations of the graphs for 6 examples; each of us selected a pair of examples to describe how to interpret the results.
- Update the GitHub repository, dividing the whole notebook into one notebook per step.