
Reproducible science applied to artificial intelligence: a case study

Acknowledgments

Parents, advisor, institution, and Conacyt.

Abstract

Investment strategies in the financial industry are a set of rules and procedures designed to guide investors in the selection of an investment portfolio. These strategies are categorized into passive and active strategies. Passive management strategies are those in which the investor buys and holds a position, expecting to profit in the long run by taking advantage of the market’s natural growth; in active management strategies, the investor believes they can beat the market’s growth and therefore continuously opens and closes positions, expecting to profit in the short term. However, there is a major challenge in active management: strategies with a rigid set of rules stop working at some point, since the market conditions that exist at any given time and allow the investor to profit soon begin to disappear; the market is constantly changing. It is therefore necessary to create investment strategies that can adapt to the changing nature of markets and learn to correct their parameters so that the strategy stays profitable in the long run. In this study we explore the usage of artificial neural networks (ANN) as a basis for the development of investment strategies that can adapt to changing conditions in financial markets. We focus our research on verifying the results obtained by Kara et al. 2011 in the article “Predicting direction of stock price index movement using artificial neural networks and support vector machines” using an open science approach, and on experimenting with the application of ANNs to stock price indices in markets of the Americas such as the Standard & Poor’s 500 (S&P500) and the Índice de Precios y Cotizaciones (IPC).

Summary

Investment strategies in finance are a set of rules, behaviors, and procedures designed to guide an investor in the selection of an investment portfolio. These strategies are divided into passive and active: passive strategies are those in which an instrument is bought and the position is held expecting to generate a long-term return, while in active management positions are opened and closed frequently because the investor believes their management skills are better than those of the average market participant. However, there is a great challenge in active management: rigid management strategies do not work for long, since the market conditions that exist at a given time and allow a profit to be generated soon dissolve and disappear; the market is ever changing. It is therefore necessary to develop investment strategies that can adapt to market changes and whose parameters we learn to correct so that they remain profitable over time. In this study we explore the use of artificial neural networks (ANN) as a basis for developing investment strategies that allow them to re-adapt to changing market conditions. We focus the study on verifying the results obtained by Kara et al. 2011 in the article “Predicting direction of stock price index movement using artificial neural networks and support vector machines”, supporting the open science movement, and on experimenting with the application of ANNs to stock market indices of markets in the Americas such as the Standard & Poor’s 500 (S&P500) and the Índice de Precios y Cotizaciones (IPC).

Introduction

This section explains to the reader what they are going to read, why they should read it, what the objectives are, and how the text is organized. A good introduction predisposes the reader favorably.

  • How the topic is arrived at.
  • What it is about, that is, what is intended to be done, in generic terms.
  • What the objectives are, both the general one and the particular ones (very clear and precise!).
  • How the text is organized; in short, what the reader will find in each of the chapters.

The performance of financial markets is often used as a leading indicator of the economy. When stock market indices reveal weakness in upward movements, there is a perception of fear among market participants that reflects their views on the overall economy of a country: as markets retrace, we see a slowdown in the development of the economy and possibly the beginnings of a recession. On the contrary, when market participants see confidence and strength in upward market movements, the reflection on the overall economy is positive, as businesses are growing.

Importance of developing and evaluating strategies for predicting markets

General purpose

The general objective and the particular objectives are indicated here. The general objective must fully describe what the work is expected to achieve.

OG1

Provide evidence for the need to promote Open Science practices in Artificial Intelligence research.

Old OG1

Explore application of artificial intelligence techniques for investment strategies in emerging financial markets.

Concrete purposes

The particular objectives must be sufficiently clear and specific (quantitative when necessary) so that they can be evaluated at the end of the thesis. This part shows that the thesis author knows clearly what they are going to do.

OE1

Identify tools that facilitate management of artificial intelligence research.

OE2

Explore current usage of open science tools in Software Engineering and Artificial Intelligence.

Justification

This part includes the academic and practical reasons that justify carrying out the project.

The needs that are satisfied, the benefits that will be obtained, and the socio-economic impact of the work (who will benefit if the problem is solved) can be indicated, as appropriate.

Stock and futures trading literature emphasizes the importance of capital management, that is, how much of the available capital is to be allocated to a specific market position. Regardless of the probability of success of a single position in the market, the management strategy used can result in higher profits over a series of trades.

However, there is a need for a procedure that allows an unbiased comparison of multiple position sizing strategies based on quantitative methods; moreover, a summary of the pros and cons of the best-known position sizing strategies is needed.

It is not enough to know which position sizing strategy is optimal, as a complete investment strategy also requires a reliable method to detect and exploit market inefficiencies. The academic literature describes procedures for generating machine learning models that are able to predict market trending direction, but among the various described methods we need to find which one is best to use together with the optimal position sizing strategy so that these two components maximize profit.

In computer science, research papers often offer analysis of or improvements over existing algorithms; it is natural that scientific exploration in this area is reproducible, since the improvement over the algorithm is discussed in the article at a great level of detail.

However, building intuition for the learning algorithms used in artificial intelligence is only easy when they are applied to a low number of features. The complexity of the computations increases as the number of features used in the model increases (e.g. when using ANNs with a high number of layers, RNNs, or other sophisticated algorithms such as SVMs in higher-dimensional spaces).

It is necessary to guarantee the reproducibility of research: since model performance varies depending on the data and parameters used, it could be the case that the reproduction of a study yields a different result altogether. Open, reproducible research tools can help to overcome these problems.

complete the justification

Background

This section indicates the theoretical and practical background related to the work to be developed, to give it greater scientific credibility.

Machine Learning

Machine learning is a field of computer science and a subset of artificial intelligence that gives computers the ability to learn without being explicitly programmed cite:Koza1996; it is a discipline that explores the study and construction of algorithms that can learn from and make predictions on data cite:Kohavi1998. Machine learning tasks are typically classified into two broad categories: supervised learning, where feedback is available to the learning system, and unsupervised learning, where it is not.

In this section we discuss some of the algorithms applied in supervised learning, several of which were used in this study for forecasting the performance of financial markets.

TODO: Add something about reinforcement learning?

Artificial Neural Networks

Artificial neural networks (ANNs) are biologically inspired computer programs designed to simulate the way in which the human brain processes information. ANNs gather their knowledge by detecting the patterns and relationships in data and learn (or are trained) through experience, not from programming. An ANN is formed from hundreds of single units, artificial neurons or processing elements, connected with coefficients (weights), which constitute the neural structure and are organised in layers. The power of neural computations comes from connecting neurons in a network. Each processing element has weighted inputs, a transfer function, and one output. The behavior of a neural network is determined by the transfer functions of its neurons, by the learning rule, and by the architecture itself. The weights are the adjustable parameters and, in that sense, a neural network is a parameterized system. The weighted sum of the inputs constitutes the activation of the neuron. The activation signal is passed through the transfer function to produce a single output of the neuron. The transfer function introduces non-linearity to the network. During training, the inter-unit connections are optimized until the error in predictions is minimized and the network reaches the specified level of accuracy cite:Agatonovic-Kustrin2000.

images/ann.png
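
As a concrete illustration of these ideas, the following is a minimal sketch of a forward pass through a one-hidden-layer network in numpy; the layer sizes, random weights, and the sigmoid transfer function are illustrative assumptions, not taken from the study.

#+BEGIN_SRC python
import numpy as np


def sigmoid(x):
    # Transfer function: introduces non-linearity into the network
    return 1.0 / (1.0 + np.exp(-x))


rng = np.random.default_rng(0)

# Illustrative architecture: 3 inputs -> 4 hidden neurons -> 1 output
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

x = np.array([0.5, -1.2, 0.3])      # one input sample

hidden = sigmoid(W1 @ x + b1)       # weighted sum of inputs, then transfer function
output = sigmoid(W2 @ hidden + b2)  # single output of the network
print(output)
#+END_SRC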

SVM

Support vector machines (SVMs) are particular linear classifiers which are based on the margin maximization principle. They perform structural risk minimization, which improves the complexity of the classifier with the aim of achieving excellent generalization performance. The SVM accomplishes the classification task by constructing, in a higher dimensional space, the hyperplane that optimally separates the data into two categories cite:Adankon2009.

images/svm.png
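
To illustrate, here is a minimal sketch of training such a classifier with scikit-learn on synthetic two-class data; the dataset and parameters are assumptions for the example only.

#+BEGIN_SRC python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two synthetic classes in a 2-dimensional feature space
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# The RBF kernel implicitly maps the data into a higher-dimensional
# space, where the margin-maximizing hyperplane is constructed
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)
print(clf.predict([[0.5, 0.5]]))
#+END_SRC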

Naive Bayes classifier

TODO: Others? Logistic Regression, Decision Tree, Random Forests, KNN?

Evaluating Artificial Intelligence Algorithms

Precision, Recall, Accuracy
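
These three metrics are computed from a binary classifier’s confusion counts (true/false positives and negatives). A minimal sketch of their computation; the counts are made up for illustration:

#+BEGIN_SRC python
def precision(tp, fp):
    # Of all predicted positives, how many were truly positive
    return tp / (tp + fp)


def recall(tp, fn):
    # Of all actual positives, how many were detected
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    # Fraction of all predictions that were correct
    return (tp + tn) / (tp + tn + fp + fn)


# Illustrative confusion counts for an up/down direction classifier
tp, tn, fp, fn = 40, 35, 10, 15
print(precision(tp, fp), recall(tp, fn), accuracy(tp, tn, fp, fn))
#+END_SRC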

Backtesting

Bagging

Evaluating Investment Strategies

Walk-Forward analysis

First, some simple definitions regarding walk-forward analysis are in order:

  • In period. This is the chunk of historical data that will be optimized.
  • Out period. This is the chunk of historical data that will be evaluated using optimized results from the adjacent in period.
  • Fitness factor. This is the criterion used to determine the “best” result, allowing us to select the optimized parameters.
  • Anchored/Unanchored test. This tells us whether or not the in period start date shifts with time, or if the start date is always the same.
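
A minimal sketch of how such in/out splits could be generated over a series of observations follows; the function name and window lengths are illustrative assumptions.

#+BEGIN_SRC python
def walk_forward_splits(n_bars, in_len, out_len, anchored=False):
    # Yield (in_start, in_end, out_end) index triples; parameters are
    # optimized on [in_start:in_end] and evaluated on [in_end:out_end]
    start, in_end = 0, in_len
    while in_end + out_len <= n_bars:
        yield start, in_end, in_end + out_len
        in_end += out_len
        if not anchored:      # unanchored: the in period start date shifts
            start += out_len  # anchored: the start stays at the first bar


for in_start, in_end, out_end in walk_forward_splits(1000, 250, 50):
    print(in_start, in_end, out_end)
#+END_SRC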

Monte Carlo simulation and projections (P&L, Drawdown, Ruin risk)

Statistical tests (hypothesis testing, Student’s t)

Open Science

Open Science is the movement to make scientific research, data, and their dissemination available to any member of an inquiring society, from professionals to citizens, with the ultimate aim of making it easier to publish and communicate scientific knowledge. As stated by Pontika, Knoth, Cancellieri & Pearce (2015), it allows the reproduction of research findings, enables transparency in the research methodology, increases the researcher’s societal impact, and saves money and time for both researchers and research institutions. As of 2015, practices and techniques to be used in Open Science research were not widespread; an effort by FOSTER (Facilitate Open Science Training for European Research), a European Commission funded project, started by developing an e-learning portal to support the training of a wide range of stakeholders in Open Science and related areas. An Open Science taxonomy of nine areas was defined: Open Access, Open Data, Open Reproducible Research, Open Science Definition, Open Science Evaluation, Open Science Guidelines, Open Science Policies, Open Science Projects, and Open Science Tools, as shown below in figure fig:ost.

./images/open-science-taxonomy.png

Note that each of these terms was further divided into sub-topics to better describe and classify the area. In FOSTER (2015), general Open Science practices for researchers were described as follows:

  • Share protocols openly online and store data in the most open format possible.
  • Use easily attainable software to facilitate reproduction of results.
  • Publish preprints and be positive about open peer review.
  • Cite open access versions of the literature, open data, and open code.
  • Acknowledge contributor roles in publications.
  • Translate research objects into as many languages as possible.
  • Openly share research hypotheses and proposals, encouraging feedback.

Next we review the most relevant Open Science Areas for this work.

Open Data

Open Data is online, free of cost, accessible data that can be used, reused, and distributed, provided that the data source is attributed and shared alike cite:FOSTER2015. As research becomes more and more data-driven, progress in scientific knowledge becomes intimately tied to data availability. An Open Data policy enables researchers to make use of existing knowledge in innovative and complementary ways. Needless to say, Open Data is crucial to the reproducibility of scientific research. Pfenninger, DeCarolis, Hirth, Quoilin & Staffell (2017), in “The importance of open data and software”, state that given the critical guidance that data provide to decision makers, data should be made open and freely available to researchers as well as the general public. They provided four specific reasons for this:

  • Improved quality of science.
  • More effective collaboration across the science-policy boundary.
  • Increased productivity through collaborative burden sharing.
  • Profound relevance to societal debates.

Collecting data, formulating models, and writing code are resource-intensive activities. Research funding is limited and researcher time is a scarce resource. Society as a whole saves time and money if researchers avoid unnecessary duplication and learn from one another. Individual researchers gain more time to spend on pressing research questions rather than on redundant work on model or dataset development cite:Pfenninger2017. Besides that, researchers are fallible human beings, and under the pressure to deliver, errors are inevitable. Such mistakes can have profound implications. Finally, besides the practical considerations outlined above, there remains the ethical argument that research funded by public money should be available to the public in its entirety.

Open Reproducible research

Open Reproducible Research is the practice of Open Science together with the provision of free access to the experimental elements needed to reproduce research. This allows for reproducibility testing, the process of validating that the reported research results can be obtained in an independent experiment. In this area, sharing laboratory research records, diaries, journals, and workbooks is encouraged; these should be offered free of cost and under terms that allow reuse and redistribution of the recorded material. It is expected as well that open source software is provided under terms that allow dissemination and adaptation. As indicated in Kluyver et al., 2016, several papers have been published with supporting notebooks that reproduce the analysis or the creation of key plots. The detection of gravitational waves by the LIGO experiment is one such case: the researchers posted a notebook on their website illustrating in detail how to filter and process the data to reveal the signature of a distant black hole merger. Others quickly made this available through Binder (a tool for sharing live notebooks), allowing anyone to replicate the analysis even without downloading or installing anything. Other papers published in fields from geology to genetics to computer science have used notebooks as supporting material.

Jupyter Notebook

The Jupyter notebook is an open-source, browser-based tool functioning as a virtual lab notebook to support workflows, code, data, and visualizations detailing the research process. These notebooks can live in online repositories and provide connections to research objects such as datasets, code, methods documents, workflows, and publications that reside elsewhere cite:Randles2017. Notebooks are designed to support the workflow of scientific computing, from interactive exploration to publishing a detailed record of computation. The code in a notebook is organised into cells, chunks which can be individually modified and run. The output from each cell appears directly below it and is stored as part of the document. However, whereas the direct output in most shells can only be text, notebooks can include rich output such as plots, formatted mathematical equations, and even interactive controls and graphics cite:Kluyver2016.

./images/jupyter-notebook.png

Jupyter notebooks are a medium for making science more open. In a study in which 91 publications in the Astrophysics Data System were analyzed, approximately 40% of the publications linked to repositories with code, data, and Jupyter notebooks that reproduce the research cite:Randles2017. Notebooks also fit well into novel publishing paradigms, such as post-publication review. Digital objects such as GitHub repositories, which may contain notebooks, and blog posts, which may be made from notebooks, can now be archived and given permanent DOI references, making it practical to cite them in other publications cite:Kluyver2016.

Apache Zeppelin
Org mode

Org mode is a tool for the Emacs text editor that, in the same vein as Jupyter Notebooks, allows text content to interact with code.

Org is a mode for keeping notes, maintaining TODO lists, and project planning with a fast and effective plain-text system. It also is an authoring system with unique support for literate programming and reproducible research. cite:Dominik2018

Org is implemented on top of Outline mode, which makes it possible to keep the content of large files well structured. Visibility cycling and structure editing help to work with the tree. Tables are easily created with a built-in table editor. Plain text URL-like links connect to websites, emails, Usenet messages, BBDB entries, and any files related to the projects.

./images/org-sample.png

Org files can serve as a single-source authoring system with export to many different formats such as HTML, LaTeX, Open Document, and Markdown. New export backends can be derived from existing ones, or defined from scratch.

Org files can include source code blocks, which makes Org uniquely suited for authoring technical documents with code examples. Org source code blocks are fully functional; they can be evaluated in place and their results can be captured in the file. This makes it possible to create a single file reproducible research compendium. cite:Dominik2018

./images/org-source-block.png
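
For example, a minimal Python source block in an Org file looks as follows; evaluating it (e.g. with C-c C-c) inserts its output directly below the block:

#+BEGIN_SRC python :results output
print("2 + 2 =", 2 + 2)
#+END_SRC

#+RESULTS:
: 2 + 2 = 4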

Research Methodology

This chapter presents the development of the work, rigorously following the methodology indicated above.

This part includes the description of each of the stages of the development of the thesis and their scheduling by means of a bar chart.

The time assigned to each stage must take into account the amount of work to be done and the effective time available to the thesis author.

This schedule will be used to evaluate the thesis author’s progress.

It will be necessary to specify the activities to be carried out at other institutions, should that be the case.

Our goal is to apply Monte Carlo simulation to trading systems, where the process being studied is the sequence of closed trades or daily equity.
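
As a minimal sketch of the idea (the trade history and all numbers are illustrative assumptions), resampling the sequence of closed trades with replacement produces a distribution of possible equity curves, from which P&L and drawdown projections can be read as percentiles:

#+BEGIN_SRC python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative per-trade profit/loss history, e.g. from a backtest
trade_pnl = rng.normal(loc=5.0, scale=50.0, size=200)

final_equity, max_drawdowns = [], []
for _ in range(10_000):
    # Resample the closed-trade sequence with replacement
    sample = rng.choice(trade_pnl, size=trade_pnl.size, replace=True)
    equity = 10_000 + np.cumsum(sample)
    peak = np.maximum.accumulate(equity)
    final_equity.append(equity[-1])
    max_drawdowns.append(np.min((equity - peak) / peak))

print(np.percentile(final_equity, [5, 50, 95]))  # P&L projection
print(np.percentile(max_drawdowns, 5))           # drawdown tail risk
#+END_SRC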

For the comparison of position sizing strategies

Now we look at the procedure defined for the comparison of position sizing strategies. As stated earlier, for an unbiased comparison we need to test several runs of each strategy in which the only variable parameter is the capital committed at each step of the execution of the strategy. Neither the initial capital nor the probability of success/failure should be altered; that is, we are not taking into account the method by which trade decisions are made, but rather, once a decision is taken, the optimal capital to risk at each step in a sequence of trades.

A baseline must be specified, against which the performance of all strategies will be measured; this baseline consists of fixed parameters in the simulation of a stochastic process. The number of positions in a trading sequence, the probability of success of every trade in the sequence, and the initial capital at the beginning of the sequence are the same for all strategies.

The identified parameters are summarized in table ref:tbl:ptypes and described as follows (a simulation sketch under this baseline is shown after the table):

  • # Trades: Fixed parameter indicating the number of trades in the execution of the strategy (length of the sequence).
  • P(W): Fixed parameter indicating the probability of success of each trade in the sequence.
  • P. Distribution: Fixed parameter indicating the probability distribution from which the trade outcome is drawn.
  • Initial capital: Fixed parameter indicating the starting capital (amount) available at the beginning of the sequence.
  • Position size: Variable parameter controlling the size of a single position in the trade sequence.
#+NAME: tbl:ptypes
| Parameter       | Type    |
|-----------------+---------|
| # Trades        | Fixed   |
| P(W)            | Fixed   |
| P. Distribution | Fixed   |
| Initial capital | Fixed   |
| Position size   | Dynamic |
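
The following sketch shows how such a baseline could be held fixed while only the sizing rule varies between runs; all values and names here are illustrative assumptions:

#+BEGIN_SRC python
import numpy as np

rng = np.random.default_rng(0)

# Fixed baseline parameters shared by every strategy
N_TRADES, P_WIN, INITIAL_CAPITAL = 500, 0.55, 10_000.0


def run(position_size):
    # Simulate one trade sequence; only the sizing rule varies
    capital = INITIAL_CAPITAL
    for _ in range(N_TRADES):
        size = position_size(capital)
        capital += size if rng.random() < P_WIN else -size
        if capital <= 0:
            return 0.0  # ruin
    return capital


# Two illustrative sizing rules evaluated under the identical baseline
print(run(lambda capital: 100.0))           # fixed position size
print(run(lambda capital: 0.02 * capital))  # fixed percent of equity
#+END_SRC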

TODO: Write about the Monte Carlo method, CDFs, Drawdown?

TODO: Move the parameters to a subsection?

Strategies

There exists a wide variety of position sizing strategies; in this study we look into the most common ones, as described in cite:Bandy2011. Others are just modified versions of the existing ones.

Fixed position sizing

In fixed position sizing, the same amount is risked on every trade regardless of the running equity. In the listing below, the play() helper, which does not appear in the original fragment, is assumed to be a single Bernoulli trial returning True on a winning trade.

#+BEGIN_SRC python
import numpy as np
import matplotlib.pyplot as plt


def play(p_win=0.5):
    # Assumed helper: a Bernoulli trial, True with probability p_win
    return np.random.random() < p_win


def fixed_bettor(initial_funds, bet_size, plays):
    current_funds = initial_funds

    bets = np.array([], dtype=int)
    funds = np.array([], dtype=float)
    drawdown = np.array([], dtype=float)

    for single_bet in range(plays):
        if current_funds > 0:
            if play():
                current_funds += bet_size
            else:
                current_funds -= bet_size

        bets = np.append(bets, single_bet)  # Record the bet index

        # Capital cannot go below zero (ruin)
        if current_funds < 0:
            current_funds = 0

        funds = np.append(funds, current_funds)

        # Drawdown: relative distance from the running equity peak
        if current_funds > 0:
            dd = (current_funds - np.amax(funds)) / np.amax(funds)
        else:
            dd = -1  # Total loss of the initial capital

        drawdown = np.append(drawdown, np.amin([0, dd]))

    plt.figure(0)
    plt.plot(bets, funds)  # Equity curve

    plt.figure(1)
    plt.plot(bets, drawdown)  # Drawdown curve

    return (current_funds, min(drawdown))
#+END_SRC
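
For example, a call such as fixed_bettor(1000, 10, 500) simulates a sequence of 500 trades risking 10 units of capital on each one, regardless of the running equity, and returns the final funds together with the worst drawdown observed.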

Fixed percent position sizing
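
This rule risks a constant fraction of the current equity on each trade, so the amount risked shrinks after losses and grows after wins. A minimal sketch in the spirit of the previous listing (not from the original text):

#+BEGIN_SRC python
import numpy as np


def fixed_percent_bettor(initial_funds, percent, plays, p_win=0.5):
    # Each bet risks a constant fraction of the current equity
    current_funds = initial_funds
    for _ in range(plays):
        bet_size = current_funds * percent
        if np.random.random() < p_win:
            current_funds += bet_size
        else:
            current_funds -= bet_size
    return current_funds
#+END_SRC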

Martingale position sizing

Progressive position sizing

Kelly formula
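
The Kelly formula gives the fraction of equity that maximizes the expected logarithmic growth of capital: f* = (b·p − q) / b, where p is the probability of winning, q = 1 − p, and b is the ratio of the amount won to the amount lost. A minimal sketch:

#+BEGIN_SRC python
def kelly_fraction(p_win, payoff_ratio):
    # Optimal fraction of equity to risk per trade; a negative value
    # means the edge is unfavorable and no bet should be taken
    q = 1.0 - p_win
    return (payoff_ratio * p_win - q) / payoff_ratio


# E.g. a 55% win rate with 1:1 payoffs suggests risking about 10% of equity
print(kelly_fraction(0.55, 1.0))
#+END_SRC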

D’Alembert position sizing

Inverse D’Alembert

Reproduce Original Paper

Apply Original method in new markets

Optimize different parameters in the original model

Measure and Control in all the cases

Experiment Results

Results Discussion

Present an analysis of the data obtained when applying the product by means of some empirical method, including premises, test conditions, and proofs of concept.

Conclusions

The conclusions must summarize the contributions made by the thesis.

The conclusions arise from:

  • The degree to which the objectives were achieved (and if they were not 100% achieved, the reason why must be indicated).
  • Particular observations regarding the methodology used.
  • Considerations regarding the available bibliography.
  • The author’s own perception of the world.
  • It will be necessary to specify the activities to be carried out at other institutions, should that be the case.
  • Bear in mind that conclusions must always be conclusions (about what was done) and must be argued, that is, they must be supported by the work that has been written.

Going over research questions again

Lessons learned

References

bibliographystyle:apacite bibliography:main.bib

Annexes and Appendices

Annex

Relatively independent sections that help toward a better understanding of the thesis and allow specific aspects to be explored in more depth, which because of their length or nature are not convenient to treat within the body of the thesis.

The annexes of the thesis will include supporting material for the chapters.

The material suitable for inclusion in the annexes is that which is not necessary for understanding the thesis, but which provides documentary evidence strictly necessary to demonstrate the solidity of the work.

Appendix

Raw data, installation recipes, how to access the database.

This is a section, also generally located at the end of the text, whose function is important for the comprehension of the main text.