Getting data in

Associated directory: 02-getting-data-in

Getting the source data

We ended previous section with recorded balances and deduced total expenses for a couple of months. Lets try to obtain a fuller picture by importing a transaction statement from bank.

You would want to get CSV statements for your account covering a reasonable period of time. You can start small: get a statement for the last month or a quarter, or a calendar/fiscal year-to-date. You can always import older statements later (this would be covered in later chapters).

For now, let's say that it is a single statement covering year-to-date. Let's save that single statement in ./import/<institution name>/in. For example, my current account was once with Lloyds and so their statements will go into ./import/lloyds/in/{filename}.csv.

I will keep using Lloyds as an example, and we will start with a simple CSV statement:

$ cat import/lloyds/in/99966633_20171223_1844.csv | head -n10

Transaction Date,Transaction Type,Sort Code,Account Number,Transaction Description,Debit Amount,Credit Amount,Balance,
25/05/2017,BGC,'12-34-56,99966633,EMPLOYER INC,,903.52,4058.83
15/05/2017,BP,'12-34-56,99966633,OASIS COFFEE ,2.76,,3155.31
05/05/2017,DEB,'12-34-56,99966633,WAITROSE,64.41,,3158.07
01/05/2017,BP,'12-34-56,99966633,AVIVA,100,,3222.48
25/04/2017,BGC,'12-34-56,99966633,EMPLOYER INC,,800.72,3322.48
18/04/2017,BP,'12-34-56,99966633,OASIS COFFEE ,2.76,,2521.76
07/04/2017,BP,'12-34-56,99966633,OASIS COFFEE ,2.76,,2524.52
07/04/2017,DEB,'12-34-56,99966633,WAITROSE,92.24,,2527.28
01/04/2017,,'12-34-56,99966633,INTEREST (NET) ,,1.21,2619.52

Frequency of downloads and statement periods largely does not matter

If you are lucky, your bank would allow you to export an arbitrary range of transactions, and you would be able to pick the convention that suits you best. For example, you might want to export at most one file per year or per quarter or per month.

However, financial institutions could be rather arbitrary in what they provide. You might be getting monthly statements, but generated on 15th of the month (so the inconveniently cross year boundary). Or maybe the maximum amount of rows in the statement will be limited by 100 or 1000 (I saw this in real life).

Or you might want to just download transactions from the last date you previously grabbed files from this financial institution, however many days that might be.

This should not matter much. Save as many or as few files as you want, combine them or split them. Multiple files do not mean additional processing overhead.

Conversion scripts

Quite often these CSV files would not be directly suitable for hledger's CSV import facility. You might want or need to do some data cleaning or transformation. Let's say that in our example I want to remove extra spaces from the description field and replace numeric account names with human-readable ones -- even though it is not necessary to import my sample file, I want to demonstrate how clean-up script could be written and ran. Lets place the clean-up script in ./import/lloyds/in2csv:

$ cat import/lloyds/in2csv

#!/bin/bash
sed -e 's/  +/ /g; s/,99966633,/,assets:Lloyds:current,/' < "$1"

Then you need to write conversion rules for hledger's CSV import command (place them in ./import/lloyds/lloyds.rules) and a script that can use these rules to print out converted journal file. Place it in ./import/lloyds/csv2journal. Usually, it is as simple as this:

$ cat import/lloyds/csv2journal

#!/bin/bash
hledger print --rules-file lloyds.rules -f "$1"

We would want to use those scripts to convert files in ./in into files in ./csv, and then convert those into journal files:

import
└── lloyds
    ├── in
    │   └── 99966633_20171223_1844.csv    ----.
    ├── csv                                   in2csv
    │   └── 99966633_20171223_1844.csv    <---'       ----.
    └── journal                                         csv2journal
        └── 99966633_20171223_1844.journal <--------------'

Automating the conversion

Build rules (from ./export/export.hs) will automatically invoke conversion scripts named in2csv and csv2journal.

The build rules are demand-driven instead of supply-driven. What I mean by that is that they would not try to discover and convert all the in/*.csv files (as "supply-driven" build system system could do).

Instead, when you include a file from import/lloyds/journal/ directory (like import/lloyds/journal/<whatever>.journal), the build rules would try and produce import/lloyds/csv/<whatever>.csv, which will in turn demand the existence of import/lloyds/in/<whatever>.csv, and would call in2csv and csv2journal scrips as necessary.

Let's "enroll" our lloyds directory into this automatic conversion mechanism by editing export/export.hs and populating the lists under these two comment lines:

   -- Enumerate directories with auto-generated cleaned csv files
   [ "//import/lloyds/csv/*.csv" ] |%> in2csv
 
   -- Enumerate directories with auto-generated journals
   [ "//import/lloyds/journal/*.journal" ] |%> csv2journal

Shake operate with absolute pathnames, so the path to import in the example above starts with // to indicate that the rule csv2journal will fire whenever you include a .journal file with import/lloyds/journal in its path name, but it could be anywhere on your disk.

Your conversion scripts will run from the directory where they reside, so you could refer to other files (like lloyds.rules) by their relative names.

Consistent commodity formatting

Normally hledger tries to deduce the format for your amounts (£1234.56 or £1,234.56 or £1234,56 or even £1_234.56) from the amounts encountered in your journals. However, it is a good practice to specify the format that you want via commodity directive.

Those directives could go into a single file (commodities.journal), which will be included into all your yearly files:

$ cat commodities.journal

commodity £1000.00

Tying it all in

In the previous section we added a bunch of manual end-of-month entries to the 2017.journal. Lets remove all of them, leaving just the opening balance.

Now everything is in place to convert the downloaded file and to .journal. Instead of copying the contents of that file into your journal for the current year, lets just include it, so now your 2017.journal will look like this:

$ cat 2017.journal

;; This journal is for the current year
include commodities.journal

;; Opening balances
2017/01/01 opening balances
  assets:Lloyds:current    = £100.00
  equity:opening balances
  
include ./import/lloyds/journal/99966633_20171223_1844.journal

As explained earlier, now that we "demand" for this journal file to exist, it will cause build rules to produce it from the corresponding source files. Crucially, build rules would not automatically convert all .csv files in import to .journal.

Now you can run ./export.sh. This will automatically run in2csv and csv2journal scripts with the right set of arguments and regenerate all your reports. If you are keeping reports under version control you should be able to diff them against the previous version and see exactly what has changed where at a glance.

You will notice that for now import rules put all expenses in the single expenses:unknown account (which means that they are not categorized in any way, and we are not technically getting a better visibility into our expenses). That's fine, let's assume that you do not have time to sort them out just now, we will do this later.

Recap

I've included a sample data file and conversion scripts in 02-getting-data-in:

import
└── lloyds
    ├── in
    │   └── 99966633_20171223_1844.csv       - sample of the original downloaded file
    ├── csv
    │   └── 99966633_20171223_1844.csv       - cleaned up file, ready to be consumed by
    │                                          'hledger print'
    ├── journal
    │   └── 99966633_20171223_1844.journal   - generated journal
    ├── in2csv                               - conversion script to produce ./csv/*
    ├── csv2journal                          - conversion script to produce ./journal/*
    └── lloyds.rules                         - CSV conversion rules

The full list of changes done from the previous step could be seen in diffs/01-to-02.diff. There is a single bit of change there not covered so far: CSV conversion rules in lloyds.rules are added as a dependency for autogenerated Lloyds journals so that changes in the rules files will cause all journals to be regenerated:

extraDeps file
  | "//lloyds//*.journal" ?== file   = ["lloyds.rules"]
  | otherwise = []

Build rules automatically add in2csv and csv2journal as dependencies to the files they produce, so any changes in those scripts will cause all relevant outputs to be regenerated.

Where we are now

At this point, our little project looks like this:

$ tree

.
├── 2017.journal
├── all.journal
├── commodities.journal
├── export
│   ├── 2017-all.journal
│   ├── 2017-balance-sheet.txt
│   ├── 2017-cash-flow.txt
│   ├── 2017-income-expenses.txt
│   ├── export
│   ├── export.hi
│   ├── export.hs
│   └── export.o
├── export.sh
├── import
│   └── lloyds
│       ├── csv
│       │   └── 99966633_20171223_1844.csv
│       ├── csv2journal
│       ├── in
│       │   └── 99966633_20171223_1844.csv
│       ├── in2csv
│       ├── journal
│       │   └── 99966633_20171223_1844.journal
│       └── lloyds.rules
└── README.md

6 directories, 19 files

Next steps

Now that you have converted a single statement, let's get the full history of a single account, save and convert all of them.

Home

Provide feedback

Saved searches

Use saved searches to filter your results more quickly