Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some items populate with duplicate dates #196

Closed
cezary4 opened this issue Nov 1, 2013 · 13 comments
Closed

Some items populate with duplicate dates #196

cezary4 opened this issue Nov 1, 2013 · 13 comments
Labels

Comments

@cezary4
Copy link

cezary4 commented Nov 1, 2013

Hey guys, I noticed that sometimes the withdrawals table populates with two entries for the same item. This appears to be rare, but we should get to the bottom of why it happens. Here are the three dates for which it happened for the 'Federal Salaries' line item. I checked and Treasury did not reissue corrected statements that day, which could have been one possible explanation for the dupe. @bdewilde @abelsonlive any ideas why this would happen?

Here is the query to pick up the duplicate days:

select date, count(date) as count FROM t2
where item like '%Federal salaries%'
and transaction_type = 'withdrawal'
group by date
order by count(date) desc

And here are the duplicate dates picked out by query:

select * from t2 
where item like '%Federal salaries%'
and transaction_type = 'withdrawal'
and date IN ('2013-07-29','2013-08-30','2013-09-03')
@cezary4
Copy link
Author

cezary4 commented Nov 12, 2013

This is the main open issue I think that we still have to work on - anyone around at CSV tomorrow nite? I'll be there, happy to help figure it out.

@abelsonlive
Copy link
Contributor

Yeah, I can look into it tomorrow night with you.

@abelsonlive
Copy link
Contributor

@cezary4: I can't really duplicate this error. when I look at the suspicious tables after they've been parsed, they look like this:

2013-07-29:
img

2013-08-30:
img

This makes me think that the problem isn't the fixies but perhaps date parsing? Let's keep this open for now, but I'm not sure how to approach the solution right now.

@cezary4
Copy link
Author

cezary4 commented Dec 20, 2013

I just replicated it - run this query in the API:

http://api.treasury.io/cc7znvq/47d80ae900e04f2/sql/?q=SELECT "table",
"date", "year_month", "year", "month", "day", "weekday", "account",
"transaction_type", "parent_item", "is_total", "is_net", "item",
"item_raw", "today", "mtd", "fytd", "url" FROM t2 WHERE ("date" >
'2013-07-30' AND "date" < '2013-09-30') AND ("transaction_type" =
'withdrawal') AND ("item" = 'Federal Salaries ( EFT )')

8/30 appears twice when you output through API; so the underlying CSV
tables only have the date once but when we output through API it appears
twice?

Cezary

On Fri, Dec 20, 2013 at 1:47 PM, Brian Abelson [email protected]:

@cezary4 https://github.com/cezary4: I can't really duplicate this
error. when I look at the suspicious tables after they've been parsed, they
looks like this:

2013-07-29:
[image: img]https://github-camo.global.ssl.fastly.net/fa9ef5d09405567c54dfae9a440aa43cc71912c0/687474703a2f2f636c2e6c792f696d6167652f315832783237324b3159336b

2013-08-30:
[image: img]https://github-camo.global.ssl.fastly.net/e7b6a2119ac97dbcff40fd059edc85c5fa36206a/687474703a2f2f636c2e6c792f696d6167652f3270316c304d316731493371

This makes me think that the problem isn't the fixies but perhaps date
parsing? Let's keep this open for now, but I'm not sure how to approach the
solution right now.


Reply to this email directly or view it on GitHubhttps://github.com//issues/196#issuecomment-31032075
.

Cezary Podkul | @cezary
Mobile: (708) 228 1319
Web: cezarypodkul.com

@abelsonlive
Copy link
Contributor

Weird, the problem seems to be that the parser is pulling in multiple files, check the urls in these "duplicate" records:

{
date: "2013-08-30",
url: "https://www.fms.treas.gov/fmsweb/viewDTSFiles?fname=13083000.txt&dir=a"
},
{
date: "2013-08-30",
url: "https://www.fms.treas.gov/fmsweb/viewDTSFiles?fname=13083001.txt&dir=a"
},

One is from 13083000.txt and the other is from 13083001.txt. However, when you go to the first url, it doesn't exist.

@abelsonlive
Copy link
Contributor

OHHHH I GET IT. At one point our parser downloaded 13083000.txt, this file was then deleted by the FMS and replaced with 13083001.txt. However, 13083000.txt was not deleted from our data/fixie directory. This means that BOTH files are being populated in the database, even though only one exists on the FMS website. Crazy

@abelsonlive
Copy link
Contributor

yep, that was it. check the logs from the previous run (see the duplicate lines for each 'problem' fixie)

img

@cezary4
Copy link
Author

cezary4 commented Dec 20, 2013

That explains it. Though I thought we had put in a line in the parser to
have it download only the latest file, no? If not, would that fix it?

On Fri, Dec 20, 2013 at 2:22 PM, Brian Abelson [email protected]:

yep, that was it. check the logs from the previous run (see the duplicate
lines for each 'problem' fixie)

[image: img]https://github-camo.global.ssl.fastly.net/0f105b8dfa7d95dd8458ed32c2509504c90c8135/687474703a2f2f662e636c2e6c792f6974656d732f316d31703147307a3045336431393161334233662f53637265656e73686f74253230323031332d31322d323025323031312e31372e31332e706e67


Reply to this email directly or view it on GitHubhttps://github.com//issues/196#issuecomment-31034640
.

Cezary Podkul | @cezary
Mobile: (708) 228 1319
Web: cezarypodkul.com

@abelsonlive
Copy link
Contributor

fixed.

@cezary4
Copy link
Author

cezary4 commented Dec 22, 2013

Thanks for taking a look at this -- I checked to see if there were any
other dates like this; luckily, there are just three dates where this
happened, all in 2013:

2013-07-29
2013-08-30
2013-09-03

Here is a query that flags them:

select * from (
select date, count(distinct(url)) as URL_count from t2
group by date
order by URL_count desc)
where URL_count > 1

Treasury.io picked up two fixies for each of those dates, which is why
the items doubled up:

2013-07-29

https://www.fms.treas.gov/fmsweb/viewDTSFiles?fname=13072900.txt&dir=a
https://www.fms.treas.gov/fmsweb/viewDTSFiles?fname=13072901.txt&dir=a

2013-08-30

https://www.fms.treas.gov/fmsweb/viewDTSFiles?fname=13083000.txt&dir=a
https://www.fms.treas.gov/fmsweb/viewDTSFiles?fname=13083001.txt&dir=a

2013-09-03
https://www.fms.treas.gov/fmsweb/viewDTSFiles?fname=13090300.txt&dir=a
https://www.fms.treas.gov/fmsweb/viewDTSFiles?fname=13090301.txt&dir=a

It looks like FMS uploaded a second one with an 01 at the end, which
is why they're doubling up.

Can we tell it to just pick up the latest one on days where there are
more than one distinct URL? Or send us an alert when there's more than
two URLs that load on any one day?

This is the last major item on the issues list; thanks everyone for
the long slog to finish

Happy holidays,

Cezary

@abelsonlive
Copy link
Contributor

It looks like the issue is that the function that requests the fixies DOES NOT download the most recent file, instead it actually downloads ALL THE AVAILABLE FILES. It seems that it was written with the built in assumption that there would only ever be one version of the file on the server. In the case of the dates in question, this must not have been the case. At least for a brief period of time. See this function:
https://github.com/csvsoundsystem/federal-treasury-api/blob/master/parser/download_fms_fixies.py#L78

I'll work on patching this so that it more intelligently requests the fixies.

@abelsonlive abelsonlive reopened this Dec 22, 2013
@abelsonlive
Copy link
Contributor

Actually, nevermind, the break here - https://github.com/csvsoundsystem/federal-treasury-api/blob/master/parser/download_fms_fixies.py#L89 - should take care of this.

@abelsonlive
Copy link
Contributor

Honestly, I can't figure out why a file would be duplicated. It seems like everything in our software should handle this. The best I can do is write a test to check for duplicated dates. See this issue: #198

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants