Monday 23 February 2015

The SEC Structured Data Sets, technically speaking

In my last post, The SEC Structured Data Sets, I talked a little about this new SEC initiative to make XBRL more accessible. This time I'm going to focus on how it works.

At this point you may want to refer to the SEC technical document Annual and Quarterly Financial Statements, the Financial Statement Data Sets page (where the files reside) and, if you want to see what the data contained in these files actually looks like, you can download one of our example spreadsheets here. The web queries in this sheet access XBRL Flat, our name for this data set. The sheet itself contains links to videos & info on how it all works. I will talk more about our item-for-item implementation of this data set in my next post.

On the Financial Statement Data Sets page, you will see there are currently 24 files. After we pass the last business day of this quarter (March 31 2015), there'll be 25. Don't try to open the latest files in Excel - they're too big - but you could download one of the early ones to take a peek at the data layout of the files contained in these zips.
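If you just want a peek without Excel, you can stream a few lines straight out of the zip. This is a minimal sketch; the zip path and member file name (`num.txt`) follow the layout described in the SEC documentation, but treat the exact file name you pass in as your own choice.

```python
import io
import zipfile

def preview(zip_path, member="num.txt", n_lines=5):
    """Return the header row and the first few records of one file
    inside a quarterly data-set zip, without extracting anything."""
    lines = []
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(member) as fh:
            text = io.TextIOWrapper(fh, encoding="utf-8", errors="replace")
            for i, line in enumerate(text):
                if i >= n_lines:
                    break
                lines.append(line.rstrip("\n"))
    return lines

# e.g. preview("2014q4.zip")  # file name per the SEC download page
```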

As the comprehensive technical document explains, there are four files. The one that counts is num.txt: this has the values. In theory this file by itself has enough in it to do your analysis - values matched with dates and, most importantly, tags for each filing. The files are not cumulative, so you need to access each one to be sure of finding your filing. This is the point, in other words, where you need to load all these files into a database. If you load it all, it's going to be big (over 10 gig for starters).
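Loading num.txt into a database can be sketched like this, assuming the tab-delimited layout with a header row that the SEC documentation describes. Note the table columns are taken from the header rather than hard-coded, for reasons that come up later.

```python
import csv
import sqlite3

def load_num(conn, lines):
    """Load tab-delimited num.txt lines into a SQLite table,
    taking column names from the header row (field order can vary)."""
    reader = csv.reader(lines, delimiter="\t")
    cols = next(reader)  # header record
    col_sql = ", ".join('"%s" TEXT' % c for c in cols)
    conn.execute("CREATE TABLE IF NOT EXISTS num (%s)" % col_sql)
    placeholders = ", ".join("?" for _ in cols)
    conn.executemany("INSERT INTO num VALUES (%s)" % placeholders, reader)
    conn.commit()
```

In practice you would feed it the file object from the zip; everything is stored as TEXT here for simplicity, though the value column would normally deserve a numeric type.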

The filings are keyed on the Accession Number (adsh), which is what EDGAR uses, so if you want to find the values for a particular company, you need to look up the adsh. This is where the sub.txt file comes in: it contains the company names & CIK's, so you need to load this into another table. Of course you could just find the adsh by going to EDGAR or our website - if you select a filing, the adsh corresponds to the aNo= value in the address bar, except the adsh adds some annoying dashes! (In our implementation, we use a more comprehensive and timely database for these lookups.)
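A sketch of that lookup, assuming sub.txt has been parsed into dicts keyed by the field names in the SEC documentation (`adsh`, `name`, `cik`); the helper names here are my own.

```python
def find_adsh(sub_rows, company_substr):
    """Return (adsh, name) pairs whose company name contains the
    given substring (case-insensitive)."""
    needle = company_substr.upper()
    return [(r["adsh"], r["name"])
            for r in sub_rows
            if needle in r["name"].upper()]

def adsh_to_ano(adsh):
    """The same accession number as it appears in an EDGAR-style
    address bar: the adsh with the dashes stripped out."""
    return adsh.replace("-", "")
```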

You could stop there - all the values we download in XBRL Flat, for example, come from just these two files. But if you want to see what the company has called these data items, and if, like us, you are a stickler for as-reported data, then pre.txt contains the statement layout along with the labels. The final file, tag.txt, contains reference information on the tags; whether you need it depends on how deep you want to go.
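Attaching the company's own labels means joining num.txt to pre.txt. A minimal sketch, assuming rows parsed into dicts with the documented field names (`adsh`, `tag`, `plabel`); the real join key may also need the `version` field, so treat the key here as an assumption.

```python
def label_values(num_rows, pre_rows):
    """Attach the as-reported label (plabel from pre.txt) to each
    value row from num.txt, matched on (adsh, tag)."""
    labels = {(p["adsh"], p["tag"]): p["plabel"] for p in pre_rows}
    out = []
    for n in num_rows:
        labelled = dict(n)
        labelled["plabel"] = labels.get((n["adsh"], n["tag"]), "")
        out.append(labelled)
    return out
```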

So what to watch out for? Duplicate values! Surely impossible, but no: they've been seen and verified in the original filings. And the fields aren't always in the order shown in the documentation, so use the header records. Also, you may want to exclude any records where the coreg field is populated, as in these cases you're more than likely not looking at a value for the entire consolidated entity (I anticipate this will become more prevalent and relevant when they release values for the notes).
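Those two clean-up steps can be sketched together: drop co-registrant rows and flag facts that appear twice with different values. The key I use for a "fact" (adsh, tag, version, ddate, qtrs, uom) is my reading of the documented fields, not something the data set itself defines.

```python
def clean_num(rows):
    """Drop rows where coreg is populated, and report keys that
    occur more than once with conflicting values."""
    consolidated = [r for r in rows if not r.get("coreg")]
    seen, conflicts = {}, []
    for r in consolidated:
        key = (r["adsh"], r["tag"], r.get("version"),
               r["ddate"], r.get("qtrs"), r.get("uom"))
        if key in seen and seen[key] != r["value"]:
            conflicts.append(key)
        seen[key] = r["value"]
    return consolidated, conflicts
```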

Two small bits of standardisation have occurred in these files that are not explicitly documented.

The financial statement headings do not have standard names and tags in the US-GAAP implementation of XBRL, i.e. in the original filings (yes, I know - ridiculous!), but they do now in this SEC data set; these names and codes (or shall we call them tags?!) are all listed in the technical document.

The only slight problem is that a filing can have two financial statements bearing the same code (e.g. BS): one for the consolidated group and one for the parent. Can you tell the difference? Er, no. Of course the parent one should have far fewer items, but if I'm going to read this with a computer, I have to go through the hassle of counting items or something, and that is not necessarily an exhaustive solution. There is also a code called UN (Unclassifiable Statement), which suggests that the SEC classifications may not themselves be exhaustive! I don't actually know why I'm going on about this, as we solve this problem (differently) in our full database.
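Detecting that ambiguity at least is mechanical: within one filing, look for a stmt code that spans more than one report in pre.txt. A sketch, assuming the `stmt` and `report` field names from the SEC documentation.

```python
def ambiguous_statements(pre_rows):
    """For one filing's pre.txt rows, find stmt codes (BS, IS, ...)
    that appear in more than one report - e.g. a consolidated and a
    parent-only balance sheet sharing the code BS."""
    reports = {}
    for r in pre_rows:
        reports.setdefault(r["stmt"], set()).add(r["report"])
    return {stmt: sorted(reps)
            for stmt, reps in reports.items() if len(reps) > 1}
```

This only tells you the code is ambiguous; deciding which report is the consolidated one (by item count or otherwise) is the unsolved part discussed above.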

Secondly, the month ends of the dates attached to each value have been standardised. This is good and bad. Good, as it makes searching and aggregating easier (unless you were specifically looking for values at Apple's year end, 27th Sept 2014, when the values are held as 30th Sept). Bad, as those few companies that don't adhere to standard month-end periods (last Friday of the month, etc.) can have, say, 51-week or 53-week years, and you wouldn't know it from the num.txt file. This could lead to, say, revenues being over- or understated by 2% on an annual basis, more so if quarterly. To pick this up, you need to keep an eye on the period date in the sub.txt file (don't use the fye field, as it's filled in inconsistently - 0927 for Apple but 1130 for Adobe).
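One way to spot a short or long fiscal year is to measure the gap between two consecutive annual period ends and compare it with the usual 364/365 days. A sketch under the assumption that you have period dates in the YYYYMMDD form the data set uses.

```python
from datetime import date

def fiscal_year_days(period_end, prior_period_end):
    """Days between two consecutive annual period ends (YYYYMMDD
    strings). Well off 364-366 suggests a 51- or 53-week year."""
    def parse(s):
        return date(int(s[:4]), int(s[4:6]), int(s[6:8]))
    return (parse(period_end) - parse(prior_period_end)).days
```

For a calendar-month filer the gap is 365 or 366; a 52/53-week filer like Apple (27 Sept 2014 vs 28 Sept 2013) shows 364, and a 53-week year would show 371.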

Probably worth reiterating that no additional standardisation of data values has occurred in this data set. For more details on what needs to be done, see my Missing Totals post.

Next time I will explain how we have replicated this database and what you can do with it.
