Income Declarations Scrape
Several months ago, we published a blog post criticizing the Civil Service Bureau’s (CSB) income declarations website, http://declaration.ge, for not making its data available in a usable format. Although the CSB responded to some of our criticisms, the data on declaration.ge is still difficult to search and analyze because it’s in PDF format. Giving the public access to government data that can be analyzed and searched is important: it keeps government officials honest, and it can even create jobs.
Therefore, we’re pleased to announce that in order to make official income declarations in Georgia more usable, we’ve scraped declaration.ge and converted the bulk data into a series of Excel spreadsheets, which you can download from our website. Using these Excel files, you can search the declarations data by any criteria you like.
Do you have the name of a corporation, and want to know if any officials own shares in it? You can do that (look in the file labeled “securities”). Or maybe you just want to see which officials drive a Mercedes? You can do that too (look in “property”). To be clear, we haven’t hacked declaration.ge -- all of the information contained within these files is already publicly available. The point is that when the data was inside PDFs, it was very difficult to answer questions like these.
The data is not perfect; there are ways that it could be made even easier to search, and it only includes what was already available on declaration.ge. If a declaration was filled out incompletely, the data in these files will be incomplete as well. Nonetheless, the data is now a lot more usable than it was. Although we have criticized declaration.ge in the past, we strongly support the CSB’s hard work in creating it -- without their efforts to collate official income declarations digitally and place them online, this type of automated processing would not be possible.
Before 2010, income declarations were filled out by hand and then scanned; we could not have processed scanned declarations in an automated way. We hope that this usage of the declarations data will serve as an example to the CSB of ways that they can continue to improve declaration.ge. We hope that people will put this data to good use -- if you find anything interesting, let us know!
Brief description of technical details:
- All the PDF files on declaration.ge are downloaded using a shell script and curl.
- The PDF files are converted to HTML format using another shell script and pdftohtml.
- The HTML files are then parsed by a Python script using BeautifulSoup and a lot of custom parsing code.
- Another Python script loads the results of the parsing script into a CouchDB database using couchdb-python.
- We used couchapp to define CouchDB view functions that output each section of the income declarations, which are then fed into a CouchDB list function that outputs CSV (the actual delimiter is a pipe “|” because there are a lot of commas in the data).
- Another shell script uses curl again to download the results of the list function into separate CSV files for each section.
- The CSV files are imported into LibreOffice and saved as Excel spreadsheets.
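To illustrate why the pipe delimiter matters in the last few steps, here is a minimal Python sketch using only the standard library. It is not the code used for the scrape: the column names (`official`, `company`, `shares`) and the sample rows are hypothetical, and the real files may be laid out differently. It writes records to pipe-delimited text, so commas inside a field need no special handling, and then searches one column for a substring.

```python
import csv
import io

DELIMITER = "|"  # pipe instead of comma, since the data contains many commas


def rows_to_delimited(rows, fieldnames):
    """Serialize a list of dicts to pipe-delimited text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=fieldnames, delimiter=DELIMITER, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def search(text, column, needle):
    """Return rows whose `column` contains `needle` (case-insensitive)."""
    reader = csv.DictReader(io.StringIO(text), delimiter=DELIMITER)
    needle = needle.lower()
    return [row for row in reader if needle in (row.get(column) or "").lower()]


# Hypothetical sample records; the real column names are an assumption.
FIELDS = ["official", "company", "shares"]
sample = [
    {"official": "Example Official A", "company": "Example Holdings, Ltd.", "shares": "25%"},
    {"official": "Example Official B", "company": "Sample Trading", "shares": "10%"},
]

text = rows_to_delimited(sample, FIELDS)
matches = search(text, "company", "holdings")
```

The same `search` function should work on one of the downloaded section files if you read the file's contents into `text` and pass a column name that actually appears in its header row.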
We’d like to thank Open Society Institute for making it possible for us to open up this data.
Latest upload, on Feb. 27, 2014: (hosted on Google Drive: click on "File" -> "Download" to download all the CSV files, in a ZIP file)
Upload on Feb. 11, 2013: 11Feb2013.tar_.gz
Original upload (Jan. 30, 2012):
If you need help using this data, please contact firstname.lastname@example.org.