Issue #38: Include scripts to insert data from health ministry

Signed-off-by: Cristian Weiland <cw14@inf.ufpr.br>

Commit a19cac0a authored 7 years ago by Cristian Weiland
--- a/scripts/health_ministry/README
+++ b/scripts/health_ministry/README
+The easiest way to insert health ministry data is to use 'insert_health_ministry.sh'.
+
+Script's input: Year and month of the data to be inserted, followed by ElasticSearch's user and password. The script computes the last day of the month itself.
+The script also uses a config file named 'config.sh'. This is a small shell script that only needs to declare three variables: index, host and filter:
+- Index: The index prefix to be saved on ElasticSearch.
+- Host: The hostname of the machine running ElasticSearch.
+- Filter: A string that will be used to 'egrep' the data obtained from Portal Transparência.
+
+Example: ./insert_health_ministry.sh 2016 10 myuser mypass
+Example 2: ./insert_health_ministry.sh 2014 11 power ranger
+
+The other scripts are called by 'insert_health_ministry.sh' with the right arguments.
--- a/scripts/health_ministry/config.sh
+++ b/scripts/health_ministry/config.sh
+# This file only contains some config variables:
+
+# Index prefix: The prefix of the index in elasticsearch. Ex: gastos
+
+index="ms-gastos-pagamentos"
+
+# Filter: A string that will be used to 'egrep' the data obtained from Portal Transparência.
+
+filter="MINISTERIO DA SAUDE"
+
+# Host: ElasticSearch's host. Ex: "localhost"
+
+host="localhost"
--- a/scripts/health_ministry/create_health_ministry_config.py
+++ b/scripts/health_ministry/create_health_ministry_config.py
+#!/usr/bin/env python3
+
+# WARNING: This script should not be called directly. Look at 'insert_health_ministry.sh' before calling this script.
+
+# This script is used to create a Logstash Config file.
+
+# Input: year, month and day, index prefix, ElasticSearch's host, username and password.
+
+import sys
+
+if len(sys.argv) != 8:
+    print("Usage: " + sys.argv[0] + " <year (2016)> <month (01)> <day (31)> <index> <host> <username> <password>")
+    sys.exit(1)
+
+with open('logstash_config.example') as infile:
+    example = infile.read()
+
+output = example % { "timestamp": sys.argv[3] + '/' + sys.argv[2] + '/' + sys.argv[1] + ' 00:00:00'
+                     , "date": sys.argv[1]
+                     , "index": sys.argv[4]
+                     , "host": sys.argv[5]
+                     , "user": sys.argv[6]
+                     , "password": sys.argv[7] }
+
+with open('./tmp_' + sys.argv[1] + '-' + sys.argv[2] + '/config-' + sys.argv[1] + '-' + sys.argv[2], 'w') as outfile:
+    outfile.write(output)
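Note on the template step: create_health_ministry_config.py fills the %(name)s placeholders of logstash_config.example using Python's printf-style formatting with a dict. The following is a minimal sketch of that mechanism, not part of the commit. TEMPLATE is a shortened stand-in for the real file, and render_config is a hypothetical helper name.

```python
# Shortened stand-in for logstash_config.example (assumption: the real file is longer).
TEMPLATE = '''output {
    elasticsearch {
        user => "%(user)s"
        hosts => "http://%(host)s:9200"
        index => "%(index)s-%(date)s"
    }
}
'''


def render_config(template, year, month, day, index, host, user, password):
    """Fill the %(name)s placeholders with a dict, as the script does."""
    values = {
        "timestamp": day + "/" + month + "/" + year + " 00:00:00",
        "date": year,  # only the year is appended to the index name
        "index": index,
        "host": host,
        "user": user,
        "password": password,
    }
    # Keys that the template does not mention are simply ignored.
    return template % values


if __name__ == "__main__":
    print(render_config(TEMPLATE, "2016", "10", "31",
                        "ms-gastos-pagamentos", "localhost", "myuser", "mypass"))
```

One consequence of this approach is that any literal percent sign in the template itself has to be written as %%, otherwise the formatting step raises an error.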
--- a/scripts/health_ministry/insert_health_ministry.sh
+++ b/scripts/health_ministry/insert_health_ministry.sh
+#!/bin/bash
+
+# This script is the one that should be called to insert data from one month.
+
+# Input: Year and month of the data to be inserted, ElasticSearch's user and password. The last day of the month is computed below.
+# Example: ./insert_health_ministry.sh 2016 10 myuser mypass
+# It has 4 steps:
+#   1- Download files and put them in the right location.
+#   2- Generate logstash config file via create_health_ministry_config.py.
+#   3- Generate a CSV with only the rows matching 'filter' (from config.sh) via resume_health_ministry.sh, which is stored in ./tmp_year-month/yearmonth.csv
+#   4- Insert data in ElasticSearch via logstash, using the config file created and the CSV created by resume_health_ministry.sh.
+# Output: The commands/scripts outputs.
+
+if [ "$#" -ne 4 ]; then
+    echo "Usage: $0 <year> <month> <user> <password>"
+    echo "Example: $0 2016 12 myuser mypass"
+    exit 1
+fi
+
+source ./config.sh
+
+if [ -z "${index+x}" ]; then
+    echo "Var 'index' is unset. Set it in file 'scripts/health_ministry/config.sh'."
+    exit 1
+fi
+if [ -z "${host+x}" ]; then
+    echo "Var 'host' is unset. Set it in file 'scripts/health_ministry/config.sh'."
+    exit 1
+fi
+if [ -z "${filter+x}" ]; then
+    echo "Var 'filter' is unset. Set it in file 'scripts/health_ministry/config.sh'."
+    exit 1
+fi
+
+# Change variable names to improve legibility
+year=$1
+month=$2
+
+# Getting the Last day of this month (Using date 2016-05-15 as example):
+# First, get next month (201606).
+aux=$(date +%Y%m -d "$(date +${year}${month}15) next month")
+# Append day 01 (20160601).
+temp=$(date -d "${aux}01")
+# Remove 1 day: 20160531, get only day: 31.
+day=$(date -d "$temp - 1 day" "+%d")
+
+ym=$year-$month
+path="./tmp_$ym"
+
+mkdir -p "$path"
+
+# Step 1:
+# Download files
+request='http://arquivos.portaldatransparencia.gov.br/downloads.asp?a='${year}'&m='${month}'&consulta=GastosDiretos'
+curl -o $path/${year}${month}_GastosDiretos.zip $request -H 'Accept-Encoding: gzip, deflate, sdch' -H 'Accept-Language: en-US,en;q=0.8' -H 'Upgrade-Insecure-Requests: 1' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' -H 'Referer: http://transparencia.gov.br/downloads/mensal.asp?c=GastosDiretos' -H 'Cookie: ASPSESSIONIDAQRABSAD=OJDLNBCANLIDINCHJHELHHFB; ASPSESSIONIDAQSDCQAD=BOKBKPNCDKOBJKGAMMEKADFL; _ga=GA1.3.1927288562.1481545643; ASPSESSIONIDSCSBBTCD=IGJLJBBCEEJBGLOOJKGNMHBH' -H 'Connection: keep-alive' --compressed
+
+# Unzip them
+unzip -o $path/${year}${month}_GastosDiretos.zip -d $path/
+
+# Remove zip file
+rm $path/${year}${month}_GastosDiretos.zip
+
+# Step 2:
+./create_health_ministry_config.py "$year" "$month" "$day" "$index" "$host" "$3" "$4"
+# Step 3:
+./resume_health_ministry.sh "${path}" ${year}-${month} "$filter"
+# Step 4:
+logstash -f $path/config-${year}-${month} < $path/${year}${month}.csv
+
+# Data inserted, we can now remove it.
+rm $path/${year}${month}.csv
+rm $path/config-${year}-${month}
+rm $path/${year}${month}_GastosDiretos.csv
+rmdir $path
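Note on the last-day computation: insert_health_ministry.sh derives the last day of the month with three chained date calls (next month, first day of that month, minus one day). As a cross-check only, and not the method the script uses, the same value can be obtained in Python with the standard calendar module. The function name last_day_of_month is hypothetical.

```python
import calendar


def last_day_of_month(year: int, month: int) -> int:
    """Return the number of the last day of the given month."""
    # monthrange returns (weekday of the first day, number of days in the month).
    return calendar.monthrange(year, month)[1]


if __name__ == "__main__":
    # Same example date as in the script's comments (May 2016).
    print(last_day_of_month(2016, 5))
```

This can be handy for spot-checking leap years, where February 2016 should give 29 and February 2015 should give 28.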
--- a/scripts/health_ministry/logstash_config.example
+++ b/scripts/health_ministry/logstash_config.example
+input {
+	stdin {
+		codec => plain {
+			charset => "Windows-1252"
+		}
+	}
+}
+
+filter {
+	csv {
+		columns => ["Código Órgão Superior","Nome Órgão Superior","Código Órgão","Nome Órgao","Código Unidade Gestora","Nome Unidade Gestora","Código Grupo Despesa","Nome Grupo Despesa","Código Elemento Despesa","Nome Elemento Despesa","Código Função","Nome Função","Código Subfunção","Nome Subfunção","Código Programa","Nome Programa","Código Ação","Nome Ação","Linguagem Cidadã","Código Favorecido","Nome Favorecido","Número Documento","Gestão Pagamento","Data Pagamento","Valor"]
+		separator => "	"
+		add_field => { "timestamp" => "%(timestamp)s" }
+	}
+	mutate {
+		convert => { "Código Órgão Superior" => "integer" }
+		convert => { "Código Órgão" => "integer" }
+		convert => { "Código Unidade Gestora" => "integer" }
+		convert => { "Código Grupo Despesa" => "integer" }
+		convert => { "Código Elemento Despesa" => "integer" }
+		convert => { "Código Função" => "integer" }
+		convert => { "Código Subfunção" => "integer" }
+		convert => { "Código Programa" => "integer" }
+		convert => { "Código Ação" => "integer" }
+		convert => { "Código Favorecido" => "integer" }
+		convert => { "Gestão Pagamento" => "integer" }
+		convert => { "Valor" => "float" }
+	}
+	date {
+		match => [ "timestamp", "dd/MM/YYYY HH:mm:ss", "ISO8601" ]
+		target => [ "@timestamp" ]
+	}
+	date {
+        match => [ "Data Pagamento", "dd/MM/YYYY" ]
+        target => [ "Data Pagamento Timestamp" ]
+	}
+}
+
+output {
+	elasticsearch {
+		action => "index"
+		user => "%(user)s"
+		password => "%(password)s"
+		hosts => "http://%(host)s:9200"
+		index => "%(index)s-%(date)s"
+		workers => 1
+	}
+	stdout {}
+}
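Note on the date filters: both date blocks in logstash_config.example expect day/month/year strings, and the "timestamp" field is the string built by create_health_ministry_config.py. As a rough illustration of that layout, assuming the usual correspondence between these patterns and Python's strptime directives, a sketch with hypothetical helper names and made-up sample values:

```python
from datetime import date, datetime

# Approximate strptime equivalents of the patterns used in the config:
#   "dd/MM/YYYY HH:mm:ss"  ->  "%d/%m/%Y %H:%M:%S"
#   "dd/MM/YYYY"           ->  "%d/%m/%Y"
TIMESTAMP_FORMAT = "%d/%m/%Y %H:%M:%S"
PAYMENT_DATE_FORMAT = "%d/%m/%Y"


def parse_timestamp(text: str) -> datetime:
    """Parse the 'timestamp' field added by the csv filter."""
    return datetime.strptime(text, TIMESTAMP_FORMAT)


def parse_payment_date(text: str) -> date:
    """Parse a 'Data Pagamento' value (sample values here are invented)."""
    return datetime.strptime(text, PAYMENT_DATE_FORMAT).date()


if __name__ == "__main__":
    print(parse_timestamp("31/10/2016 00:00:00"))
    print(parse_payment_date("05/10/2016"))
```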
--- a/scripts/health_ministry/process_health_ministry.sh
+++ b/scripts/health_ministry/process_health_ministry.sh
+#!/bin/bash
+
+# WARNING: This script should not be called unless the database is erased. It's still here for 2 reasons:
+# 1- Log: To know which months of data have been inserted.
+# 2- Example: To show how to call the script insert_health_ministry.sh.
+
+# This script only calls insert_health_ministry for all years and months.
+
+if [ "$#" -ne 2 ]; then
+	echo "Usage: $0 <user> <password>"
+	echo "Example: $0 myuser mypass"
+	exit 1
+fi
+
+./insert_health_ministry.sh 2017 01 $1 $2
+./insert_health_ministry.sh 2017 02 $1 $2
+
+./insert_health_ministry.sh 2016 12 $1 $2
+./insert_health_ministry.sh 2016 11 $1 $2
+./insert_health_ministry.sh 2016 10 $1 $2
+./insert_health_ministry.sh 2016 09 $1 $2
+./insert_health_ministry.sh 2016 08 $1 $2
+./insert_health_ministry.sh 2016 07 $1 $2
+./insert_health_ministry.sh 2016 06 $1 $2
+./insert_health_ministry.sh 2016 05 $1 $2
+./insert_health_ministry.sh 2016 04 $1 $2
+./insert_health_ministry.sh 2016 03 $1 $2
+./insert_health_ministry.sh 2016 02 $1 $2
+./insert_health_ministry.sh 2016 01 $1 $2
+
+./insert_health_ministry.sh 2015 12 $1 $2
+./insert_health_ministry.sh 2015 11 $1 $2
+./insert_health_ministry.sh 2015 10 $1 $2
+./insert_health_ministry.sh 2015 09 $1 $2
+./insert_health_ministry.sh 2015 08 $1 $2
+./insert_health_ministry.sh 2015 07 $1 $2
+./insert_health_ministry.sh 2015 06 $1 $2
+./insert_health_ministry.sh 2015 05 $1 $2
+./insert_health_ministry.sh 2015 04 $1 $2
+./insert_health_ministry.sh 2015 03 $1 $2
+./insert_health_ministry.sh 2015 02 $1 $2
+./insert_health_ministry.sh 2015 01 $1 $2
+
+./insert_health_ministry.sh 2014 12 $1 $2
+./insert_health_ministry.sh 2014 11 $1 $2
+./insert_health_ministry.sh 2014 10 $1 $2
+./insert_health_ministry.sh 2014 09 $1 $2
+./insert_health_ministry.sh 2014 08 $1 $2
+./insert_health_ministry.sh 2014 07 $1 $2
+./insert_health_ministry.sh 2014 06 $1 $2
+./insert_health_ministry.sh 2014 05 $1 $2
+./insert_health_ministry.sh 2014 04 $1 $2
+./insert_health_ministry.sh 2014 03 $1 $2
+./insert_health_ministry.sh 2014 02 $1 $2
+./insert_health_ministry.sh 2014 01 $1 $2
+
+./insert_health_ministry.sh 2013 12 $1 $2
+./insert_health_ministry.sh 2013 11 $1 $2
+./insert_health_ministry.sh 2013 10 $1 $2
+./insert_health_ministry.sh 2013 09 $1 $2
+./insert_health_ministry.sh 2013 08 $1 $2
+./insert_health_ministry.sh 2013 07 $1 $2
+./insert_health_ministry.sh 2013 06 $1 $2
+./insert_health_ministry.sh 2013 05 $1 $2
+./insert_health_ministry.sh 2013 04 $1 $2
+./insert_health_ministry.sh 2013 03 $1 $2
+./insert_health_ministry.sh 2013 02 $1 $2
+./insert_health_ministry.sh 2013 01 $1 $2
--- a/scripts/health_ministry/resume_health_ministry.sh
+++ b/scripts/health_ministry/resume_health_ministry.sh
+#!/bin/bash
+
+# WARNING: This script should not be called directly. Look at 'insert_health_ministry.sh' before calling this script.
+
+# Input: The first parameter is the path to the data files, the second is the date in the file names and the third is the filter string. Data files can be found in: http://transparencia.gov.br/downloads/mensal.asp?c=GastosDiretos
+# Example: ./resume_health_ministry.sh ./tmp_2016-11 2016-11 "MINISTERIO DA SAUDE"
+
+# Output: A CSV file in <path>, named after the date without the hyphen (e.g. 201611.csv), containing only the lines that match the filter (in our case, from the Health Ministry).
+
+if [ "$#" -ne 3 ]; then
+	echo "Usage: $0 <path> <date> <filter>"
+	exit 1
+fi
+
+# Path example: ./tmp_2016-11
+path=$1
+# Date example: 2016-11
+date=$2
+# Filter example: "MINISTERIO DA SAUDE"
+filter=$3
+# dateWithoutHyphen example: 201611
+dateWithoutHyphen=${date//-}
+
+input="${path}/${dateWithoutHyphen}_GastosDiretos.csv"
+output="${path}/${dateWithoutHyphen}.csv"
+
+# About this command:
+# - Grep removes unnecessary data.
+# - Tr removes null characters (ctrl + @).
+
+cat "$input" | egrep --binary-files=text "$filter" | tr -d '\000' > "$output"
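Note on the filtering pipeline: the last line of resume_health_ministry.sh keeps only the lines that match the filter and then strips null characters. The following Python sketch mirrors that order of operations on raw bytes. It is an approximation under stated assumptions: Python's re syntax is not identical to egrep's extended regular expressions, the cp1252 encoding of the pattern is taken from the charset in logstash_config.example, and filter_lines is a hypothetical name.

```python
import io
import re


def filter_lines(raw: bytes, pattern: str) -> bytes:
    """Keep lines matching pattern, then drop NUL bytes (like egrep | tr -d)."""
    regex = re.compile(pattern.encode("cp1252"))
    # Iterating over BytesIO splits on b"\n" only and keeps the line endings.
    kept = [line for line in io.BytesIO(raw) if regex.search(line)]
    # NUL bytes are removed after matching, as in the shell pipeline.
    return b"".join(kept).replace(b"\x00", b"")


if __name__ == "__main__":
    # Invented sample rows, tab-separated like the downloaded file.
    sample = (b"1\tMINISTERIO DA SAUDE\t10,00\x00\n"
              b"2\tMINISTERIO DA EDUCACAO\t5,00\n")
    print(filter_lines(sample, "MINISTERIO DA SAUDE"))
```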
--- a/scripts/insert_data.sh
+++ b/scripts/insert_data.sh
@@ -21,3 +21,6 @@ fi

 # Now, insert Workers data.
 (cd workers && ./insert_register_payment.sh $1 $2 $3 $4)
+
+# Last but not least, insert data from Health Ministry.
+(cd health_ministry && ./insert_health_ministry.sh $1 $2 $3 $4)