Optimering af indsamling af Borgerforslagsdata

For et par uger siden skrev jeg om en lille robot, jeg har lavet, der tjekker antallet af stemmer per borgerforslagborgerforslag.dk.

Første udgave af robotten gemte det aktuelle stemmeantal for hvert aktivt borgerforslag hvert 10. minut, og da der både er en del borgerforslag og en del minutter, blev det ret hurtigt til ret mange registreringer i min database.

Jeg kom i tanke om, at det kun er nødvendigt at gemme stemmeantallet, når stemmeantallet har ændret sig siden sidste registrering. Hvis et forslag er viralt, registreres stemmeantallet stadigvæk hvert 10. minut. Hvis et forslag er døende, kan der gå meget længere tid mellem hver registrering.

Her er den nye udgave af robotten, som tjekker om der findes andre registreringer af samme forslag med samme stemmeantal, og kun gemmer antal stemmer, hvis der ikke gør:

import requests
from datetime import datetime
import locale
import psycopg2
from psycopg2 import Error

# Locale is set to Danish to parse dates correctly
locale.setlocale(locale.LC_TIME, ('da_DK', 'UTF-8'))

# API url
url = 'https://www.borgerforslag.dk/api/proposals/search'

# Query parameters
suggestions_per_request = 300
params_json = {
	"filter": "active",
	"sortOrder": "NewestFirst",
	"searchQuery":"",
	"pageNumber":0,
	"pageSize": suggestions_per_request
}

# Connect to database
try:
	connection = psycopg2.connect(user = "",
									password = "",
									host = "",
									port = "",
									database = "")
	cursor = connection.cursor()
except (Exception, psycopg2.Error) as error:
	print ("Error while connecting to PostgreSQL", error)

now = datetime.utcnow()

# Insert into database function
def insert_suggestion_and_votes(connection, suggestion):
	with connection:
		with connection.cursor() as cur:
			try:
				# By default, votes are inserted, except when no new votes have been added
				# This variable is used to keep track of whether votes should be inserted
				insert_votes = True
				
				# See if suggestion already exists in table table borgerforslag_suggestion
				sql = '''SELECT * FROM borgerforslag_suggestion WHERE unique_id = %s'''
				cur.execute(sql, (suggestion['externalId'],))
				suggestion_records = cur.fetchone()
				# If suggestion does not already exist, add suggestion to table borgerforslag_suggestion
				if not suggestion_records:
					suggestion_data = (suggestion['externalId'],suggestion['title'],suggestion['date'],suggestion['url'],suggestion['status'])
					sql = '''INSERT INTO borgerforslag_suggestion(unique_id,title,suggested_date,url,status) VALUES(%s,%s,%s,%s,%s) RETURNING id'''
					cur.execute(sql, suggestion_data)
					id = cur.fetchone()[0]
				# If yes, get id of already added suggestion
				else:
					id = suggestion_records[0]
					# Check in table table borgerforslag_vote whether a record with the same number of votes exists.
					# If it does, no need to save votes
					sql = '''SELECT * FROM borgerforslag_vote WHERE suggestion_id = %s AND votes = %s'''
					cur.execute(sql, (id,suggestion['votes']))
					vote_record = cur.fetchone()
					if vote_record:
						insert_votes = False

				# Add votes to table borgerforslag_vote (if suggestion is new or vote count has changed since last run)
				if insert_votes == True:
					sql = '''INSERT INTO borgerforslag_vote(suggestion_id,timestamp,votes)
					VALUES(%s,%s,%s)'''
					cur.execute(sql, (id,now,suggestion['votes']))
			except Error as e:
				print(e, suggestion)

# Loop preparation
requested_results = 0
number_of_results = requested_results + 1
number_of_loops = 0

# Loop to get suggestions and add them to database
while requested_results < number_of_results and number_of_loops < 10:
	response = requests.post(url, json=params_json)
	json_response = response.json()
	number_of_results = json_response['resultCount']
	requested_results += suggestions_per_request
	number_of_loops += 1
	params_json['pageNumber'] += 1
	for suggestion in json_response['data']:
		suggestion['date'] = datetime.strptime(suggestion['date'], '%d. %B %Y')	# convert date to datetime
		insert_suggestion_and_votes(connection, suggestion)

Oprydning

Nu hvor jeg fik gjort tempoet, min database vokser med, lidt langsommere, ville jeg også gerne rydde lidt op i de gamle registreringer, hvor jeg jo havde gemt antal stemmer hvert 10. minut, uanset om antallet havde ændret sig.

Det skrev jeg også et lille script til. Her er logikken at jeg henter alle stemmeregistreringer sorteret efter hvilket borgerforslag, de hører til, og dernæst efter tidspunkt for registreringen.

Med rækkefølgen på plads, kan jeg for hver registrering tjekke, om den både vedrører samme borgerforslag som den tidligere registrering, og at stemmeantallet er det samme som den tidligere registrering. Hvis begge dele er sandt, er registreringen overflødig og kan slettes:

import psycopg2
from psycopg2 import Error

# Connect to database
try:
	connection = psycopg2.connect(user = "",
									password = "",
									host = "",
									port = "",
									database = "")
	cursor = connection.cursor()
except (Exception, psycopg2.Error) as error:
	print ("Error while connecting to PostgreSQL", error)

with connection:
	with connection.cursor() as cur:
		sql = '''SELECT "borgerforslag_vote"."id", "borgerforslag_vote"."suggestion_id", "borgerforslag_vote"."timestamp", "borgerforslag_vote"."votes" FROM "borgerforslag_vote" ORDER BY "borgerforslag_vote"."suggestion_id" ASC, "borgerforslag_vote"."timestamp" ASC'''
		cur.execute(sql)
		rows = cur.fetchall()

		previous_vote_number = -1
		previous_vote_suggestion = -1000
		for row in rows:
			votes = row[3]
			suggestion = row[1]
			id = row[0]
			if votes == previous_vote_number and previous_vote_suggestion == suggestion:
				sql = '''DELETE FROM "borgerforslag_vote" WHERE "borgerforslag_vote"."id" = %s'''
				cur.execute(sql, (id, ))
			previous_vote_number = row[3]
			previous_vote_suggestion = row[1]