My own private eReolen

I like reading books, and I borrow many of them through eReolen. But for a book nerd, eReolen isn't particularly user-friendly. There are admittedly plenty of odd searches you can run if you are technical enough, but something as basic as seeing which books have actually just been added is hard to keep track of.

eReolen does have a section they call “nyheder” (new arrivals), built on a search that at the time of writing (February 2022) looks something like this:

(dkcclterm.op=202112* OR dkcclterm.op=202201*) AND term.type=ebog and facet.category=voksenmaterialer

If you look up the documentation of the search indexes (“brøndindekser”), you can see that “dkcclterm.op” covers:

dkcclterm.op (op): Oprettelsesdato (creation date)

But why would a view of new arrivals search on creation dates in December and January? It's February now.

Because a title's “Oprettelsesdato” is not the same as the date the title was added to eReolen. What it actually refers to, I don't know for certain, but it is definitely not the date the title appeared on eReolen.

Which means that interesting books can keep turning up whose “dkcclterm.op” value lies far back in time.

And that means I risk missing something I would like to read.

So what did I do?

I built my own eReolen! With a robot that checks every night which titles are actually new. Every morning an email is waiting for me saying how many titles the robot has found, and if time and coffee allow, I can go through the new titles over breakfast.

It works like this:

In Django, I built a data model for titles and their various metadata:

from django.db import models
from isbn_field import ISBNField

# Related models for metadata that can be shared between titles
class Author(models.Model):
	full_name = models.CharField('Forfatter', max_length=200, unique=True)
	birth_year = models.DateField(null=True)
	def __str__(self):
		return self.full_name

class Publisher(models.Model):
	publisher = models.CharField('Udgiver', max_length=200, unique=True)
	def __str__(self):
		return self.publisher

class Keyword(models.Model):
	keyword = models.CharField('Nøgleord', max_length=200, unique=True)
	def __str__(self):
		return self.keyword
		
class TitleType(models.Model):
	title_type = models.CharField('Type', max_length=200, unique=True)
	def __str__(self):
		return self.title_type

class Language(models.Model):
	language = models.CharField('Sprog', max_length=50, unique=True)
	def __str__(self):
		return self.language

class Isbn(models.Model):
	isbn = ISBNField(null=True, blank=True)
	def __str__(self):
		return self.isbn

class Audience(models.Model):
	audience = models.CharField('Målgruppe', max_length=200, unique=True)
	def __str__(self):
		return self.audience
	
class TitleFormat(models.Model):
	title_format = models.CharField('Format', max_length=50, unique=True)
	def __str__(self):
		return self.title_format

# The main record for a title on eReolen, tying all the metadata together
class Title(models.Model):
	added = models.DateField()
	object_id = models.CharField('Ereolen-id', max_length=50, unique=True)
	title = models.CharField('Titel', max_length=500)
	original_title = models.CharField('Originaltitel', max_length=500, default="")
	publish_date = models.DateField(null=True)
	dk5 = models.CharField('DK5-kode', max_length=10, default="")
	cover_url = models.URLField('Cover-url', max_length=500, null=True)
	ereolen_url = models.URLField('Ereolen-url', max_length=500)
	abstract = models.TextField(blank=True)
	dkcclterm_op = models.DateField()
	publisher = models.ForeignKey(Publisher, on_delete=models.CASCADE)
	language = models.ForeignKey(Language, on_delete=models.CASCADE)
	title_type = models.ForeignKey(TitleType, on_delete=models.CASCADE)
	title_format = models.ForeignKey(TitleFormat, on_delete=models.CASCADE)
	author = models.ManyToManyField(Author)
	keyword = models.ManyToManyField(Keyword)
	audience = models.ManyToManyField(Audience)
	isbn = models.ManyToManyField(Isbn)

	def __str__(self):
		return self.title	
	def get_authors(self):
		return " & ".join([author.full_name for author in self.author.all()])
	get_authors.short_description = "Author(s)"
	def get_isbns(self):
		return ", ".join([isbn.isbn for isbn in self.isbn.all()])
	get_isbns.short_description = "ISBN(s)"	
	def get_keywords(self):
		return ", ".join([keyword.keyword for keyword in self.keyword.all()])
	get_keywords.short_description = "Keyword(s)"		
	def get_audiences(self):
		return ", ".join([audience.audience for audience in self.audience.all()])
	get_audiences.short_description = "Audience(s)"
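
Because each title stores both the date my robot first saw it (added) and the creation date from the search (dkcclterm_op), I can later see just how far apart the two dates are. A minimal sketch of the kind of query I can run in a Django shell; the app name "ereolen" matches the table names used further down, and the one-year threshold is only an example:

from datetime import date, timedelta
from ereolen.models import Title  # assuming the Django app is named "ereolen"

# Titles the robot discovered during last night's run
new_today = Title.objects.filter(added=date.today())

# Of those, the ones whose dkcclterm.op date is more than a year old
late_arrivals = [t for t in new_today if t.added - t.dkcclterm_op > timedelta(days=365)]

for t in late_arrivals:
	print(t.added, t.dkcclterm_op, t.get_authors(), t)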

In Python, I wrote a robot that crawls eReolen, adds new titles to my database and ignores titles that are already there. I set the robot up to run every night on my server:

# -*- coding: utf-8 -*-
# Author: Morten Helmstedt. E-mail: helmstedt@gmail.com
""" This program saves ebooks, audiobooks and podcasts from ereolen.dk to a local database
that can be used to detect new titles better than ereolen.dk's own search options """

import requests										# make http requests
from bs4 import BeautifulSoup						# parse html responses
from datetime import date							# create date objects
from dateutil.relativedelta import relativedelta	# adding and subtracting months to dates
import re											# regex for publish year parsing
import psycopg2										# work with postgresql databases
from psycopg2 import Error							# database error handling

# Connect to database
try:
	connection = psycopg2.connect(user = "",
									password = "",
									host = "",
									port = "",
									database = "")
	cursor = connection.cursor()
except (Exception, psycopg2.Error) as error:
	print("Error while connecting to PostgreSQL", error)

# Set configuration options and global variables
base_url = 'https://ereolen.dk'
term_types = ['ebog','lydbog','podcast']
added = date.today()
number_of_months_to_search = 200
start_month = added - relativedelta(months=number_of_months_to_search-2)

# Search period list goes from current month plus one month and back to start_month
search_period = []
for i in reversed(range(0,number_of_months_to_search)):
	year_month_date = start_month + relativedelta(months=+i)
	year_month = [year_month_date.year, year_month_date.month]
	search_period.append(year_month)

# Crawl loop
title_counter = 0
for year_month in search_period:
	for term_type in term_types:
		start_date = date(year_month[0],year_month[1],1)
		dkcclterm_op_search = start_date.strftime("%Y%m")
		page = 0
		pages_left = True
		while pages_left == True:
			# Search for hits
			search_url = base_url + '/search/ting/dkcclterm.op%3D' + dkcclterm_op_search + '*%20AND%20term.type%3D' + term_type + '?page=' + str(page) + '&sort=date_descending'
			request = requests.get(search_url)
			result = request.text
			# If an error message is returned in the search, either no results are left, or ereolen.dk is down for some reason
			# In this case, the while loop is broken to try next item type and/or next year-month combination
			if 'Vi kan desværre ikke finde noget, der matcher din søgning' in result or 'The website encountered an unexpected error. Please try again later.' in result:
				pages_left = False
				break
			# Parse hits and get all item links
			soup = BeautifulSoup(result, "lxml")
			links = soup.find_all('a', href=True)
			item_links = {link['href'] for link in links if "/ting/collection/" in link['href']}
			# Go through item links
			for link in item_links:
				# Get id and check if link is already in database
				object_id = link[link.rfind('/')+1:].replace('%3A',':')
				search_sql = '''SELECT * from ereolen_title WHERE object_id = %s'''
				cursor.execute(search_sql, (object_id, ))
				item_hit = cursor.fetchone()
				# No hits means item is not in database and should be added
				if not item_hit:
					### ADD SEQUENCE ###
					
					# Set full url for item					
					ereolen_url = base_url + link
					
					# Request item and parse html
					title_request = requests.get(ereolen_url)
					title_result = title_request.text
					title_soup = BeautifulSoup(title_result, "lxml")
					
					# TITLE FIELDS #
					
					# TITLE
					try:
						title = title_soup.find('div', attrs={'class':'field-name-ting-title'}).text.replace(" : ",": ")
					except:
						print("Ingen titel på:", ereolen_url)
						break	

					# ORIGINAL TITLE
					try:
						original_title = title_soup.find('div', attrs={'class':'field-label'}, string=re.compile("Original titel:")).next.next.text
					except:
						original_title = ''
					
					# PUBLISHED
					try:
						published = title_soup.find('div', class_={"field-name-ting-author"}).get_text()
						published = int(re.search(r"[(]\d\d\d\d[)]", published).group()[1:5])
						publish_date = date(published,1,1)
					except:
						publish_date = None

					# COVER URL
					try:
						cover_url = title_soup.find('div', class_={"ting-cover"}).img['src']
					except:
						try:
							data = {
							  'coverData[0][id]': object_id,
							  'coverData[0][image_style]': 'ding_primary_large'
							}
							response = requests.post('https://ereolen.dk/ting/covers', data=data)
							response_json = response.json()
							cover_url = response_json[0]['url']
						except:
							cover_url = ''
		
					# ABSTRACT
					abstract = title_soup.find('div', attrs={'class':'field-name-ting-abstract'}).text

					# DKCCLTERM_OP
					dkcclterm_op = start_date
					
					# FOREIGN KEY FIELDS #
					
					# LANGUAGE
					try:
						ereolen_language = title_soup.find('div', attrs={'class':'field-label'}, string=re.compile("Sprog:")).next.next.text
					except:
						ereolen_language = 'Ukendt'
					language_sql = '''SELECT * from ereolen_language WHERE language = %s'''
					cursor.execute(language_sql, (ereolen_language, ))
					try:
						language = cursor.fetchone()[0]
					except:
						language_insert = '''INSERT INTO ereolen_language(language) VALUES(%s) RETURNING id'''
						cursor.execute(language_insert, (ereolen_language, ))
						language = cursor.fetchone()[0]
	
					# PUBLISHER
					try:
						ereolen_publisher = title_soup.find('div', attrs={'class':'field-label'}, string=re.compile("Forlag:")).next.next.text
					except:
						ereolen_publisher = 'Ukendt'
					publisher_sql = '''SELECT * from ereolen_publisher WHERE publisher = %s'''
					cursor.execute(publisher_sql, (ereolen_publisher, ))
					try:
						publisher = cursor.fetchone()[0]
					except:
						publisher_insert = '''INSERT INTO ereolen_publisher(publisher) VALUES(%s) RETURNING id'''
						cursor.execute(publisher_insert, (ereolen_publisher, ))
						publisher = cursor.fetchone()[0]

					# TYPE
					try:
						ereolen_type = title_soup.find('div', attrs={'class':'field-label'}, string=re.compile("Type:")).next.next.text
					except:
						ereolen_type = 'Ukendt'
					type_sql = '''SELECT * from ereolen_titletype WHERE title_type = %s'''
					cursor.execute(type_sql, (ereolen_type, ))
					try:
						title_type = cursor.fetchone()[0]
					except:
						title_type_insert = '''INSERT INTO ereolen_titletype(title_type) VALUES(%s) RETURNING id'''
						cursor.execute(title_type_insert, (ereolen_type, ))
						title_type = cursor.fetchone()[0]

					# FORMAT
					try:
						ereolen_format = title_soup.find('div', attrs={'class':'field-label'}, string=re.compile("Ebogsformat:")).next.next.text
					except:
						ereolen_format = "Ukendt"
					format_sql = '''SELECT * from ereolen_titleformat WHERE title_format = %s'''
					cursor.execute(format_sql, (ereolen_format, ))
					try:
						title_format = cursor.fetchone()[0]
					except:
						title_format_insert = '''INSERT INTO ereolen_titleformat(title_format) VALUES(%s) RETURNING id'''
						cursor.execute(title_format_insert, (ereolen_format, ))
						title_format = cursor.fetchone()[0]				
					
					# DK5 - TODO: Not done yet
					dk5 = ""
					
					### SAVE BEFORE ADDING MANY-TO-MANY FIELDS ###
					title_data = (added,title_type,title,original_title,publisher,object_id,language,publish_date,cover_url,ereolen_url,title_format,abstract,dkcclterm_op,dk5)
					
					title_insert = '''INSERT INTO ereolen_title(added,title_type_id,title,original_title,publisher_id,object_id,language_id,publish_date,cover_url,ereolen_url,title_format_id,abstract,dkcclterm_op,dk5) VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s) RETURNING id'''
					cursor.execute(title_insert, title_data)
					title_id = cursor.fetchone()[0]
					connection.commit()
					
					# MANY-TO-MANY FIELDS #

					# AUDIENCE(S)
					try:
						audience_div = title_soup.find('div', attrs={'class':'field-label'}, string=re.compile("Målgruppe:")).next.next
						audiences = audience_div.find_all('span')
						audiences_list = [aud.text for aud in audiences]
					except:
						audiences_list = ['Ukendt']
					for audience in audiences_list:
						audience_sql = '''SELECT * from ereolen_audience WHERE audience = %s'''
						cursor.execute(audience_sql, (audience, ))
						try:
							audience_id = cursor.fetchone()[0]
						except:
							audience_insert = '''INSERT INTO ereolen_audience(audience) VALUES(%s) RETURNING id'''
							cursor.execute(audience_insert, (audience, ))
							audience_id = cursor.fetchone()[0]
						audience_relation_sql = '''INSERT INTO ereolen_title_audience (title_id, audience_id) VALUES (%s,%s)'''
						try:
							cursor.execute(audience_relation_sql, (title_id,audience_id))
						except:
							connection.rollback()
					
					# ISBN(S)
					try:
						isbn_div = title_soup.find('div', attrs={'class':'field-label'}, string=re.compile("ISBN:")).next.next
						isbns = isbn_div.find_all('span')
						isbns_list = [isb.text for isb in isbns]
						for isbn in isbns_list:
							isbn_sql = '''SELECT * from ereolen_isbn WHERE isbn = %s'''
							cursor.execute(isbn_sql, (isbn, ))
							try:
								isbn_id = cursor.fetchone()[0]
							except:
								isbn_insert = '''INSERT INTO ereolen_isbn(isbn) VALUES(%s) RETURNING id'''
								cursor.execute(isbn_insert, (isbn, ))
								isbn_id = cursor.fetchone()[0]
							isbn_relation_sql = '''INSERT INTO ereolen_title_isbn (title_id, isbn_id) VALUES (%s,%s)'''
							try:
								cursor.execute(isbn_relation_sql, (title_id,isbn_id))
							except:	
								connection.rollback()							
					except:
						pass
					
					# KEYWORD(S)
					keywords_div = title_soup.find('div', attrs={'class':'field-name-ting-subjects'})
					if keywords_div:
						keywords = [link.text for link in keywords_div.find_all('a')]
						for keyword in keywords:
							keyword_sql = '''SELECT * from ereolen_keyword WHERE keyword = %s'''
							cursor.execute(keyword_sql, (keyword, ))
							try:
								keyword_id = cursor.fetchone()[0]
							except:
								keyword_insert = '''INSERT INTO ereolen_keyword(keyword) VALUES(%s) RETURNING id'''
								cursor.execute(keyword_insert, (keyword, ))
								keyword_id = cursor.fetchone()[0]
							keyword_relation_sql = '''INSERT INTO ereolen_title_keyword (title_id, keyword_id) VALUES (%s,%s)'''
							try:
								cursor.execute(keyword_relation_sql, (title_id,keyword_id))						
							except:
								connection.rollback()	

					# AUTHOR(S)
					creator_full = title_soup.find('div', attrs={'class':'field-name-ting-author'}).text.replace("Af ","")
					# Remove date of book
					creator = creator_full[:creator_full.rfind("(")-1]
					
					authors = creator.split(",")
					
					for author in authors:
						birth_year = None
						if ' (f. ' in author and not len(author) < 7:
							if 'ca. ' in author:
								author = author.replace('ca. ','')
							birth_year_string = author[author.index("(f.")+4:author.index("(f.")+8]
							if ')' in birth_year_string:
								birth_year_string = birth_year_string.replace(')','')
							birth_year = date(int(birth_year_string),1,1)
							author = author[:author.index(" (f.")]
						elif ' (f. ' in author:
							breakpoint()
							
						# Sometimes there are no authors, but there is still a published year
						if len(author) == 5 and "(" in author:
							author = ""
						
						if author:
							author = author.strip()
							author_sql = '''SELECT * from ereolen_author WHERE full_name = %s'''
							cursor.execute(author_sql, (author, ))
							try:
								author_id = cursor.fetchone()[0]
							except:
								if birth_year:
									author_insert = '''INSERT INTO ereolen_author(full_name,birth_year) VALUES(%s,%s) RETURNING id'''
									cursor.execute(author_insert, (author,birth_year))
								else:
									author_insert = '''INSERT INTO ereolen_author(full_name) VALUES(%s) RETURNING id'''
									cursor.execute(author_insert, (author, ))
								author_id = cursor.fetchone()[0]
							author_relation_sql = '''INSERT INTO ereolen_title_author (title_id, author_id) VALUES (%s,%s)'''
							try:
								cursor.execute(author_relation_sql, (title_id,author_id))
							except:
								connection.rollback()
	
					### SAVE ###
					connection.commit()
					title_counter += 1
			page += 1
connection.close()
print('Ereolen crawl ran')
if title_counter > 0:
	print('Added titles on ereolen:', title_counter)
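
The morning mail is not sent by the script itself. The print statements at the end are enough, because the scheduler mails me whatever a job writes to standard output. A sketch of how that could be wired up with cron; the time, path and address below are placeholders, not my actual setup:

MAILTO=me@example.com
# Run the eReolen robot every night at 03:30.
# Anything the script prints (e.g. the number of new titles) is mailed to the address above.
30 3 * * * /usr/bin/python3 /home/me/ereolen_robot.py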

And in Django's built-in admin interface, with a good overview and decent options for searching, sorting and filtering, I can spot things like a short story collection by Georg Metz that has just turned up on eReolen with a “dkcclterm.op” value from September 2013!

My own private eReolen. Sorry it's a bit small.
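
The admin configuration itself is not shown above, but a registration along these lines is all it takes to get list columns, filters and search in Django's admin. This is a sketch: the field choices are illustrative, not a copy of my actual admin.py:

from django.contrib import admin
from .models import Title

@admin.register(Title)
class TitleAdmin(admin.ModelAdmin):
	# Columns in the title list; get_authors() is defined on the model above
	list_display = ('added', 'title', 'get_authors', 'title_type', 'language', 'dkcclterm_op')
	# Sidebar filters and free-text search
	list_filter = ('title_type', 'language', 'audience')
	search_fields = ('title', 'original_title', 'author__full_name', 'keyword__keyword')
	date_hierarchy = 'added'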

Can I try it?

I would love to share my tool with others, but it is not entirely straightforward to work out which parts of eReolen's book data are free and public, and which are owned by a (to my mind) somewhat odd construction called DBC: a company owned by KL (Kommunernes Landsforening, the association of Danish municipalities) that makes money selling data about books to, well, municipalities (and a few other customers that I am guessing are almost exclusively public).

I am looking into what I can publish without bothering anyone or breaking copyright law. It may take a while.