I like reading books, and I borrow many of them on eReolen. But for a book nerd, eReolen isn't very user-friendly. There are plenty of odd searches you can run if you're technical enough, but something as basic as seeing which books have actually just been added is hard to keep track of.
eReolen does have a section it calls "news" ("nyheder"), backed by a search that at the time of writing (February 2022) looks something like this:
(dkcclterm.op=202112* OR dkcclterm.op=202201*) AND term.type=ebog and facet.category=voksenmaterialer
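The month pattern suggests the query is generated from the current date: the two months preceding the current one. A small sketch of how such a query string could be built; this is my reconstruction for illustration, not eReolen's actual code:

```python
from datetime import date
from dateutil.relativedelta import relativedelta

def news_query(today: date) -> str:
    """Build an eReolen-style "news" query for the two months
    preceding the current one (pattern guessed from the example above)."""
    months = [today - relativedelta(months=2), today - relativedelta(months=1)]
    clauses = " OR ".join(f"dkcclterm.op={m.strftime('%Y%m')}*" for m in months)
    return f"({clauses}) AND term.type=ebog and facet.category=voksenmaterialer"

# In February 2022 this reproduces the query above
print(news_query(date(2022, 2, 15)))
```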
Looking up the description of the library data well's search indexes ("brøndindekser"), you can see what "dkcclterm.op" stands for:
dkcclterm.op | op | Oprettelsesdato (creation date) |
But why would a view of new titles search on creation dates in December and January? It's February now.
Because a title's "creation date" is not the same as the date the title was added to eReolen. What it actually refers to, I don't know for certain, but it's definitely not when the title appeared on eReolen.
This means that interesting books whose "dkcclterm.op" value lies far back in time can keep turning up.
And that means I risk missing something I'd like to read.
So what did I do?
I built my own eReolen! With a robot that checks every night which titles are actually new. Every morning an email is waiting for me with the number of titles the robot has found, and if I have the time and coffee for it, I can look through the new titles over breakfast.
It works like this:
In Django, I built a data model of titles with various metadata:
from django.db import models
from isbn_field import ISBNField

class Author(models.Model):
    full_name = models.CharField('Forfatter', max_length=200, unique=True)
    birth_year = models.DateField(null=True)

    def __str__(self):
        return self.full_name

class Publisher(models.Model):
    publisher = models.CharField('Udgiver', max_length=200, unique=True)

    def __str__(self):
        return self.publisher

class Keyword(models.Model):
    keyword = models.CharField('Nøgleord', max_length=200, unique=True)

    def __str__(self):
        return self.keyword

class TitleType(models.Model):
    title_type = models.CharField('Type', max_length=200, unique=True)

    def __str__(self):
        return self.title_type

class Language(models.Model):
    language = models.CharField('Sprog', max_length=50, unique=True)

    def __str__(self):
        return self.language

class Isbn(models.Model):
    isbn = ISBNField(null=True, blank=True)

    def __str__(self):
        return self.isbn

class Audience(models.Model):
    audience = models.CharField('Målgruppe', max_length=200, unique=True)

    def __str__(self):
        return self.audience

class TitleFormat(models.Model):
    title_format = models.CharField('Format', max_length=50, unique=True)

    def __str__(self):
        return self.title_format

class Title(models.Model):
    added = models.DateField()
    object_id = models.CharField('Ereolen-id', max_length=50, unique=True)
    title = models.CharField('Titel', max_length=500)
    original_title = models.CharField('Originaltitel', max_length=500, default="")
    publish_date = models.DateField(null=True)
    dk5 = models.CharField('DK5-kode', max_length=10, default="")
    cover_url = models.URLField('Cover-url', max_length=500, null=True)
    ereolen_url = models.URLField('Ereolen-url', max_length=500)
    abstract = models.TextField(blank=True)
    dkcclterm_op = models.DateField()
    publisher = models.ForeignKey(Publisher, on_delete=models.CASCADE)
    language = models.ForeignKey(Language, on_delete=models.CASCADE)
    title_type = models.ForeignKey(TitleType, on_delete=models.CASCADE)
    title_format = models.ForeignKey(TitleFormat, on_delete=models.CASCADE)
    author = models.ManyToManyField(Author)
    keyword = models.ManyToManyField(Keyword)
    audience = models.ManyToManyField(Audience)

    def __str__(self):
        return self.title

    def get_authors(self):
        return " & ".join([author.full_name for author in self.author.all()])
    get_authors.short_description = "Author(s)"

    def get_isbns(self):
        return ", ".join([isbn.isbn for isbn in self.isbn.all()])
    get_isbns.short_description = "ISBN(s)"

    def get_keywords(self):
        return ", ".join([keyword.keyword for keyword in self.keyword.all()])
    get_keywords.short_description = "Keyword(s)"

    def get_audiences(self):
        return ", ".join([audience.audience for audience in self.audience.all()])
    get_audiences.short_description = "Audience(s)"
In Python, I wrote a robot that crawls eReolen, adds new titles to my database and ignores titles that are already there. I set the robot up to run every night on my server:
# -*- coding: utf-8 -*-
# Author: Morten Helmstedt. E-mail: helmstedt@gmail.com
""" This program saves ebooks, audiobooks and podcasts from ereolen.dk to a local database
that can be used to detect new titles better than ereolen.dk's own search options """
import requests # make http requests
from bs4 import BeautifulSoup # parse html responses
from datetime import date # create date objects
from dateutil.relativedelta import relativedelta # adding and subtracting months to dates
import re # regex for publish year parsing
import psycopg2 # work with postgresql databases
from psycopg2 import Error # database error handling

# Connect to database
try:
    connection = psycopg2.connect(user="",
                                  password="",
                                  host="",
                                  port="",
                                  database="")
    cursor = connection.cursor()
except (Exception, psycopg2.Error) as error:
    print("Error while connecting to PostgreSQL", error)

# Set configuration options and global variables
base_url = 'https://ereolen.dk'
term_types = ['ebog','lydbog','podcast']
added = date.today()
number_of_months_to_search = 200
start_month = added - relativedelta(months=number_of_months_to_search-2)

# Search period list goes from current month plus one month and back to start_month
search_period = []
for i in reversed(range(0,number_of_months_to_search)):
    year_month_date = start_month + relativedelta(months=+i)
    year_month = [year_month_date.year, year_month_date.month]
    search_period.append(year_month)

# Crawl loop
title_counter = 0
for year_month in search_period:
    for term_type in term_types:
        start_date = date(year_month[0],year_month[1],1)
        dkcclterm_op_search = start_date.strftime("%Y%m")
        page = 0
        pages_left = True
        while pages_left == True:
            # Search for hits
            search_url = base_url + '/search/ting/dkcclterm.op%3D' + dkcclterm_op_search + '*%20AND%20term.type%3D' + term_type + '?page=' + str(page) + '&sort=date_descending'
            request = requests.get(search_url)
            result = request.text
            # If an error message is returned in the search, either no results are left, or ereolen.dk is down for some reason
            # In this case, the while loop is broken to try next item type and/or next year-month combination
            if 'Vi kan desværre ikke finde noget, der matcher din søgning' in result or 'The website encountered an unexpected error. Please try again later.' in result:
                pages_left = False
                break
            # Parse hits and get all item links
            soup = BeautifulSoup(result, "lxml")
            links = soup.find_all('a', href=True)
            item_links = {link['href'] for link in links if "/ting/collection/" in link['href']}
            # Go through item links
            for link in item_links:
                # Get id and check if link is already in database
                object_id = link[link.rfind('/')+1:].replace('%3A',':')
                search_sql = '''SELECT * from ereolen_title WHERE object_id = %s'''
                cursor.execute(search_sql, (object_id, ))
                item_hit = cursor.fetchone()
                # No hits means item is not in database and should be added
                if not item_hit:
                    ### ADD SEQUENCE ###
                    # Set full url for item
                    ereolen_url = base_url + link
                    # Request item and parse html
                    title_request = requests.get(ereolen_url)
                    title_result = title_request.text
                    title_soup = BeautifulSoup(title_result, "lxml")
                    # TITLE FIELDS #
                    # TITLE
                    try:
                        title = title_soup.find('div', attrs={'class':'field-name-ting-title'}).text.replace(" : ",": ")
                    except:
                        print("Ingen titel på:", ereolen_url)
                        break
                    # ORIGINAL TITLE
                    try:
                        original_title = title_soup.find('div', attrs={'class':'field-label'}, string=re.compile("Original titel:")).next.next.text
                    except:
                        original_title = ''
                    # PUBLISHED
                    try:
                        published = title_soup.find('div', class_={"field-name-ting-author"}).get_text()
                        published = int(re.search("[(]\d\d\d\d[)]", published).group()[1:5])
                        publish_date = date(published,1,1)
                    except:
                        publish_date = None
                    # COVER URL
                    try:
                        cover_url = title_soup.find('div', class_={"ting-cover"}).img['src']
                    except:
                        try:
                            data = {
                                'coverData[0][id]': object_id,
                                'coverData[0][image_style]': 'ding_primary_large'
                            }
                            response = requests.post('https://ereolen.dk/ting/covers', data=data)
                            response_json = response.json()
                            cover_url = response_json[0]['url']
                        except:
                            cover_url = ''
                    # ABSTRACT
                    abstract = title_soup.find('div', attrs={'class':'field-name-ting-abstract'}).text
                    # DKCCLTERM_OP
                    dkcclterm_op = start_date
                    # FOREIGN KEY FIELDS #
                    # LANGUAGE
                    try:
                        ereolen_language = title_soup.find('div', attrs={'class':'field-label'}, string=re.compile("Sprog:")).next.next.text
                    except:
                        ereolen_language = 'Ukendt'
                    language_sql = '''SELECT * from ereolen_language WHERE language = %s'''
                    cursor.execute(language_sql, (ereolen_language, ))
                    try:
                        language = cursor.fetchone()[0]
                    except:
                        language_insert = '''INSERT INTO ereolen_language(language) VALUES(%s) RETURNING id'''
                        cursor.execute(language_insert, (ereolen_language, ))
                        language = cursor.fetchone()[0]
                    # PUBLISHER
                    try:
                        ereolen_publisher = title_soup.find('div', attrs={'class':'field-label'}, string=re.compile("Forlag:")).next.next.text
                    except:
                        ereolen_publisher = 'Ukendt'
                    publisher_sql = '''SELECT * from ereolen_publisher WHERE publisher = %s'''
                    cursor.execute(publisher_sql, (ereolen_publisher, ))
                    try:
                        publisher = cursor.fetchone()[0]
                    except:
                        publisher_insert = '''INSERT INTO ereolen_publisher(publisher) VALUES(%s) RETURNING id'''
                        cursor.execute(publisher_insert, (ereolen_publisher, ))
                        publisher = cursor.fetchone()[0]
                    # TYPE
                    try:
                        ereolen_type = title_soup.find('div', attrs={'class':'field-label'}, string=re.compile("Type:")).next.next.text
                    except:
                        ereolen_type = 'Ukendt'
                    type_sql = '''SELECT * from ereolen_titletype WHERE title_type = %s'''
                    cursor.execute(type_sql, (ereolen_type, ))
                    try:
                        title_type = cursor.fetchone()[0]
                    except:
                        title_type_insert = '''INSERT INTO ereolen_titletype(title_type) VALUES(%s) RETURNING id'''
                        cursor.execute(title_type_insert, (ereolen_type, ))
                        title_type = cursor.fetchone()[0]
                    # FORMAT
                    try:
                        ereolen_format = title_soup.find('div', attrs={'class':'field-label'}, string=re.compile("Ebogsformat:")).next.next.text
                    except:
                        ereolen_format = "Ukendt"
                    format_sql = '''SELECT * from ereolen_titleformat WHERE title_format = %s'''
                    cursor.execute(format_sql, (ereolen_format, ))
                    try:
                        title_format = cursor.fetchone()[0]
                    except:
                        title_format_insert = '''INSERT INTO ereolen_titleformat(title_format) VALUES(%s) RETURNING id'''
                        cursor.execute(title_format_insert, (ereolen_format, ))
                        title_format = cursor.fetchone()[0]
                    # DK5 - TODO: Not done yet
                    dk5 = ""
                    ### SAVE BEFORE ADDING MANY-TO-MANY FIELDS ###
                    title_data = (added,title_type,title,original_title,publisher,object_id,language,publish_date,cover_url,ereolen_url,title_format,abstract,dkcclterm_op,dk5)
                    title_insert = '''INSERT INTO ereolen_title(added,title_type_id,title,original_title,publisher_id,object_id,language_id,publish_date,cover_url,ereolen_url,title_format_id,abstract,dkcclterm_op,dk5) VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s) RETURNING id'''
                    cursor.execute(title_insert, title_data)
                    title_id = cursor.fetchone()[0]
                    connection.commit()
                    # MANY-TO-MANY FIELDS #
                    # AUDIENCE(S)
                    try:
                        audience_div = title_soup.find('div', attrs={'class':'field-label'}, string=re.compile("Målgruppe:")).next.next
                        audiences = audience_div.find_all('span')
                        audiences_list = [aud.text for aud in audiences]
                    except:
                        audiences_list = ['Ukendt']
                    for audience in audiences_list:
                        audience_sql = '''SELECT * from ereolen_audience WHERE audience = %s'''
                        cursor.execute(audience_sql, (audience, ))
                        try:
                            audience_id = cursor.fetchone()[0]
                        except:
                            audience_insert = '''INSERT INTO ereolen_audience(audience) VALUES(%s) RETURNING id'''
                            cursor.execute(audience_insert, (audience, ))
                            audience_id = cursor.fetchone()[0]
                        audience_relation_sql = '''INSERT INTO ereolen_title_audience (title_id, audience_id) VALUES (%s,%s)'''
                        try:
                            cursor.execute(audience_relation_sql, (title_id,audience_id))
                        except:
                            connection.rollback()
                    # ISBN(S)
                    try:
                        isbn_div = title_soup.find('div', attrs={'class':'field-label'}, string=re.compile("ISBN:")).next.next
                        isbns = isbn_div.find_all('span')
                        isbns_list = [isb.text for isb in isbns]
                        for isbn in isbns_list:
                            isbn_sql = '''SELECT * from ereolen_isbn WHERE isbn = %s'''
                            cursor.execute(isbn_sql, (isbn, ))
                            try:
                                isbn_id = cursor.fetchone()[0]
                            except:
                                isbn_insert = '''INSERT INTO ereolen_isbn(isbn) VALUES(%s) RETURNING id'''
                                cursor.execute(isbn_insert, (isbn, ))
                                isbn_id = cursor.fetchone()[0]
                            isbn_relation_sql = '''INSERT INTO ereolen_title_isbn (title_id, isbn_id) VALUES (%s,%s)'''
                            try:
                                cursor.execute(isbn_relation_sql, (title_id,isbn_id))
                            except:
                                connection.rollback()
                    except:
                        pass
                    # KEYWORD(S)
                    keywords_div = title_soup.find('div', attrs={'class':'field-name-ting-subjects'})
                    if keywords_div:
                        keywords = [link.text for link in keywords_div.find_all('a')]
                        for keyword in keywords:
                            keyword_sql = '''SELECT * from ereolen_keyword WHERE keyword = %s'''
                            cursor.execute(keyword_sql, (keyword, ))
                            try:
                                keyword_id = cursor.fetchone()[0]
                            except:
                                keyword_insert = '''INSERT INTO ereolen_keyword(keyword) VALUES(%s) RETURNING id'''
                                cursor.execute(keyword_insert, (keyword, ))
                                keyword_id = cursor.fetchone()[0]
                            keyword_relation_sql = '''INSERT INTO ereolen_title_keyword (title_id, keyword_id) VALUES (%s,%s)'''
                            try:
                                cursor.execute(keyword_relation_sql, (title_id,keyword_id))
                            except:
                                connection.rollback()
                    # AUTHOR(S)
                    creator_full = title_soup.find('div', attrs={'class':'field-name-ting-author'}).text.replace("Af ","")
                    # Remove date of book
                    creator = creator_full[:creator_full.rfind("(")-1]
                    authors = creator.split(",")
                    for author in authors:
                        birth_year = None
                        if ' (f. ' in author and not len(author) < 7:
                            if 'ca. ' in author:
                                author = author.replace('ca. ','')
                            birth_year_string = author[author.index("(f.")+4:author.index("(f.")+8]
                            if ')' in birth_year_string:
                                birth_year_string = birth_year_string.replace(')','')
                            birth_year = date(int(birth_year_string),1,1)
                            author = author[:author.index(" (f.")]
                        elif ' (f. ' in author:
                            breakpoint()
                        # Sometimes there are no authors, but still a published year
                        if len(author) == 5 and "(" in author:
                            author = ""
                        if author:
                            author = author.strip()
                            author_sql = '''SELECT * from ereolen_author WHERE full_name = %s'''
                            cursor.execute(author_sql, (author, ))
                            try:
                                author_id = cursor.fetchone()[0]
                            except:
                                if birth_year:
                                    author_insert = '''INSERT INTO ereolen_author(full_name,birth_year) VALUES(%s,%s) RETURNING id'''
                                    cursor.execute(author_insert, (author,birth_year))
                                else:
                                    author_insert = '''INSERT INTO ereolen_author(full_name) VALUES(%s) RETURNING id'''
                                    cursor.execute(author_insert, (author, ))
                                author_id = cursor.fetchone()[0]
                            author_relation_sql = '''INSERT INTO ereolen_title_author (title_id, author_id) VALUES (%s,%s)'''
                            try:
                                cursor.execute(author_relation_sql, (title_id,author_id))
                            except:
                                connection.rollback()
                    ### SAVE ###
                    connection.commit()
                    title_counter += 1
            page += 1
connection.close()
print('Ereolen crawl ran')
if title_counter > 0:
    print('Added titles on ereolen:', title_counter)
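The scheduling itself isn't shown above. One common way to get both the nightly run and the morning status mail is a crontab entry with MAILTO set, so that whatever the script prints is mailed when it finishes. The paths and address here are placeholders, not taken from my actual setup:

```shell
# Example crontab (placeholders, adjust paths and address):
# any output from the script (the print statements above) is mailed here
MAILTO=me@example.com
# run the crawler every night at 02:30
30 2 * * * /usr/bin/python3 /home/me/ereolen/crawler.py
```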
And in Django's built-in admin interface, with a nice overview and good options for searching, sorting and filtering, I can spot a short story collection by Georg Metz that has just turned up on eReolen with a "dkcclterm.op" value from September 2013!
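An admin configuration along these lines gives that kind of overview. This is a minimal sketch of what an admin.py for the Title model could look like; the choice of columns and filters is my guess, not necessarily the actual configuration, though the get_authors helper is the one defined on the model above:

```python
# admin.py - a sketch of a possible configuration (columns and
# filters are illustrative choices, not taken from the post)
from django.contrib import admin
from .models import Title

@admin.register(Title)
class TitleAdmin(admin.ModelAdmin):
    # Columns in the change list; get_authors is defined on the Title model
    list_display = ('title', 'get_authors', 'added', 'dkcclterm_op', 'title_type')
    # Free-text search across titles and related author names
    search_fields = ('title', 'author__full_name')
    # Sidebar filters for narrowing down the list
    list_filter = ('title_type', 'language', 'added')
    # Newest additions first, with date drill-down at the top
    ordering = ('-added',)
    date_hierarchy = 'added'
```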

Can I try it?
I'd like to share my tool with others, but it's not entirely straightforward to work out which parts of eReolen's book data are free and public, and which are owned by a (in my eyes) slightly odd construction called DBC: a company owned by KL (Kommunernes Landsforening, the association of Danish municipalities) that makes money selling data about books to, well, municipalities (and a few other customers that I'd guess are almost exclusively public institutions).
I'm looking into what I can publish without bothering anyone or breaking copyright law. It may take a little while.