Invenio: print records in marcxml format using invenio API

Sometimes it is useful to get the MARCXML output of some records.

This example prints the records with recids 18007 to 18200 in MARCXML format using the Invenio API:

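from invenio.search_engine import print_record

# Accumulates the MARCXML ('xm' format) of every record
salida = ''

# range() excludes its upper bound, so use 18201 to include recid 18200
for recid in range(18007, 18201):
    #print "Record %s" % recid
    salida += print_record(recid, format='xm')

print salida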
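The loop shape can be tried without a running Invenio instance. The sketch below is only an illustration: `fake_print_record` and `dump_records` are hypothetical names, and the stand-in returns a made-up minimal record rather than real Invenio output. It is written for a current Python 3 interpreter, unlike the Python 2 snippet above.

```python
def fake_print_record(recid, format="xm"):
    """Hypothetical stand-in for invenio.search_engine.print_record."""
    return '<record><controlfield tag="001">%d</controlfield></record>\n' % recid


def dump_records(first, last, printer=fake_print_record):
    """Concatenate the output for recids first..last, both included."""
    # range() excludes its upper bound, hence last + 1
    return "".join(printer(recid, format="xm") for recid in range(first, last + 1))


output = dump_records(18007, 18200)
print(output.count("<record>"))  # number of records that were dumped
```

With a real installation you would pass `print_record` as the `printer` argument instead of the stand-in.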

Invenio 1: BibRank exception

I was getting several exceptions in Bibrank:

* 2015-01-15 08:32:17 -> NoOptionError: No option 'citation_loss_limit' in section: 'citation' (ConfigParser.py:618:get)

** User details
No client information available

** Traceback details 

Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/invenio/bibtask.py", line 606, in task_init
    ret = _task_run(task_run_fnc)
  File "/usr/lib64/python2.7/site-packages/invenio/bibtask.py", line 1146, in _task_run
    if callable(task_run_fnc) and task_run_fnc():
  File "/usr/lib64/python2.7/site-packages/invenio/bibrank.py", line 159, in task_run_core
    func_object(key)
  File "/usr/lib64/python2.7/site-packages/invenio/bibrank_tag_based_indexer.py", line 443, in citation
    return bibrank_engine(run)
  File "/usr/lib64/python2.7/site-packages/invenio/bibrank_tag_based_indexer.py", line 356, in bibrank_engine
    func_object(rank_method_code, cfg_name, config)
  File "/usr/lib64/python2.7/site-packages/invenio/bibrank_tag_based_indexer.py", line 68, in citation_exec
    dic, index_update_time = get_citation_weight(rank_method_code, config)
  File "/usr/lib64/python2.7/site-packages/invenio/bibrank_citation_indexer.py", line 141, in get_citation_weight
    weights = process_and_store(updated_recids, config, chunk_size)
  File "/usr/lib64/python2.7/site-packages/invenio/bibrank_citation_indexer.py", line 157, in process_and_store
    citation_loss_limit = int(config.get(function, "citation_loss_limit"))
  File "/usr/lib64/python2.7/ConfigParser.py", line 618, in get
    raise NoOptionError(option, section)
NoOptionError: No option 'citation_loss_limit' in section: 'citation'

** Stack frame details

I solved this by adding citation_loss_limit = 50 to /opt/invenio/etc/bibrank/citation.cfg. I also included some more options:

[...]
reference_via_doi= 999C5a
reference_via_record_id= 990C50
reference_via_isbn= 999C5i
[...]
citation_loss_limit = 50
collections =

After that, the exception was gone 🙂
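To see why the traceback ends in NoOptionError, and to check a config file before BibRank trips over it, here is a minimal sketch. It uses Python 3's `configparser` (the traceback above comes from the Python 2 `ConfigParser` module), and the helper names and the inline sample config are my own assumptions, not part of Invenio.

```python
import configparser

# Sample config WITHOUT citation_loss_limit, as in the failing setup
CFG_TEXT = """
[citation]
reference_via_doi = 999C5a
"""


def read_citation_loss_limit(cfg_text, default=50):
    """Read citation_loss_limit, falling back to a default if it is absent."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    # fallback= avoids NoOptionError when the option is missing
    return parser.getint("citation", "citation_loss_limit", fallback=default)


def missing_options(cfg_text, required=("citation_loss_limit",)):
    """List the required options that the [citation] section lacks."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return [opt for opt in required if not parser.has_option("citation", opt)]


print(missing_options(CFG_TEXT))
print(read_citation_loss_limit(CFG_TEXT))
```

BibRank itself calls `config.get()` without a fallback, so the option still has to be present in citation.cfg; the sketch only helps to spot what is missing.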

[SOLVED] Fix apple-touch-icon 404 errors

Some days ago I was checking the AWStats reports for my Invenio site and noticed some unexpected 404 errors. Visitors were trying to load URLs like /iphone or /m, as well as some images that are not linked anywhere on my site ('apple-touch-icon.png' and similar filenames).


This is caused by bots that assume my site has a mobile version and then try to guess its location. In the requests mentioned above, the bot looks first for an "apple-touch icon" and then for mobile content in various directories.

But what about those images? Have a read of: http://www.computerhope.com/jargon/a/appletou.htm

Similar to the Favicon, the apple-touch-icon.png is a file used for a web page icon on the Apple iPhone, iPod Touch, and iPad. When someone bookmarks your web page or adds your web page to their home screen this icon is used. If this file is not found these Apple products will use the screen shot of the web page, which often looks like no more than a white square.

This file should be saved as a .png, have dimensions of 57 x 57, and be stored in your home directory, unless the path is specified in the HTML using the below code.

When this file is used, by default, the Apple product will automatically give the icon rounded edges and a button-like appearance.

I wanted to fix this, so I began by testing if mod_rewrite was enabled…

[root@aneto www]# grep -R "mod_rewrite" /etc/httpd/conf/
/etc/httpd/conf/httpd.conf:LoadModule rewrite_module modules/mod_rewrite.so

The LoadModule line is uncommented, so it is enabled.

Next step would be to try a basic redirection to test mod_rewrite.

Edit $PATH_TO_INVENIO/etc/apache/invenio-apache-vhost.conf and add these lines inside the VirtualHost section:

<IfModule mod_rewrite.c>
           RewriteEngine  On
           RewriteLog "/home/apache/rewrite.log"
           RewriteLogLevel 9
           RewriteRule old.html bar.html [R]
</IfModule>

Then restart apache…

[root@aneto ~]# /etc/init.d/httpd  restart

Then open your browser and test the redirection. You should be redirected and the /home/apache/rewrite.log should log that redirection…

[root@aneto ~]# tail -n20 -f /home/apache/rewrite.log
 
155.210.47.93 - - [05/Jul/2013:12:21:06 +0200] [155.210.47.102/sid#2b689442f0c8][rid#2b68947a7380/initial] (2) init rewrite engine with requested uri /old.html
155.210.47.93 - - [05/Jul/2013:12:21:06 +0200] [155.210.47.102/sid#2b689442f0c8][rid#2b68947a7380/initial] (3) applying pattern 'old.html' to uri '/old.html'
155.210.47.93 - - [05/Jul/2013:12:21:06 +0200] [155.210.47.102/sid#2b689442f0c8][rid#2b68947a7380/initial] (2) rewrite '/old.html' -> 'bar.html'
155.210.47.93 - - [05/Jul/2013:12:21:06 +0200] [155.210.47.102/sid#2b689442f0c8][rid#2b68947a7380/initial] (2) explicitly forcing redirect with http://155.210.47.102/bar.html
155.210.47.93 - - [05/Jul/2013:12:21:06 +0200] [155.210.47.102/sid#2b689442f0c8][rid#2b68947a7380/initial] (1) escaping http://155.210.47.102/bar.html for redirect
155.210.47.93 - - [05/Jul/2013:12:21:06 +0200] [155.210.47.102/sid#2b689442f0c8][rid#2b68947a7380/initial] (1) redirect to http://155.210.47.102/bar.html [REDIRECT/302]
155.210.47.93 - - [05/Jul/2013:12:21:06 +0200] [155.210.47.102/sid#2b46e1df80d8][rid#2b46eac962e0/initial] (2) init rewrite engine with requested uri /bar.html
155.210.47.93 - - [05/Jul/2013:12:21:06 +0200] [155.210.47.102/sid#2b46e1df80d8][rid#2b46eac962e0/initial] (3) applying pattern 'old.html' to uri '/bar.html'
155.210.47.93 - - [05/Jul/2013:12:21:06 +0200] [155.210.47.102/sid#2b46e1df80d8][rid#2b46eac962e0/initial] (1) pass through /bar.html

Now that we know that mod_rewrite is working properly, let's add some rules to forbid some URL patterns:

<IfModule mod_rewrite.c>
           RewriteEngine  On
           RewriteLog "/home/apache/rewrite.log"
           RewriteLogLevel 9
           #RewriteRule old.html bar.html [R]
           RewriteCond %{REQUEST_URI} /iphone/?$ [NC,OR]
           RewriteCond %{REQUEST_URI} /mobile/?$ [NC,OR]
           RewriteCond %{REQUEST_URI} /mobi/?$ [NC,OR]
           RewriteCond %{REQUEST_URI} /m/?$ [NC]
           RewriteRule (.*) - [F,L]
</IfModule>

This technique is useful for saving bandwidth and server resources, not just for non-existent mobile-ish requests, but also for any resource that you would like to block – just add a RewriteCond with the target character string of your choice. Hopefully this technique will help you run a cleaner, safer, and more secure website.
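Before reloading Apache, it can help to check which request paths the four RewriteCond patterns would actually catch. The following is a rough sketch in Python, not an Apache tool: it assumes Python's `re` behaves like Apache's regex engine for these simple patterns, and `would_be_forbidden` is a hypothetical helper name.

```python
import re

# Same patterns as the RewriteCond lines; [NC] means case-insensitive
BLOCKED_PATTERNS = [r"/iphone/?$", r"/mobile/?$", r"/mobi/?$", r"/m/?$"]
BLOCKED_RE = [re.compile(p, re.IGNORECASE) for p in BLOCKED_PATTERNS]


def would_be_forbidden(request_uri):
    """Return True if any pattern matches, mirroring the [OR] chain."""
    # The patterns are not anchored at the start, so search() is used, not match()
    return any(rx.search(request_uri) for rx in BLOCKED_RE)


for uri in ["/iphone", "/Mobile/", "/m", "/record/18007", "/forum/m/"]:
    print(uri, would_be_forbidden(uri))
```

Note that the patterns are only anchored at the end, so a nested path such as /forum/m/ is caught as well. Prefix a pattern with ^ if you only want to block the top-level directory.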

Now, what to do with those apple-touch-icon-precomposed.png and similar images which are ending in 404 errors?

First, read the full Apple documentation about this issue.

Then you can fix it in several ways:

1) Search for those images and download them to /soft/cds-invenio/var/www/

cd /soft/cds-invenio/var/www/
wget http://gwt-touch.googlecode.com/svn-history/r86/trunk/demo-ipad-settings/war/apple-touch-icon-precomposed.png

Or you can create some personalized images using online services like http://iconifier.net/


And you're ready to go! The logs won't show those ugly 404 errors from now on, and visitors using iPhones will be happier 🙂

CDS Invenio 0.99.X: inveniogc ERROR [SOLVED]

Some days ago I noticed there was something wrong with inveniogc. Every time I ran inveniogc -a I got errors like:

2013-04-17 08:31:30 --> 2013-04-17 08:31:30 --> Updating task status to ERROR.
2013-04-17 08:31:30 --> Task #21731 finished. [ERROR]

Calling inveniogc with verbose level 9 gave me some more information (in the var/log/bibsched_task_XXXX.log and .err files):

2013-04-17 08:29:51 --> - deleting queries not attached to any user
 
2013-04-17 08:29:51 -->   SELECT DISTINCT q.id
  FROM query AS q LEFT JOIN user_query AS uq
  ON uq.id_query = q.id
  WHERE uq.id_query IS NULL AND
  q.type <> 'p' 
 
2013-04-17 08:31:30 --> 2013-04-17 08:31:30 --> Updating task status to ERROR.
2013-04-17 08:31:30 --> Task #21731 finished. [ERROR]

The issue arose when inveniogc tried to delete user queries not attached to any user. I edited lib/python/invenio/inveniogc.py and noticed that the error was produced when the result of that query was printed. I just commented that line out and inveniogc works again:

write_message("""  SELECT DISTINCT q.id\n  FROM query AS q LEFT JOIN user_query AS uq\n  ON uq.id_query = q.id\n  WHERE uq.id_query IS NULL AND\n  q.type <> 'p' """, verbose=9)
result = run_sql("""SELECT DISTINCT q.id
                    FROM query AS q LEFT JOIN user_query AS uq
                    ON uq.id_query = q.id
                    WHERE uq.id_query IS NULL AND
                          q.type <> 'p'""")
 
# write_message(result, verbose=9)

Why is this? It seems that the output buffer write_message uses is too small to hold the result of the previous query, so it fails…
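If you would rather keep some trace of the result in the log than drop the message altogether, one option is to log a short summary instead of the whole result tuple. This is only a sketch: `summarize_rows` is a hypothetical helper of mine, and whether `write_message` copes with its output in your installation is an assumption you should test.

```python
def summarize_rows(rows, max_items=5):
    """Return a short, loggable description of a query result."""
    # each row is a tuple; the recid or query id is in the first column
    ids = [row[0] for row in rows]
    shown = ", ".join(str(i) for i in ids[:max_items])
    if len(ids) > max_items:
        shown += ", ..."
    return "%d rows: [%s]" % (len(ids), shown)


# Fake result shaped like what run_sql returns: a tuple of 1-tuples
fake_result = tuple((i,) for i in range(1, 100001))
print(summarize_rows(fake_result))
```

In inveniogc.py the commented-out line could then become something like write_message(summarize_rows(result), verbose=9).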

CDS Invenio: batch delete records or interval of records (from python interpreter)

Some time ago I came up with this little hack to give Invenio the ability to delete a record from the command line.

If you need to delete a lot of records (e.g. in your testing/development server), you can add this other hack to bibeditcli.py:

Delete several records from invenio: the dirty way

This works, but it is not necessarily the way to go. There is another way to achieve the same result (records deleted) that does not overload BibSched with a task for each record. We'll go over that one later, though.

First things first: let's go the dirrrrrty way:

def cli_delete_interval(recid_inicio, recid_fin):
    """
    Delete records from recid_inicio to recid_fin, both included
    You'd better make sure...
    """
    try:
        recid_inicio = int(recid_inicio)
    except ValueError:
        print "ERROR: First Record ID must be integer, not %s:" %recid_inicio
        sys.exit(1)
    try:
        recid_fin = int(recid_fin)
    except ValueError:
        print "ERROR: End record ID must be integer, not %s." %recid_fin
        sys.exit(1)
 
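    if recid_inicio > recid_fin:
        print "ERROR: First record ID must not be greater than last record ID."
        sys.exit(1)

    # range() excludes its upper bound; add 1 so recid_fin is included
    for recid in range(recid_inicio, recid_fin + 1):
        # fetch the record and flag it as deleted (980__c = DELETED)
        (record, junk) = get_record(CFG_SITE_LANG, recid, 0, "false")
        add_field(recid, 0, record, "980", "", "", "c", "DELETED")
        # every record gets its own tmp file and is saved on its own
        save_temp_record(record, 0, "%s.tmp" % get_file_path(recid))
        save_xml_record(recid)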

This is how you call this new function from python.
First, navigate to $PATH_TO_INVENIO/lib/python and run your python interpreter

[miguel@mydevinvenioinstance ~]# cd /soft/cds-invenio/lib/
[miguel@mydevinvenioinstance lib]# python
[GCC 4.1.2 20080704 (Red Hat 4.1.2-50)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>

And then, just…

>>> import invenio
>>> from invenio.bibeditcli import cli_delete_interval
>>> # the following line will delete records from ID=5125 to ID=7899 .... 
>>> # BE CAREFUL! GREAT POWER COMES WITH GREAT RESPONSIBILITY
>>> 
>>> cli_delete_interval(5125,7899)

Delete several records from Invenio: the not-so-dirty way

If you take a look at the cli_delete_interval we just came up with, or run it over a big interval, you will see that it generates a whole lot of tmp files and sends a lot of tasks to bibsched (one for every record). Not efficient. Not nice.

This code is better. Just one tmp file and one single task sent to bibsched. The cleanup line at the end is commented out, so remove the tmp file yourself once the task has run.
Please note the # EDIT HERE! part near the top of the function.

def cli_delete_interval(recid_inicio, recid_fin):
    """
    By: Miguel Martin 20120130 
    Goal:
      Delete records from recid_inicio to recid_fin, both included
      Creates just a tmp file and a task (just one) is sent to bibsched
    """
 
    from invenio.bibrecord import record_xml_output
    from invenio.bibtask import task_low_level_submission
 
    # EDIT HERE! FILEPATH MUST BE READABLE/WRITABLE! ######
    tmpfile = "/home/miguelm/tmp/borrado.xml" 
    # #####################################################
 
    try:
        recid_inicio = int(recid_inicio)
    except ValueError:
        print "ERROR: First Record ID must be integer, not %s:" %recid_inicio
        sys.exit(1)
    try:
        recid_fin = int(recid_fin)
    except ValueError:
        print "ERROR: End record ID must be integer, not %s." %recid_fin
        sys.exit(1)
 
    if recid_inicio > recid_fin:
        print "ERROR: First record ID must not be greater than last record ID."
        sys.exit(1)

    # write every flagged record into one single file
    fd = open(tmpfile, "w")
    # range() excludes its upper bound; add 1 so recid_fin is included
    for recid in range(recid_inicio, recid_fin + 1):
        (record, junk) = get_record(CFG_SITE_LANG, recid, 0, "false")
        add_field(recid, 0, record, "980", "", "", "c", "DELETED")
        fd.write(record_xml_output(record))

    fd.close()
    task_low_level_submission('bibupload', 'bibedit', '-P', '5', '-r', '%s' % tmpfile)
    #os.system("rm %s" % tmpfile)
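To get a feeling for what such a single upload file can look like, here is a sketch that builds one MARCXML document flagging an interval of recids as DELETED, using only the Python 3 standard library. It is not the function above: `build_deletion_marcxml` is a hypothetical name, the records are minimal (only 001 and 980 $c) instead of the full records that get_record returns, and whether bibupload accepts such minimal records for deletion in your Invenio version is an assumption to verify on a test server first.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"


def build_deletion_marcxml(first, last):
    """Return MARCXML flagging recids first..last (both included) as DELETED."""
    if first > last:
        raise ValueError("first recid must not be greater than last recid")
    collection = ET.Element("collection", {"xmlns": MARC_NS})
    # range() excludes its upper bound, hence last + 1
    for recid in range(first, last + 1):
        record = ET.SubElement(collection, "record")
        ET.SubElement(record, "controlfield", {"tag": "001"}).text = str(recid)
        field = ET.SubElement(
            record, "datafield", {"tag": "980", "ind1": " ", "ind2": " "}
        )
        ET.SubElement(field, "subfield", {"code": "c"}).text = "DELETED"
    return ET.tostring(collection, encoding="unicode")


xml_text = build_deletion_marcxml(5125, 5127)
print(xml_text.count("<record>"))  # one <record> element per recid
```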

Cheers!

— update 20130628 —

Delete all the records

If you want to wipe out all the existing bibliographic content of your site, for example to start uploading the documents from scratch again, you can launch:

       $ /opt/invenio/bin/dbexec < /opt/invenio/src/invenio-0.90/modules/miscutil/sql/tabbibclean.sql
       $ rm -rf /opt/invenio/var/data/files/*
       $ /opt/invenio/bin/webcoll
       $ /opt/invenio/bin/bibindex --reindex

Invenio 1: webaccess ‘become user’ does not work [BUG + PATCH]

The CDS Invenio WebAccess module offers a useful "become user" functionality, which allows users who have been granted that privilege to become another user (just as if they had logged in with that account).

This worked like a charm in previous versions of Invenio but, as stated in 'WebAccess: fix becomeuser exception', it still does not work in Invenio 1.0.

I could not wait for the Invenio guys to fix this, so I implemented a (dirty-but-working) patch myself.

First, edit the becomeuser function (defined in webaccessadmin.py):
From:

def becomeuser(req, userID='', callback='yes', confirm=0, ln=CFG_SITE_LANG):

To:

def becomeuser(req, userID='', callback='yes', confirm=0, ln=CFG_SITE_LANG, rand='', email_user_pattern='', limit_to='',maxpage=''):

Then edit webuser.py:
Locate this:

try:
    from invenio.session import get_session

And change it to:

try:
    from invenio.session import get_session, save

Also change the setUid function (also defined in webuser.py)
From:

def setUid(req, uid, remember_me=False):
    """It sets the userId into the session, and raise the cookie to the client.
    """
    if hasattr(req, '_user_info'):
        del req._user_info
    session = get_session(req)
    session.invalidate()
    session = get_session(req)
    session['uid'] = uid
    if remember_me:
        session.set_timeout(86400)
    session.set_remember_me(remember_me)
    if uid > 0:
        user_info = collect_user_info(req, login_time=True)
        session['user_info'] = user_info
        req._user_info = user_info
    else:
        del session['user_info']
 
    return uid

To:

def setUid(req, uid, remember_me=False):
    """It sets the userId into the session, and raise the cookie to the client.
    """
    if hasattr(req, '_user_info'):
        del req._user_info
    session = get_session(req)
    session.invalidate()
    session = get_session(req)
    session['uid'] = uid
    if remember_me:
        session.set_timeout(86400)
    session.set_remember_me(remember_me)
    if uid > 0:
        user_info = collect_user_info(req, login_time=True)
        session['user_info'] = user_info
        req._user_info = user_info
    else:
        del session['user_info']
 
    # THE FIX -----------------------
    session.save()
    # -------------------------------
    return uid

Last step, as usual…

sudo -u www-data /opt/invenio/bin/inveniocfg --update-all; /etc/init.d/apache2 restart;

Invenio 1 API: list all restricted records and (restricted/public) collections

Let's take a quick look at some of the functions provided by the Invenio 1 search_engine API:

Get all the collection names…

>>> import invenio.search_engine
>>> collection_reclist_cache = invenio.search_engine.CollectionRecListDataCacher()
>>> print collection_reclist_cache.cache.keys()

Get all the restricted collections (the ones which have access restrictions configured via webaccess viewrestrcoll authorizations)

>>> import invenio.search_engine
>>> restricted_collection_cache = invenio.search_engine.RestrictedCollectionDataCacher()
>>> print restricted_collection_cache.cache

Get all the records which belong to restricted collections

>>> import invenio.search_engine
>>> restricted_collection_cache = invenio.search_engine.RestrictedCollectionDataCacher()
>>> for collection in restricted_collection_cache.cache:
...      print "Collection: '%s'" % collection
...      print "Records: '%r'" % invenio.search_engine.get_collection_reclist(collection)

CDS Invenio 1: Understanding viewrestrcoll action and fixing websearch_engine bug

Just discovered a bug in CDS Invenio 1.0’s search_engine.

The bug

Suppose you define a new authorization for viewing a collection that does not exist, for instance:
action = viewrestrcoll, role=EXAMPLEROLE, with parameter collection='COLLNAMEWHICHDOESNOTEXIST'

Then, when a guest user (who does not belong to EXAMPLEROLE) tries to view a record which belongs to a restricted collection, a 500 Server Error is thrown:

Traceback (most recent call last):
  File "/usr/local/lib/python2.6/dist-packages/invenio/webinterface_handler.py", line 413, in _handler
    return root._traverse(req, path, False, guest_p)
  File "/usr/local/lib/python2.6/dist-packages/invenio/webinterface_handler.py", line 250, in _traverse
    result = _check_result(req, obj(req, form))
  File "/usr/local/lib/python2.6/dist-packages/invenio/websearch_webinterface.py", line 1035, in __call__
    (auth_code, auth_msg) = check_user_can_view_record(user_info, self.recid)
  File "/usr/local/lib/python2.6/dist-packages/invenio/search_engine.py", line 292, in check_user_can_view_record
    restricted_collections = get_restricted_collections_for_recid(recid, recreate_cache_if_needed=False)
  File "/usr/local/lib/python2.6/dist-packages/invenio/search_engine.py", line 241, in get_restricted_collections_for_recid
    return [collection for collection in restricted_collection_cache.cache if recid in get_collection_reclist(collection, recreate_cache_if_needed=False)]
  File "/usr/local/lib/python2.6/dist-packages/invenio/search_engine.py", line 390, in get_collection_reclist
    if not collection_reclist_cache.cache[coll]:
KeyError: 'COLLNAMEWHICHDOESNOTEXIST'

Why does this happen? restricted_collection_cache contains 'COLLNAMEWHICHDOESNOTEXIST', but collection_reclist_cache.cache does not…

>>> import invenio.search_engine
>>> restricted_collection_cache = invenio.search_engine.RestrictedCollectionDataCacher()
>>> print restricted_collection_cache.cache
['Theses', 'Atlantis Times Drafts', ... , 'COLLNAMEWHICHDOESNOTEXIST']
 
>>> collection_reclist_cache = invenio.search_engine.CollectionRecListDataCacher()
>>> print collection_reclist_cache.cache
{'Autenticacion': None, 'Poetry': None, 'normativa': None, 'otros documetos de sistemas': None, 'Concursos y compras': None, 'solaris': None, 'Theoretical Physics (TH)': None, 'Atlantis Institute Books': None, 'Servidores web': None, 'Equipos de Red': None, 'Compras ': None, 'Sql ': None, 'Virtualizaci\xc3\xb3n': None, 'postgres': None, 'CERN Experiments': None, 'Impresoras': None, 'Gestion': None, 'Atlantis Times Science': None, 'Telefonia': None, 'Cabinas Disco': None, 'Correo': None, 'ngix': None, 'Experimental Physics (EP)': None, 'otros documentos de comunicaciones': None, 'otras': None, 'Nominas': None, 'Bases de datos': None, 'ISOLDE': None, 'Administraci\xc3\xb3n': None, 'Atlantis Times News': None, 'otros SO': None, 'jboss': None, 'Ingres': None, 'Multimedia & Arts': None, 'Otros lenguajes de programaci\xc3\xb3n': None, 'GlassFish ': None, 'BEA WebLogic': None, 'Ordenadores Personales': None, 'apache': None, 'Servidores y web y de aplicaciones': None, 'c': None, 'Compras Licencias de Software y Mantenimiento': None, 'Videos': None, 'Atlantis Times Drafts': None, 'Lenguajes de Programaci\xc3\xb3n': None, 'Otros Gestores de Bases de Datos': None, 'recursos humanos': None, 'Atlantis Institute Articles': None, 'Comunicaciones': None, 'Pictures': None, 'Gesti\xc3\xb3n Economica': None, 'Ordenacion Docente': None, 'Administraci\xc3\xb3n Linux': None, 'Articles & Preprints': None, 'otros': None, 'Atlantis Times Arts': None, 'Otros documentos de op': None, 'Administracion Solaris': None, 'Seguridad': None, 'perl': None, 'Atlantis Times': None, 'otros documentos de gestion': None, 'procedimientos': None, 'Gestion de la Investigaci\xc3\xb3n': None, 'Tomcat': None, 'Theses': None, 'Aulas': None, 'Oracle': None, 'Backup': None, 'geronimo': None, 'Sistemas': None, 'Books': None, 'ALEPH': None, 'Publico': None, 'BASICO': None, 'Maquinas': None, 'otros Servidores de apicaciones /web': None, 'cableado': None, 'Sistemas Operativos': None, 'Servidores web y de 
aplicaciones': None, 'Mantenimeinto': None, 'Books & Reports': None, 'Gesti\xc3\xb3n Acad\xc3\xa9mica': None, 'CERN Divisions': None, 'Otros documetos de Administraci\xc3\xb3n': None, 'Mac': None, 'Atlantis Institute of Fictive Science': None, 'Dataware': None, 'Articles': None, 'Windows': None, 'Mysql': None, 'Reports': None, 'docencia Virtual': None, 'php': None, 'Preprints': None, 'Java ': None, 'Servicio Web': None, 'wifi': None, 'Python': None, 'Linux': None, 'Incidencias': None, 'Shell Scripts ': None, 'Listas de distribuci\xc3\xb3n': None}
>>>

The fix

This is ugly behaviour, as webaccess admins can easily make typos when defining the authorizations, and I do not want that 500 Server Error to be displayed. So, let's fix things up.

Here is the get_collection_reclist function as defined in Invenio 1.0 (originally):

def get_collection_reclist(coll, recreate_cache_if_needed=True):
    """Return hitset of recIDs that belong to the collection 'coll'."""
    if recreate_cache_if_needed:
        collection_reclist_cache.recreate_cache_if_needed()
    if not collection_reclist_cache.cache[coll]:
        # not yet it the cache, so calculate it and fill the cache:
        set = intbitset()
        query = "SELECT nbrecs,reclist FROM collection WHERE name=%s"
        res = run_sql(query, (coll, ), 1)
        if res:
            try:
                set = intbitset(res[0][1])
            except:
                pass
        collection_reclist_cache.cache[coll] = set
    # finally, return reclist:
    return collection_reclist_cache.cache[coll]

And here is the patched version (it could also be implemented as a try/except KeyError)…

def get_collection_reclist(coll, recreate_cache_if_needed=True):
    """Return hitset of recIDs that belong to the collection 'coll'."""
    if recreate_cache_if_needed:
        collection_reclist_cache.recreate_cache_if_needed()
 
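    # ----- fix ---------------------------------------
    # Unknown collection name: return an empty hitset (not None), so that
    # callers doing "recid in get_collection_reclist(...)" keep working
    if coll not in collection_reclist_cache.cache:
        return intbitset()
    # -------------------------------------------------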
 
    if not collection_reclist_cache.cache[coll]:
        # not yet it the cache, so calculate it and fill the cache:
        set = intbitset()
        query = "SELECT nbrecs,reclist FROM collection WHERE name=%s"
        res = run_sql(query, (coll, ), 1)
        if res:
            try:
                set = intbitset(res[0][1])
            except:
                pass
        collection_reclist_cache.cache[coll] = set
    # finally, return reclist:
    return collection_reclist_cache.cache[coll]
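The try/except KeyError variant mentioned above could look roughly like the sketch below. It runs on its own under Python 3 and does not touch Invenio: `FakeCache` and `load_reclist_from_db` are hypothetical stand-ins for collection_reclist_cache and the SQL lookup, and plain Python sets replace intbitset.

```python
class FakeCache(object):
    """Hypothetical stand-in for collection_reclist_cache."""

    def __init__(self, names):
        # collection name -> None until its reclist has been computed
        self.cache = dict.fromkeys(names)


collection_reclist_cache = FakeCache(["Theses", "Books"])


def load_reclist_from_db(coll):
    """Hypothetical stand-in for the SQL lookup; returns a set of recids."""
    return {"Theses": {1, 2, 3}, "Books": {4}}.get(coll, set())


def get_collection_reclist(coll):
    """Return the recids of 'coll', or an empty set if 'coll' is unknown."""
    try:
        cached = collection_reclist_cache.cache[coll]
    except KeyError:
        # unknown collection, e.g. a typo in a viewrestrcoll authorization
        return set()
    if not cached:
        # not yet in the cache, so calculate it and fill the cache
        cached = load_reclist_from_db(coll)
        collection_reclist_cache.cache[coll] = cached
    return cached


print(2 in get_collection_reclist("Theses"))
print(2 in get_collection_reclist("COLLNAMEWHICHDOESNOTEXIST"))
```

Returning an empty container rather than None matters here, because the caller in the traceback tests membership with "recid in get_collection_reclist(...)".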

CDS Invenio: debug requests

Sometimes, when things don't work, you have to debug. It is useful to have the request information displayed on each page.

To achieve this, you can modify webstyle_templates.py (more precisely, the footer output), as in:


out = """
# the usual footer output AND THEN...

%(reqinfo)s

""" % {
    # ... the usual footer substitutions go here ...
    'reqinfo' : req.__str__(),
}

With the req.__str__() line you output the contents of the request object in a string.
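Since the request dump ends up inside an HTML page, it is safer to escape it first. A minimal sketch, assuming Python 3's `html` module: `render_request_debug` and `FakeRequest` are hypothetical names, and the real request object will print something different.

```python
import html


def render_request_debug(req):
    """Return an HTML fragment with the escaped string form of req."""
    # escape <, >, & and quotes so the dump cannot break the page markup
    return '<pre class="reqinfo">%s</pre>' % html.escape(str(req))


class FakeRequest(object):
    """Hypothetical stand-in for the request object."""

    def __str__(self):
        return "<request uri='/record/1?ln=en&of=hd'>"


print(render_request_debug(FakeRequest()))
```

Remember to remove this kind of output on a production site, as the request may contain information you do not want to show to visitors.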

CDS Invenio: garbage collector

Sometimes a set of records must be deleted from CDS Invenio. If the records have fulltext files (PDFs, JPGs, etc.) as external URLs, that is no issue. On the other hand, if the record contains fulltext files that are managed with bibdocfile, deleting the record ONLY deletes the MARCXML information, and the PDF is still accessible. That is why you need a garbage collector/deleter.

I’ve developed these functions:

def listDeletedFromCollname(colname, coltag='980__c'):
  # returns a list of recids that have been deleted (and that also have coltag=colname)

  from invenio.dbquery import run_sql
  # let run_sql fill in the parameters instead of formatting them into the SQL string
  query = """ SELECT distinct bibrec_bib98x.id_bibrec
              FROM bib98x, bibrec_bib98x
              WHERE bib98x.tag=%s AND bib98x.value = %s AND bib98x.id = bibrec_bib98x.id_bibxxx
                    AND bibrec_bib98x.id_bibrec IN (SELECT DISTINCT bibrec_bib98x.id_bibrec
                                      FROM bibrec_bib98x, bib98x
                                      WHERE bibrec_bib98x.id_bibxxx=bib98x.id AND
                                            bib98x.tag = '980__c' AND
                                            bib98x.value = 'DELETED') """
  res = run_sql(query, (coltag, colname))
  # each row is a 1-tuple holding the recid
  return [row[0] for row in res]
 
def fixTESISborradas(verbose=False):
   from invenio import bibdocfile

   # let's get the list of deleted records which belong to collection 'TESIS' (that is, 980__a=TESIS)
   listadeborradas = listDeletedFromCollname('TESIS','980__a')
   for recid in listadeborradas:
       if verbose:
           print "(fixTESISborradas) About to delete the files attached to '%s'" % recid
       documento = bibdocfile.BibRecDocs(recid)
       for archivo in documento.get_bibdoc_names():
           if verbose:
               print "(fixTESISborradas) ---- File: '%s'" % archivo
           documento.delete_bibdoc(archivo)
       if verbose:
           print "(fixTESISborradas) --- All the files attached to '%s' have been deleted" % recid

You can now add a system cron task that calls fixTESISborradas to delete (rename) the fulltexts of records which have been deleted 🙂
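The query logic can be tried outside Invenio before pointing it at real data. The sketch below rebuilds the two tables in an in-memory SQLite database and runs an equivalent query with bound parameters. The table and column names come from the query above, while the column types, the sample rows and the `list_deleted` name are my own assumptions (and SQLite uses ? placeholders where run_sql uses %s).

```python
import sqlite3

QUERY = """
SELECT DISTINCT link.id_bibrec
FROM bib98x AS b JOIN bibrec_bib98x AS link ON b.id = link.id_bibxxx
WHERE b.tag = ? AND b.value = ?
  AND link.id_bibrec IN (
      SELECT l2.id_bibrec
      FROM bib98x AS b2 JOIN bibrec_bib98x AS l2 ON b2.id = l2.id_bibxxx
      WHERE b2.tag = '980__c' AND b2.value = 'DELETED')
ORDER BY link.id_bibrec
"""


def list_deleted(conn, colname, coltag="980__c"):
    """Return recids flagged DELETED that also have coltag = colname."""
    return [row[0] for row in conn.execute(QUERY, (coltag, colname))]


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bib98x (id INTEGER PRIMARY KEY, tag TEXT, value TEXT);
CREATE TABLE bibrec_bib98x (id_bibrec INTEGER, id_bibxxx INTEGER);
INSERT INTO bib98x VALUES (1, '980__a', 'TESIS'), (2, '980__c', 'DELETED');
-- record 10: TESIS and deleted; 11: TESIS only; 12: deleted only
INSERT INTO bibrec_bib98x VALUES (10, 1), (10, 2), (11, 1), (12, 2);
""")
print(list_deleted(conn, "TESIS", "980__a"))
```

Only record 10 is both in TESIS and flagged as deleted, so it is the only one whose files would be cleaned up.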