Category Archives: Cds Invenio

CDS Invenio 0.99.X: inveniogc ERROR [SOLVED]

Some days ago I noticed there was something wrong with inveniogc. Every time I ran inveniogc -a I got errors like:

2013-04-17 08:31:30 --> 2013-04-17 08:31:30 --> Updating task status to ERROR.
2013-04-17 08:31:30 --> Task #21731 finished. [ERROR]

Calling inveniogc with verbose level 9 gave me some more information (in the var/log/bibsched_task_XXXX.log and .err files):

2013-04-17 08:29:51 --> - deleting queries not attached to any user
 
2013-04-17 08:29:51 -->   SELECT DISTINCT q.id
  FROM query AS q LEFT JOIN user_query AS uq
  ON uq.id_query = q.id
  WHERE uq.id_query IS NULL AND
  q.type <> 'p' 
 
2013-04-17 08:31:30 --> 2013-04-17 08:31:30 --> Updating task status to ERROR.
2013-04-17 08:31:30 --> Task #21731 finished. [ERROR]

The issue arose when inveniogc tried to delete user queries not attached to any user. I edited lib/python/invenio/inveniogc.py and noticed that the error was triggered when the result of a query was printed. I just commented that line out and inveniogc works again:

write_message("""  SELECT DISTINCT q.id\n  FROM query AS q LEFT JOIN user_query AS uq\n  ON uq.id_query = q.id\n  WHERE uq.id_query IS NULL AND\n  q.type <> 'p' """, verbose=9)
result = run_sql("""SELECT DISTINCT q.id
                    FROM query AS q LEFT JOIN user_query AS uq
                    ON uq.id_query = q.id
                    WHERE uq.id_query IS NULL AND
                          q.type <> 'p'""")
 
# write_message(result, verbose=9)

Why is this? It seems that the output buffer that write_message is using is too small to store the result of the previous query, so it fails…
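
Whatever the real cause turns out to be, one way to keep some debugging information without printing the whole result set is to log only a bounded summary. This is a minimal sketch, assuming the result is a sequence of row tuples like the one run_sql returns; summarize_rows and its max_rows parameter are hypothetical names, not part of Invenio:

```python
def summarize_rows(rows, max_rows=5):
    """Return a short, bounded description of a query result."""
    rows = list(rows)
    shown = rows[:max_rows]
    return "%d row(s); first %d: %r" % (len(rows), len(shown), shown)


# Stand-in for the tuple of tuples that a query would return.
fake_result = tuple((i,) for i in range(1000))
print(summarize_rows(fake_result, max_rows=3))
```

In inveniogc.py, the commented-out line could then presumably become write_message(summarize_rows(result), verbose=9).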

CDS Invenio: batch delete records or interval of records (from python interpreter)

Some time ago I came up with this little hack to give Invenio the ability to delete a record from the command line.

If you need to delete a lot of records (e.g. on your testing/development server), you can add this other hack to bibeditcli.py:

Delete several records from invenio: the dirty way

This works, but is not necessarily the way to go. There is another way to achieve the same result (records deleted) that does not overload Bibsched with a task for each record. We’ll go over that one later, though.

First things first: let’s go the dirrrrrty way:

def cli_delete_interval(recid_inicio, recid_fin):
    """
    Delete records from recid_inicio to recid_fin, both included
    You'd better make sure...
    """
    try:
        recid_inicio = int(recid_inicio)
    except ValueError:
        print "ERROR: First Record ID must be integer, not %s:" %recid_inicio
        sys.exit(1)
    try:
        recid_fin = int(recid_fin)
    except ValueError:
        print "ERROR: End record ID must be integer, not %s." %recid_fin
        sys.exit(1)
 
    if recid_inicio > recid_fin:
        print "ERROR: First record ID must be less than last record ID."
        sys.exit(1)
 
    for recid in range(recid_inicio, recid_fin + 1):  # + 1 so that recid_fin is included
        (record, junk) = get_record(CFG_SITE_LANG, recid, 0, "false")
        add_field(recid, 0, record, "980", "", "", "c", "DELETED")
        save_temp_record(record, 0, "%s.tmp" % get_file_path(recid))
        save_xml_record(recid)

This is how you call this new function from python.
First, navigate to $PATH_TO_INVENIO/lib/python and run your python interpreter

[miguel@mydevinvenioinstance ~]# cd /soft/cds-invenio/lib/
[miguel@mydevinvenioinstance lib]# python
[GCC 4.1.2 20080704 (Red Hat 4.1.2-50)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>

And then, just…

>>> import invenio
>>> from invenio.bibeditcli import cli_delete_interval
>>> # the following line will delete records from ID=5125 to ID=7899 .... 
>>> # BE CAREFUL! GREAT POWER COMES WITH GREAT RESPONSIBILITY
>>> 
>>> cli_delete_interval(5125,7899)

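One detail worth checking before running a call like the one above: Python's range() excludes its upper bound, so an interval that is meant to include both ends needs the last ID plus one. Here is a small sketch of the interval validation on its own, raising ValueError instead of calling sys.exit so that it can be reused elsewhere; inclusive_recids is a hypothetical helper name:

```python
def inclusive_recids(first, last):
    """Return the record IDs from first to last, both included."""
    first, last = int(first), int(last)  # raises ValueError on bad input
    if first > last:
        raise ValueError("first record ID must not be greater than the last")
    return range(first, last + 1)


print(list(inclusive_recids("5", 8)))
```
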
Delete several records from Invenio: the not-so-dirty way

If you take a look at the cli_delete_interval we just came up with, or run it over a big interval, you will see that a whole lot of tmp files are generated and a lot of tasks are sent to bibsched (one for every record). Not efficient. Not nice.

This code is better. Just one tmp file and one single task sent to bibsched (the line that would remove the tmp file afterwards is left commented out at the end).
Please notice the # EDIT HERE! part, where tmpfile is set.

def cli_delete_interval(recid_inicio, recid_fin):
    """
    By: Miguel Martin 20120130 
    Goal:
      Delete records from recid_inicio to recid_fin, both included
      Creates just a tmp file and a task (just one) is sent to bibsched
    """
 
    from invenio.bibrecord import record_xml_output
    from invenio.bibtask import task_low_level_submission
 
    # EDIT HERE! FILEPATH MUST BE READABLE/WRITABLE! ######
    tmpfile = "/home/miguelm/tmp/borrado.xml" 
    # #####################################################
 
    try:
        recid_inicio = int(recid_inicio)
    except ValueError:
        print "ERROR: First Record ID must be integer, not %s:" %recid_inicio
        sys.exit(1)
    try:
        recid_fin = int(recid_fin)
    except ValueError:
        print "ERROR: End record ID must be integer, not %s." %recid_fin
        sys.exit(1)
 
    if recid_inicio > recid_fin:
        print "ERROR: First record ID must be less than last record ID."
        sys.exit(1)
 
    fd = open(tmpfile, "w")
    for recid in range(recid_inicio, recid_fin + 1):  # + 1 so that recid_fin is included
        (record, junk) = get_record(CFG_SITE_LANG, recid, 0, "false")
        add_field(recid, 0, record, "980", "", "", "c", "DELETED")
        fd.write(record_xml_output(record))
 
    fd.close()
    task_low_level_submission('bibupload', 'bibedit', '-P', '5', '-r', '%s' % tmpfile)
    #os.system("rm %s" % tmpfile)
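
Instead of editing a hard-coded path, the standard tempfile module can pick a writable location for the single XML file. This is a minimal sketch; write_chunks_to_tmpfile is a hypothetical helper, and the strings passed in stand in for what record_xml_output(record) would produce:

```python
import os
import tempfile


def write_chunks_to_tmpfile(chunks, suffix=".xml"):
    """Write all chunks to one new temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "w") as handle:
        for chunk in chunks:
            handle.write(chunk)
    return path


# Stand-ins for the per-record XML strings.
path = write_chunks_to_tmpfile(["<record>1</record>\n", "<record>2</record>\n"])
print(path)
```

Since the task is only queued at this point, the file presumably has to stay in place until bibsched has actually run it, so any cleanup belongs after that.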

Cheers!

CDS Invenio VAGRANT how-to

Do you know VAGRANT? Well, if you don’t, you should.

If you want a development instance of Invenio, Vagrant is the way to go. Three commands and you’ll have a working installation:

This is the usual way of installing a Vagrant BOX:

    $ vagrant box add base http://files.vagrantup.com/lucid32.box
    $ vagrant init
    $ vagrant up

And here is the Git command to get Invenio Vagrant:

git clone https://github.com/lnielsen-cern/invenio-vagrant

Nice!

CDS Invenio 1: Configure Apache and environment to make OGG and WEBM videos work [SOLVED]

The Invenio 1.0 installation guide does not mention the needed libraries for WebM/OGG videos to play, nor the steps for configuring your system and Apache.

Instead, these are listed in the BibEncode documentation page.

Please note that BibEncode DOES NOT WORK WITH Invenio 1.0 and it is not easily backportable, as stated by Samuele Kaplun.

Anyways, the Invenio demo data contains a video record (called ‘CMS animation of the high-energy collisions at 7 TeV on 30th March 2010’) but the video player was not working. And (you can call me stubborn here) I wanted to make it work.

I’ve made it work (on my testing box, which runs Debian 6.0.5). Some steps might not be required on your system, but just in case…

1. Edit /etc/apache2/mods-enabled/mime.conf and add:

AddType video/ogg .ogv
AddType video/mp4 .mp4
AddType video/webm .webm
AddType video/quicktime .mov

2. Edit /etc/mime.types and add:

audio/webm weba
video/webm webm
video/quicktime mov

3. Uninstall ffmpeg. We will re-install it later, after step 4 is done:

sudo apt-get remove ffmpeg

4. Install the libraries and dependencies needed (OpenJPEG, OGG, Vorbis, Theora, WebM, etc):

sudo apt-get install vorbis-tools-dbg vorbis-tools libtheora-bin libtheora-dbg libtheora-dev libtheora-doc libtheora0 libvpx-dev libvpx-doc libvpx0-dbg libvpx0 libopenjpeg-dev libopenjpeg2-dbg libopenjpeg2  libtwolame0 libtwolame-dev twolame libogg0 libogg-dev libogg-dbg libvorbis-dbg libvorbis-dev libvorbis0a libvorbisenc2 libvorbisfile3 python-pyvorbis-dbg python-pyvorbis

5. Reinstall ffmpeg:

sudo apt-get install ffmpeg

6. And the usual last step…

sudo -u www-data /opt/invenio/bin/inveniocfg --update-all; /etc/init.d/apache2 restart;
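
To double-check step 1 before restarting Apache, a few lines of Python can report which of the AddType mappings are still missing from a configuration file. This is only a sketch: missing_addtypes and REQUIRED_TYPES are hypothetical names, and the parsing covers plain AddType lines like the ones listed above, nothing more elaborate.

```python
REQUIRED_TYPES = {
    ".ogv": "video/ogg",
    ".mp4": "video/mp4",
    ".webm": "video/webm",
    ".mov": "video/quicktime",
}


def missing_addtypes(conf_text, required=REQUIRED_TYPES):
    """Return the required extensions that no AddType line maps correctly."""
    declared = {}
    for line in conf_text.splitlines():
        parts = line.split("#", 1)[0].split()  # drop comments, then tokenize
        if len(parts) >= 3 and parts[0].lower() == "addtype":
            for ext in parts[2:]:
                # Apache accepts extensions with or without the leading dot.
                declared["." + ext.lstrip(".").lower()] = parts[1].lower()
    return sorted(ext for ext, mime in required.items()
                  if declared.get(ext) != mime)


sample = "AddType video/ogg .ogv\nAddType video/webm .webm\n"
print(missing_addtypes(sample))
```

On the box described here, the text to check would be the contents of /etc/apache2/mods-enabled/mime.conf.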

Invenio 1: webaccess ‘become user’ does not work [BUG + PATCH]

The CDS Invenio Webaccess module offers a useful “become user” functionality, which allows users (who have been granted that privilege) to become another user (just as if they had logged in with another account).

This worked like a charm in previous versions of Invenio but, as stated in ‘WebAccess: fix becomeuser exception’, it still does not work in Invenio 1.0.

I could not wait for the Invenio guys to fix this, so I implemented a (dirty-but-working) patch myself.

First, edit the becomeuser function (defined in webaccessadmin.py):
From:

def becomeuser(req, userID='', callback='yes', confirm=0, ln=CFG_SITE_LANG):

To:

def becomeuser(req, userID='', callback='yes', confirm=0, ln=CFG_SITE_LANG, rand='', email_user_pattern='', limit_to='',maxpage=''):

Then edit webuser.py:
Locate this:

try:
    from invenio.session import get_session

And change it to:

try:
    from invenio.session import get_session, save

Also change the setUid function (also defined in webuser.py)
From:

def setUid(req, uid, remember_me=False):
    """It sets the userId into the session, and raise the cookie to the client.
    """
    if hasattr(req, '_user_info'):
        del req._user_info
    session = get_session(req)
    session.invalidate()
    session = get_session(req)
    session['uid'] = uid
    if remember_me:
        session.set_timeout(86400)
    session.set_remember_me(remember_me)
    if uid > 0:
        user_info = collect_user_info(req, login_time=True)
        session['user_info'] = user_info
        req._user_info = user_info
    else:
        del session['user_info']
 
    return uid

To:

def setUid(req, uid, remember_me=False):
    """It sets the userId into the session, and raise the cookie to the client.
    """
    if hasattr(req, '_user_info'):
        del req._user_info
    session = get_session(req)
    session.invalidate()
    session = get_session(req)
    session['uid'] = uid
    if remember_me:
        session.set_timeout(86400)
    session.set_remember_me(remember_me)
    if uid > 0:
        user_info = collect_user_info(req, login_time=True)
        session['user_info'] = user_info
        req._user_info = user_info
    else:
        del session['user_info']
 
    # THE FIX -----------------------
    session.save()
    # -------------------------------
    return uid

Last step, as usual…

sudo -u www-data /opt/invenio/bin/inveniocfg --update-all; /etc/init.d/apache2 restart;

Invenio 1 API: list all restricted records and (restricted/public) collections

Let’s take a quick look at some of the functions provided by the Invenio 1 search_engine API:

Get all the collection names…

>>> import invenio.search_engine
>>> collection_reclist_cache = invenio.search_engine.CollectionRecListDataCacher()
>>> print collection_reclist_cache.cache.keys()

Get all the restricted collections (the ones which have access restrictions configured via webaccess-viewrestrcoll authorizations)…

>>> import invenio.search_engine
>>> restricted_collection_cache = invenio.search_engine.RestrictedCollectionDataCacher()
>>> print restricted_collection_cache.cache

Get all the records which belong to restricted collections

>>> import invenio.search_engine
>>> restricted_collection_cache = invenio.search_engine.RestrictedCollectionDataCacher()
>>> for collection in restricted_collection_cache.cache:
...      print "Collection: '%s'" %collection
...      print "Records: '%s'" %str(repr(invenio.search_engine.get_collection_reclist(collection)))

CDS Invenio 1: Understanding viewrestrcoll action and fixing websearch_engine bug

Just discovered a bug in CDS Invenio 1.0’s search_engine.

The bug

If you define a new authorization for viewing a collection that does not exist, for instance:
action = viewrestrcoll, role = EXAMPLEROLE, with parameter collection='COLLNAMEWHICHDOESNOTEXIST'

then, when a guest user (who does not belong to EXAMPLEROLE) tries to view a record which belongs to a restricted collection, a 500 Server Error is thrown:

Traceback (most recent call last):
  File "/usr/local/lib/python2.6/dist-packages/invenio/webinterface_handler.py", line 413, in _handler
    return root._traverse(req, path, False, guest_p)
  File "/usr/local/lib/python2.6/dist-packages/invenio/webinterface_handler.py", line 250, in _traverse
    result = _check_result(req, obj(req, form))
  File "/usr/local/lib/python2.6/dist-packages/invenio/websearch_webinterface.py", line 1035, in __call__
    (auth_code, auth_msg) = check_user_can_view_record(user_info, self.recid)
  File "/usr/local/lib/python2.6/dist-packages/invenio/search_engine.py", line 292, in check_user_can_view_record
    restricted_collections = get_restricted_collections_for_recid(recid, recreate_cache_if_needed=False)
  File "/usr/local/lib/python2.6/dist-packages/invenio/search_engine.py", line 241, in get_restricted_collections_for_recid
    return [collection for collection in restricted_collection_cache.cache if recid in get_collection_reclist(collection, recreate_cache_if_needed=False)]
  File "/usr/local/lib/python2.6/dist-packages/invenio/search_engine.py", line 390, in get_collection_reclist
    if not collection_reclist_cache.cache[coll]:
KeyError: 'COLLNAMEWHICHDOESNOTEXIST'

Why does this happen? The restricted_collection_cache contains ‘COLLNAMEWHICHDOESNOTEXIST’, but collection_reclist_cache.cache does NOT…

>>> import invenio.search_engine
>>> restricted_collection_cache = invenio.search_engine.RestrictedCollectionDataCacher()
>>> print restricted_collection_cache.cache
['Theses', 'Atlantis Times Drafts', ... , 'COLLNAMEWHICHDOESNOTEXIST']
 
>>> collection_reclist_cache = invenio.search_engine.CollectionRecListDataCacher()
>>> print collection_reclist_cache.cache
{'Autenticacion': None, 'Poetry': None, 'normativa': None, 'otros documetos de sistemas': None, 'Concursos y compras': None, 'solaris': None, 'Theoretical Physics (TH)': None, 'Atlantis Institute Books': None, 'Servidores web': None, 'Equipos de Red': None, 'Compras ': None, 'Sql ': None, 'Virtualizaci\xc3\xb3n': None, 'postgres': None, 'CERN Experiments': None, 'Impresoras': None, 'Gestion': None, 'Atlantis Times Science': None, 'Telefonia': None, 'Cabinas Disco': None, 'Correo': None, 'ngix': None, 'Experimental Physics (EP)': None, 'otros documentos de comunicaciones': None, 'otras': None, 'Nominas': None, 'Bases de datos': None, 'ISOLDE': None, 'Administraci\xc3\xb3n': None, 'Atlantis Times News': None, 'otros SO': None, 'jboss': None, 'Ingres': None, 'Multimedia & Arts': None, 'Otros lenguajes de programaci\xc3\xb3n': None, 'GlassFish ': None, 'BEA WebLogic': None, 'Ordenadores Personales': None, 'apache': None, 'Servidores y web y de aplicaciones': None, 'c': None, 'Compras Licencias de Software y Mantenimiento': None, 'Videos': None, 'Atlantis Times Drafts': None, 'Lenguajes de Programaci\xc3\xb3n': None, 'Otros Gestores de Bases de Datos': None, 'recursos humanos': None, 'Atlantis Institute Articles': None, 'Comunicaciones': None, 'Pictures': None, 'Gesti\xc3\xb3n Economica': None, 'Ordenacion Docente': None, 'Administraci\xc3\xb3n Linux': None, 'Articles & Preprints': None, 'otros': None, 'Atlantis Times Arts': None, 'Otros documentos de op': None, 'Administracion Solaris': None, 'Seguridad': None, 'perl': None, 'Atlantis Times': None, 'otros documentos de gestion': None, 'procedimientos': None, 'Gestion de la Investigaci\xc3\xb3n': None, 'Tomcat': None, 'Theses': None, 'Aulas': None, 'Oracle': None, 'Backup': None, 'geronimo': None, 'Sistemas': None, 'Books': None, 'ALEPH': None, 'Publico': None, 'BASICO': None, 'Maquinas': None, 'otros Servidores de apicaciones /web': None, 'cableado': None, 'Sistemas Operativos': None, 'Servidores web y de aplicaciones': None, 'Mantenimeinto': None, 'Books & Reports': None, 'Gesti\xc3\xb3n Acad\xc3\xa9mica': None, 'CERN Divisions': None, 'Otros documetos de Administraci\xc3\xb3n': None, 'Mac': None, 'Atlantis Institute of Fictive Science': None, 'Dataware': None, 'Articles': None, 'Windows': None, 'Mysql': None, 'Reports': None, 'docencia Virtual': None, 'php': None, 'Preprints': None, 'Java ': None, 'Servicio Web': None, 'wifi': None, 'Python': None, 'Linux': None, 'Incidencias': None, 'Shell Scripts ': None, 'Listas de distribuci\xc3\xb3n': None}
>>>

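Spotting the odd one out by eye in a dump that long is tedious, so the comparison can be done in code. A minimal sketch on sample data shaped like the two caches; restricted_without_collection is a hypothetical helper name:

```python
def restricted_without_collection(restricted_names, reclist_cache):
    """Return restricted collection names that have no entry in the cache."""
    return [name for name in restricted_names if name not in reclist_cache]


# Sample data shaped like restricted_collection_cache.cache and
# collection_reclist_cache.cache.
restricted = ['Theses', 'Atlantis Times Drafts', 'COLLNAMEWHICHDOESNOTEXIST']
reclists = {'Theses': None, 'Atlantis Times Drafts': None, 'Books': None}
print(restricted_without_collection(restricted, reclists))
```

With a real installation, the two arguments would be restricted_collection_cache.cache and collection_reclist_cache.cache.
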
The fix

This is ugly behaviour, as webaccess admins could make typos when defining the authorizations, and I do not want that 500 Server Error to be displayed. So, let’s fix things up.

Here is the get_collection_reclist function as defined in Invenio 1.0 (originally):

def get_collection_reclist(coll, recreate_cache_if_needed=True):
    """Return hitset of recIDs that belong to the collection 'coll'."""
    if recreate_cache_if_needed:
        collection_reclist_cache.recreate_cache_if_needed()
    if not collection_reclist_cache.cache[coll]:
        # not yet it the cache, so calculate it and fill the cache:
        set = intbitset()
        query = "SELECT nbrecs,reclist FROM collection WHERE name=%s"
        res = run_sql(query, (coll, ), 1)
        if res:
            try:
                set = intbitset(res[0][1])
            except:
                pass
        collection_reclist_cache.cache[coll] = set
    # finally, return reclist:
    return collection_reclist_cache.cache[coll]

And here is the patched version (it could also be implemented as a try/except KeyError)…

def get_collection_reclist(coll, recreate_cache_if_needed=True):
    """Return hitset of recIDs that belong to the collection 'coll'."""
    if recreate_cache_if_needed:
        collection_reclist_cache.recreate_cache_if_needed()
 
    # ----- fix ---------------------------------------
    if coll not in collection_reclist_cache.cache:
        # unknown collection: return an empty hitset, so that callers can still test "recid in ..."
        return intbitset()
    # -------------------------------------------------
 
    if not collection_reclist_cache.cache[coll]:
        # not yet it the cache, so calculate it and fill the cache:
        set = intbitset()
        query = "SELECT nbrecs,reclist FROM collection WHERE name=%s"
        res = run_sql(query, (coll, ), 1)
        if res:
            try:
                set = intbitset(res[0][1])
            except:
                pass
        collection_reclist_cache.cache[coll] = set
    # finally, return reclist:
    return collection_reclist_cache.cache[coll]
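
The try/except KeyError variant could look roughly like the sketch below. It runs without Invenio, so everything around the function is a stand-in: FakeHitSet replaces intbitset, the cache dict replaces collection_reclist_cache.cache, and load_reclist replaces the SQL lookup. It returns an empty set for an unknown collection, which keeps membership tests such as recid in get_collection_reclist(...) working in the callers.

```python
class FakeHitSet(set):
    """Stand-in for intbitset, only so the sketch runs without Invenio."""


# Stand-in for collection_reclist_cache.cache: None means "not loaded yet".
cache = {'Theses': None, 'Books': FakeHitSet([1, 2])}


def load_reclist(coll):
    """Stand-in for the SQL lookup done in the original function."""
    return FakeHitSet([10, 11]) if coll == 'Theses' else FakeHitSet()


def get_collection_reclist(coll):
    """Return the hitset for coll, or an empty one if coll is unknown."""
    try:
        cached = cache[coll]
    except KeyError:
        # Unknown collection name (for example a typo in an authorization).
        return FakeHitSet()
    if not cached:
        # Not yet in the cache, so calculate it and fill the cache.
        cached = cache[coll] = load_reclist(coll)
    return cached


print(5 in get_collection_reclist('COLLNAMEWHICHDOESNOTEXIST'))
```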

CDS Invenio: debug requests

Sometimes, when things don’t work, you have to debug. It is useful to have the request information displayed on each page.

To achieve this, you can modify webstyle_templates.py (more precisely, the footer output), as in:


out = """
... the usual footer output AND THEN ...

%(reqinfo)s

""" % {
    # ... the usual footer substitutions AND THEN ...
    'reqinfo' : req.__str__(),
}

With the req.__str__() line you output the contents of the request object in a string.
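
Because the request dump can contain characters such as < and &, it is probably safer to escape it before putting it into the page, and to be able to switch it off. A small sketch of such a helper; request_debug_block, its enabled flag and the FakeRequest class are hypothetical and only there to make the example runnable:

```python
try:
    from html import escape  # Python 3
except ImportError:
    from cgi import escape   # Python 2


def request_debug_block(req, enabled=True):
    """Return an HTML-safe <pre> block describing req, or '' if disabled."""
    if not enabled:
        return ""
    return "<pre>%s</pre>" % escape(str(req))


class FakeRequest(object):
    """Stand-in for the real request object."""

    def __str__(self):
        return "<request uri=/record/1?ln=en&of=hd>"


print(request_debug_block(FakeRequest()))
```

The returned string could then be used as the 'reqinfo' value in the footer template.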

CDS Invenio: garbage collector

Sometimes a set of records must be deleted from CDS Invenio. If the records have fulltext files (PDFs, JPGs, etc.) as external URLs, that is no issue. On the other hand, if a record contains fulltext files that are managed with bibdocfile, when you delete the record ONLY the MARCXML information is deleted, and the PDF is still accessible. That is why you need a garbage collector/deleter.

I’ve developed these functions:

def listDeletedFromCollname(colname, coltag='980__c'):
  # returns a list of recids that have been deleted (and that also have coltag=colname)
 
  from invenio.dbquery import run_sql
  # the values are passed as parameters, so that run_sql takes care of the quoting
  query = """ SELECT distinct bibrec_bib98x.id_bibrec
              FROM bib98x, bibrec_bib98x
              WHERE bib98x.tag=%s AND bib98x.value = %s AND bib98x.id = bibrec_bib98x.id_bibxxx
                    AND bibrec_bib98x.id_bibrec IN (SELECT DISTINCT bibrec_bib98x.id_bibrec
                                      FROM bibrec_bib98x, bib98x
                                      WHERE bibrec_bib98x.id_bibxxx=bib98x.id AND
                                            bib98x.tag = '980__c' AND
                                            bib98x.value = 'DELETED') """
  res = run_sql(query, (coltag, colname))
  # each row is a 1-tuple holding the recid
  return [row[0] for row in res]
 
def fixTESISborradas(verbose=False):
   from invenio import bibdocfile
 
   # let's get the list of deleted records which belong to collection 'TESIS' (that is, 980__a=TESIS)
   listadeborradas = listDeletedFromCollname('TESIS','980__a')
   for recid in listadeborradas:
       if verbose:
           print "(listaTESISborradas) About to delete the files attached to '%s'" % recid
       documento = bibdocfile.BibRecDocs(recid)
       for archivo in documento.get_bibdoc_names():
           if verbose:
               print "(listaTESISborradas) ---- File: '%s'" %archivo
           documento.delete_bibdoc(archivo)
       if verbose:
           print "(listaTESISborradas) --- All the files attached to '%s' have been deleted" % recid

You can now add a system cron task which calls fixTESISborradas to delete (rename) the fulltexts of records which have been deleted :-)
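
To see what the query in listDeletedFromCollname selects, it can be tried on a toy database. The sketch below uses an in-memory SQLite database as a stand-in for the Invenio one: the two tables are reduced to the columns the query touches, the data is invented, and SQLite takes ? placeholders where run_sql takes %s.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bib98x (id INTEGER PRIMARY KEY, tag TEXT, value TEXT);
    CREATE TABLE bibrec_bib98x (id_bibrec INTEGER, id_bibxxx INTEGER);
    INSERT INTO bib98x VALUES (1, '980__a', 'TESIS'), (2, '980__c', 'DELETED'),
                              (3, '980__a', 'ARTICLE');
    -- record 10: TESIS and deleted; 11: TESIS only; 12: ARTICLE and deleted
    INSERT INTO bibrec_bib98x VALUES (10, 1), (10, 2), (11, 1), (12, 3), (12, 2);
""")


def list_deleted(conn, colname, coltag='980__c'):
    """Return recids carrying coltag=colname that are also marked DELETED."""
    query = """SELECT DISTINCT link.id_bibrec
               FROM bib98x AS b
               JOIN bibrec_bib98x AS link ON b.id = link.id_bibxxx
               WHERE b.tag = ? AND b.value = ?
                 AND link.id_bibrec IN (
                     SELECT l2.id_bibrec
                     FROM bib98x AS b2
                     JOIN bibrec_bib98x AS l2 ON b2.id = l2.id_bibxxx
                     WHERE b2.tag = '980__c' AND b2.value = 'DELETED')
               ORDER BY link.id_bibrec"""
    return [row[0] for row in conn.execute(query, (coltag, colname))]


print(list_deleted(conn, 'TESIS', '980__a'))
```

Only record 10 is both in TESIS and marked as deleted, so it is the only one whose files would be cleaned up.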

CDS Invenio: unblock records that are “currently being edited by another user” [SOLVED]

When someone is editing a record via the bibedit web interface, the record is temporarily blocked by Invenio. If anyone else tries to edit the record via bibedit at the same time, the message “This record is currently being edited by another user” is shown. If you want to unblock that record (without waiting CFG_BIBEDIT_TIMEOUT seconds), you can run:

sudo -u apache inveniogc -a

And then run the bibsched tasks.

If it is still blocked, then delete $PATHTOINVENIO/var/tmp/bibedit_record_$RECID*

All the interesting stuff is in bibedit_engine.py. More precisely, in the get_record function:

def get_record(ln, recid, uid, temp):
    """Returns a record dict, and warning message in case of error. """
    #FIXME: User doesn't get submit button if reloading BibEdit-page
    #FIXME: User will get warning of changes being temporary when reloading
    #   BibEdit-page, even though no changes have been made.
 
    warning_temp_file = ''
    file_path = get_file_path(recid)
 
    if temp != "false":
        warning_temp_file = bibedit_templates.tmpl_warning_temp_file(ln)
 
    if os.path.isfile("%s.tmp" % file_path):
 
        (uid_record_temp, record) = get_temp_record("%s.tmp" % file_path)
        if uid_record_temp != uid:
 
            time_tmp_file = os.path.getmtime("%s.tmp" % file_path)
            time_out_file = int(time.time()) - CFG_BIBEDIT_TIMEOUT
 
            if time_tmp_file < time_out_file :
                os.system("rm %s.tmp" % file_path)
                record = create_record(print_record(recid, 'xm'))[0]
                save_temp_record(record, uid, "%s.tmp" % file_path)
 
            else:
                record = ''
 
        else:
            warning_temp_file = bibedit_templates.tmpl_warning_temp_file(ln)
 
    else:
        record = create_record(print_record(recid, 'xm'))[0]
        save_temp_record(record, uid, "%s.tmp" % file_path)
 
    return (record, warning_temp_file)
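
The timeout test in get_record compares the modification time of the .tmp file with the current time minus CFG_BIBEDIT_TIMEOUT. The same comparison can be used to list the lock files that have already expired before deleting anything by hand. This is a sketch under assumptions: stale_lock_files is a hypothetical helper, and the bibedit_record_*.tmp name pattern is inferred from the paths mentioned above.

```python
import glob
import os
import tempfile
import time


def stale_lock_files(tmpdir, timeout, now=None):
    """Return bibedit_record_*.tmp files in tmpdir older than timeout seconds."""
    now = time.time() if now is None else now
    pattern = os.path.join(tmpdir, "bibedit_record_*.tmp")
    return sorted(path for path in glob.glob(pattern)
                  if os.path.getmtime(path) < now - timeout)


# Demo in a throwaway directory: one old lock file and one fresh one.
demo_dir = tempfile.mkdtemp()
old = os.path.join(demo_dir, "bibedit_record_1.tmp")
new = os.path.join(demo_dir, "bibedit_record_2.tmp")
for path in (old, new):
    open(path, "w").close()
os.utime(old, (0, 0))  # pretend the first file was last touched in 1970
print([os.path.basename(p) for p in stale_lock_files(demo_dir, timeout=3600)])
```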

If a message like “There is a new revision of …” shows up, then another user is running an MBI/SRV submission on the record. Until this user finishes the MBI/SRV process and the associated bibsched task is run, the record stays blocked.

More options regarding the lock level are shown in invenio.conf:

## CFG_BIBEDIT_LOCKLEVEL -- when a user tries to edit a record being edited by
## another user, the lock level determines when it is permitted to do so.
## Level 0 - permits editing if there are no recent edit sessions in tmp directory
##           (unsafe, use only if you know what you are doing)
## Level 1 - permits editing if there are no queued bibedit tasks for this record
##           (safe with respect to bibedit, but not for other bibupload maintenance jobs)
## Level 2 - permits editing if there are no queued bibupload tasks of any sort
##           (safe, but may lock more than necessary if many cataloguers around)
## Level 3 - permits editing if no queued bibupload task concerns given record
##           (safe, most precise locking, but slow,
##            checks for 001/EXTERNAL_SYSNO_TAG/EXTERNAL_OAIID_TAG)
## The recommended level is 3 (default) or 2 (if you use maintenance jobs often).
CFG_BIBEDIT_LOCKLEVEL = 3

The inveniogc -a command is supposed to delete the temporary logs, guest user related information, caches, deleted documents and done tasks. Surprisingly, though, it does not delete every file in $PATH_TO_INVENIO/var/tmp/* (called CFG_TMPDIR in config.py); it only deletes those called rec_fmt_*, bibharvestadmin*, bibconvertrun.*, oaiharvest*, oai_archive*. There are a lot of other files that are not deleted (for instance, some called DOCID-RN-date_time). I am still wondering whether these files can be safely deleted manually.
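
To get an idea of how many files fall outside those patterns, the file names can be filtered with the standard fnmatch module. This is a minimal sketch; not_cleaned is a hypothetical helper, the pattern list is simply copied from the paragraph above, and the sample names are invented.

```python
import fnmatch

# Name patterns that, according to the paragraph above, inveniogc -a cleans up.
CLEANED_PATTERNS = ["rec_fmt_*", "bibharvestadmin*", "bibconvertrun.*",
                    "oaiharvest*", "oai_archive*"]


def not_cleaned(filenames, patterns=CLEANED_PATTERNS):
    """Return the file names that match none of the cleaned-up patterns."""
    return [name for name in filenames
            if not any(fnmatch.fnmatchcase(name, pat) for pat in patterns)]


names = ["rec_fmt_123", "oaiharvest_x.xml", "bibedit_record_42.tmp", "notes.txt"]
print(not_cleaned(names))
```

On a real installation the list of names would come from os.listdir() on the tmp directory. The function only reports names and deletes nothing.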

I have been looking at the websubmit_functions. Both Insert_Record and Insert_Modify_Record write in CFG_TMPDIR.