Tag Archives: Fulltext

CDS Invenio: garbage collector

Sometimes a set of records must be deleted from CDS Invenio. If the records' fulltext files (PDFs, JPGs, etc.) are external URLs, that is no issue. On the other hand, if the record contains fulltext files that are managed with bibdocfile, deleting the record removes ONLY the MARCXML information; the PDF is still accessible. That is why you need a garbage collector/deleter.

I’ve developed these functions:

def listDeletedFromCollname(colname, coltag='980__c'):
  # returns a list of recids that have been deleted (and that also meet coltag=colname)

  from invenio.dbquery import run_sql
  # coltag and colname are passed as query parameters, so run_sql takes care of the quoting
  query = """ SELECT DISTINCT bibrec_bib98x.id_bibrec
              FROM bib98x, bibrec_bib98x
              WHERE bib98x.tag = %s AND bib98x.value = %s AND bib98x.id = bibrec_bib98x.id_bibxxx
                    AND bibrec_bib98x.id_bibrec IN (SELECT DISTINCT bibrec_bib98x.id_bibrec
                                      FROM bibrec_bib98x, bib98x
                                      WHERE bibrec_bib98x.id_bibxxx=bib98x.id AND
                                            bib98x.tag = '980__c' AND
                                            bib98x.value = 'DELETED') """
  res = run_sql(query, (coltag, colname))
  # each row is a one-element tuple holding the recid
  return [row[0] for row in res]
 
def fixTESISborradas(verbose=False):
    from invenio import bibdocfile

    # let's get the list of deleted records that belong to collection 'TESIS' (that is, 980__a=TESIS)
    listadeborradas = listDeletedFromCollname('TESIS', '980__a')
    for recid in listadeborradas:
        if verbose:
            print("(listaTESISborradas) About to delete the files attached to '%s'" % recid)
        documento = bibdocfile.BibRecDocs(recid)
        for archivo in documento.get_bibdoc_names():
            if verbose:
                print("(listaTESISborradas) ---- File: '%s'" % archivo)
            documento.delete_bibdoc(archivo)
        if verbose:
            print("(listaTESISborradas) --- All the files attached to '%s' have been deleted" % recid)

You can now add a system cron task that calls fixTESISborradas to delete (rename) the fulltexts of records that have been deleted :-)
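If you want to try the selection logic without touching a live Invenio database, here is a minimal sketch using an in-memory SQLite database. It is only a stand-in: the two tables are reduced to the columns the query uses, the sample rows are made up, SQLite uses `?` placeholders instead of `%s`, and the names `list_deleted` and `build_demo_db` are hypothetical.

```python
import sqlite3

# Reduced stand-ins for Invenio's bib98x / bibrec_bib98x tables (assumption:
# only the columns used by the query are modelled here).
SCHEMA = """
CREATE TABLE bib98x (id INTEGER PRIMARY KEY, tag TEXT, value TEXT);
CREATE TABLE bibrec_bib98x (id_bibrec INTEGER, id_bibxxx INTEGER);
"""

QUERY = """
SELECT DISTINCT rb.id_bibrec
FROM bib98x AS b JOIN bibrec_bib98x AS rb ON b.id = rb.id_bibxxx
WHERE b.tag = ? AND b.value = ?
  AND rb.id_bibrec IN (
      SELECT rb2.id_bibrec
      FROM bib98x AS b2 JOIN bibrec_bib98x AS rb2 ON b2.id = rb2.id_bibxxx
      WHERE b2.tag = '980__c' AND b2.value = 'DELETED')
ORDER BY rb.id_bibrec
"""

def list_deleted(conn, colname, coltag="980__c"):
    """Return the recids that match coltag=colname and are marked DELETED."""
    rows = conn.execute(QUERY, (coltag, colname)).fetchall()
    return [row[0] for row in rows]

def build_demo_db():
    """Create a tiny database with two theses and one article (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO bib98x VALUES (?, ?, ?)", [
        (1, "980__a", "TESIS"),
        (2, "980__c", "DELETED"),
        (3, "980__a", "ART"),
    ])
    conn.executemany("INSERT INTO bibrec_bib98x VALUES (?, ?)", [
        (10, 1), (10, 2),   # thesis that has been deleted
        (11, 1),            # thesis that is still alive
        (12, 3), (12, 2),   # article that has been deleted
    ])
    return conn

conn = build_demo_db()
deleted_theses = list_deleted(conn, "TESIS", "980__a")
print(deleted_theses)
```

Only record 10 is both a thesis and marked as deleted, so it is the only one whose files the garbage collector would touch.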

CDSInvenio: why some of my records do not have stats in rnkDOWNLOADS table? [SOLVED]

Some days ago I wrote a post about CDS Invenio records’ number-of-downloads stats.

I noticed that some of my records did not have stats in rnkDOWNLOADS table and I was wondering why.

This is related to the way in which records have been submitted to the repository. If the record has been submitted with the bibupload command line utility, the URL (tag 8564) of the fulltext is considered external and Invenio won’t log the download count for that file in rnkDOWNLOADS.

You can check whether the associated fulltexts are considered external or internal using the bibdocfile command line utility:

bibdocfile --get-info --recid=XXXX

It will list the internal fulltexts associated with record XXXX.

If you want the fulltexts to be considered internal, you can use bibdocfile:

sudo -u apache bibdocfile --append http://zaguan.unizar.es/tesis/1913/1913_1.pdf --recid=1913

Note that this will modify your record (identified by recid=1913) adding a new 8564_u tag (http://zaguan.unizar.es/tesis/1913/1913_1.pdf).
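To get a quick idea of which URLs in your records are probably not bibdocfile-managed, a rough sketch follows. It assumes, judging only from the URLs shown in these posts, that managed files are served under `<site>/record/<recid>/files/<name>`. The helper name `is_bibdocfile_url` is hypothetical and this is just a URL-shape heuristic; `bibdocfile --get-info` remains the real check.

```python
import re

def is_bibdocfile_url(url, site_url):
    """True if url looks like <site_url>/record/<recid>/files/<name>."""
    pattern = r"^%s/record/\d+/files/[^/]+$" % re.escape(site_url.rstrip("/"))
    return re.match(pattern, url) is not None

SITE_URL = "http://zaguan.unizar.es"

# URL shape of a file managed by bibdocfile (taken from the posts in this archive)
internal = is_bibdocfile_url(
    "http://zaguan.unizar.es/record/3277/files/ART--2009-009.pdf", SITE_URL)
# URL shape of a fulltext uploaded as a plain 8564 URL
external = is_bibdocfile_url(
    "http://zaguan.unizar.es/tesis/1913/1913_1.pdf", SITE_URL)

print(internal, external)
```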

CDSInvenio: restrict access to some fulltexts to an iprange (II)

My last article about restricting access to some fulltexts to an iprange had a few gaps and mistakes.

It worked perfectly if the type of fulltext (always speaking in terms of ‘public’ or ‘private’) was only set in the SBI process (that is, set by the submitter).

There are a few steps to follow in order to make it work if you want the REFEREE to be able to change this type.

First of all I will tell you about some of the restrictions CDS Invenio has in its APP and SBI pipelines. These tips will be very helpful to understand the quirks of the websubmit module.

websubmit quirks: a quick guide

1. Get_Report_Number must ALWAYS be called *before* Get_Recid.

2. Get_Recid must be called before any other function that uses the sysno variable.

3. If you are wondering what the steps in APP mean, read the following lines:

  • Step one includes the functions that prepare the record and ends with a CaseEDS (which is something like an if-then-else).
  • Step two: functions to be executed if the record is approved.
  • Step three: functions to be executed if the record is rejected.

4. More tips about APP: Get_Recid relies on Move_from_Pending having already run (otherwise, there’s no way for Get_Recid to discover the recid of a record that has never really been submitted).
If you want to know why, just read the following lines:

If you put Get_Recid before Move_from_Pending, Get_Recid will complain (it’ll spit out something like "the record could not be found in our database") because it can’t really find the recid (which should be stored under the name SN in the curdir directory, which in APP is only populated after a call to Move_from_Pending)…

5. Once Get_Recid is called (no matter in what step), the sysno variable is only kept for the rest of that step, not during the whole APP pipeline.
The jumping back and forth between steps is implemented through client browser redirection in Javascript and not server side. Hence, each time the step changes (e.g. after the execution of CaseEDS) a new request is made to the server, which is more or less like running a new Python process, i.e. the global variables are lost. So you should call Get_Report_Number and Get_Recid in every step whose WebSubmit functions expect to find the report number and the recid/sysno as global variables.
Please note that in APP you should NOT call Get_Recid in step one (before Move_from_Pending). Refer to (4) for a deeper explanation.

6. Move_to_Done / Move_to_Pending must be called at the end of the process, no matter whether it is APP, SBI, etc. Move_to_Pending is meant to be used in e.g. the SBI action of a refereed submission, since the submitted record will be moved to pending status for later approval/rejection. For a normal submission you would use Move_to_Done at the end of any action.

Now, let’s go for it

The idea:

To solve this I would propose… that you invent another variable (e.g. REFEREE_PRV_PUB) that will be filled in by the referee in the APP action, with three possible values: public, private and "" (i.e. nothing). Then you can extend your Fulltext_Status function to read this REFEREE_PRV_PUB variable first (from the filesystem, in the curdir) and, if it is empty, to read PRV_PUB as filled in by the submitter.

So the final configuration might be that you call Fulltext_Status only in steps 2 and 3 of APP, after Get_Recid, which comes after Move_from_Pending, which comes after Get_Report_Number.
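The ordering chain above is easy to get wrong when editing the pipeline by hand, so here is a small sketch of a checker for one APP step. It only models the chain described in this post (it knows nothing about the real WebSubmit engine), and `ORDER_RULES` and `check_step` are hypothetical names.

```python
# Ordering constraints taken from the text above, as (earlier, later) pairs.
ORDER_RULES = [
    ("Get_Report_Number", "Move_from_Pending"),
    ("Move_from_Pending", "Get_Recid"),
    ("Get_Recid", "Fulltext_Status"),
]

def check_step(functions):
    """Return a list of ordering problems found in one APP step."""
    problems = []
    position = {name: index for index, name in enumerate(functions)}
    for earlier, later in ORDER_RULES:
        if later not in position:
            continue  # the rule does not apply to this step
        if earlier not in position:
            problems.append("%s needs %s earlier in the same step" % (later, earlier))
        elif position[earlier] > position[later]:
            problems.append("%s must come before %s" % (earlier, later))
    return problems

good_step = ["Get_Report_Number", "Move_from_Pending", "Get_Recid",
             "Fulltext_Status", "Move_to_Done"]
bad_step = ["Get_Recid", "Get_Report_Number", "Move_from_Pending",
            "Fulltext_Status"]

print(check_step(good_step))
print(check_step(bad_step))
```

The second list puts Get_Recid first, which is exactly the case that produces the "record could not be found" complaint.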

Easy, isn’t it? :-P

Changes in Fulltext_Status.py:

My code now looks like this (the REFEREE_PRV_PUB handling is what changed):

import os
from invenio.bibdocfile import BibRecDocs

def _read_value(form, curdir, name):
    """ Returns the value of the form element `name`, or "" if it is not set.
        The form is checked first; if the element is not there, fall back
        on the file that WebSubmit stored in curdir
    """
    if form.has_key(name):
        return form[name].strip()
    path = os.path.join(curdir, name)
    if os.path.exists(path):
        return open(path).read().strip()
    return ""

def Fulltext_Status(parameters, curdir, form, user_info=None):
    """ Sets the status of the fulltexts to prv if the record is marked
        as private. The referee's choice overrides the submitter's one
    """

    # sysno is a WebSubmit global variable set by Get_Recid
    bibrecdocs = BibRecDocs(sysno)

    # Due to the constraints in APP pipeline two variables
    # (REFEREE_PRV_PUB and PRV_PUB) must be used
    # to avoid the overwriting of the variables
    # caused by Move_From_Pending (in the APP pipeline)
    # --------------------------------------------------------
    # Get_Recid relies on Move_From_Pending to read sysno
    # Move_From_Pending must be called AFTER Get_Report_Number
    # if Get_Recid is called before Move_From_Pending (in APP) the system will spit an
    # error message like "That record could not be found in our database"

    # if REFEREE_PRV_PUB is set and not equal to "" its value must be used
    # instead of the one in PRV_PUB!!!
    prv_pub = _read_value(form, curdir, 'REFEREE_PRV_PUB')
    if prv_pub == "":
        prv_pub = _read_value(form, curdir, 'PRV_PUB')

    if prv_pub == 'private':
        # then status must be set to prv
        for bibdoc in bibrecdocs.list_bibdocs():
            bibdoc.set_status('prv')

    elif prv_pub == 'public':
        # then status must be set to ""
        for bibdoc in bibrecdocs.list_bibdocs():
            bibdoc.set_status('')

    return ""

Changes in ART.tpl

It should have two lines like:

PRV_PUB---<:PRV_PUB:>
REFEREE_PRV_PUB---<:REFEREE_PRV_PUB:>

Changes in ARTcreate.tpl

We’ll just modify the lines in charge of setting the 984__a field to ‘private’ or ‘public’.
If REFEREE_PRV_PUB has a value (not equal to “”), this value must override the one in PRV_PUB (the referee has the last word on the article’s type). This is done as follows:

984a::IFDEFP(REFEREE_PRV_PUB,,0)---<:REFEREE_PRV_PUB::REFEREE_PRV_PUB:>
984a::IFDEFP(REFEREE_PRV_PUB,,1)---<:PRV_PUB::PRV_PUB:>
END::DEFP()---
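As I read the two IFDEFP lines above, the first one fires when REFEREE_PRV_PUB is defined and the second one when it is not. Under that assumption, the value that ends up in 984__a can be modelled with a few lines of Python (the function name `resolve_984a` is hypothetical, and this is only a model of the intended result, not of the template engine):

```python
def resolve_984a(prv_pub, referee_prv_pub=""):
    """Value expected in 984__a: the referee's choice wins when it is set."""
    return referee_prv_pub if referee_prv_pub else prv_pub

# submitter's choice, referee's choice
cases = [("public", ""), ("public", "private"), ("private", "public")]
for submitter, referee in cases:
    print("%-8s %-8s -> %s" % (submitter, referee or '""',
                               resolve_984a(submitter, referee)))
```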

And now, the SBI and APP pipelines

There is no need for Fulltext_Status to be called in SBI, so delete it from there ;)
In APP, put Fulltext_Status ONLY in steps two and three. It should end up being something like:
[Image: new APP process]

CDSInvenio: restrict access to some fulltexts to an iprange

[NOTE: Only available in English, my apologies to Spanish speakers]

[Edit note: if you want to be able to change the type of fulltext (in terms of public/private) from the APP(roval) pipeline you should check the second part of this article]

Introduction

One of the most common issues when addressing the problem of publishing scientific production (that is, for instance, published articles) in an OAI repository is that fulltexts are usually under restrictive licensing. I mean, journals are not very happy with the idea of their fulltexts being public.

But this concept is totally against my idea of OAI repositories, in which fulltexts are supposed to be public and accessible to everyone. Most universities are engaged in a debate about how to solve this issue. Meanwhile, we IT people must provide a solution for the problem.

In our case the solution is simple: restrict access to copyrighted fulltexts to University members only (students, professors, researchers…). Now what you might be wondering is: how to do this with CDSInvenio?

What we want to achieve is: in the submission form (and maybe in the approval form too) we show a select box which asks the submitter if the fulltext is public or private. If the document is marked as private the associated fulltext will be only available from a certain iprange (in my case I slightly modified the functionality using EZProxy so that our users can access the contents from their home).

Restrict access to some fulltexts using CDSInvenio: step by step

1. I’ve defined the new Function

called Fulltext_Status.py.
Tip: make sure that the function name is equal to the file’s name without the trailing .py

1.5 I’ve modified the template associated to submission of ART (article) files

The associated files (in my case) are:

  • ART.tpl
  • ARTcreate.tpl

In ART.tpl I’ve added the following line:

PRV_PUB---<:PRV_PUB:>

In ARTcreate.tpl I modified the line that generates the 8564_u MARC tag as follows:

8564u::IFDEFP(DEMOART_FILE_RENAMED,,0)---<datafield tag="856" ind1="4" ind2=" "><subfield code="u">http://zaguan.unizar.es/record/<:SN::SN:>/files/<:DEMOART_FILE_RENAMED::DEMOART_FILE_RENAMED:></subfield><subfield code="z">Fulltext</subfield></datafield>

which basically means: write the URL of the file in the 8564_u tag if there is an attached file.

I also added a 984__a tag (9xx tags are used for administration purposes), which shows whether the record is considered private or public:

984a::DEFP---<:PRV_PUB::PRV_PUB:>

2. I’ve added the new function to the SBI process

Tip: the position of the function is quite important. It has to be *AFTER* the Move_Files_to_Storage call because it is Move_Files_to_Storage that allocates the bibdocs and associates them with the record.

[Image: SBI process]

3. And to the APP process.

Tip 1: the position of the function in the approval process is also important. It has to be just *BEFORE* the Move_to_Done function because this should be the last function ever to be called (since it packs up everything and archives it).

Tip 2: Moreover, in the APP action, due to an architectural limitation of WebSubmit, functions executed in any step other than the first one will not have the form dictionary.

[Image: APP process in CDS Invenio]

In order to be able to read the parameter PRV_PUB in your function you should instead try to use the filesystem as in:

if form.has_key("PRV_PUB"):
    prv_pub = form['PRV_PUB']
elif os.path.exists(os.path.join(curdir, 'PRV_PUB')):
    prv_pub = open(os.path.join(curdir, 'PRV_PUB')).read()
else:
    prv_pub = ""

This way, if the form element is not there, you fall back on the filesystem where, if everything went correctly, the form element should have been stored in a file just before entering step 1.
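The fallback can be tried outside Invenio with a temporary directory playing the role of curdir. This is only a sketch: a plain dict stands in for the WebSubmit form, the helper name `read_submission_value` is hypothetical, and whether the stored file ends with a newline is an assumption (the `strip()` guards against it either way).

```python
import os
import tempfile

def read_submission_value(name, form, curdir):
    """Prefer the form value; otherwise read the file stored in curdir."""
    if name in form:
        return form[name]
    path = os.path.join(curdir, name)
    if os.path.exists(path):
        with open(path) as handle:
            return handle.read().strip()
    return ""

# A temporary directory stands in for the submission's curdir.
curdir = tempfile.mkdtemp()
with open(os.path.join(curdir, "PRV_PUB"), "w") as handle:
    handle.write("private\n")

from_form = read_submission_value("PRV_PUB", {"PRV_PUB": "public"}, curdir)
from_file = read_submission_value("PRV_PUB", {}, curdir)
missing = read_submission_value("REFEREE_PRV_PUB", {}, curdir)

print(from_form, from_file, repr(missing))
```

The second call is what happens in steps two and three of APP, where the form dictionary is empty and only the file is left.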

4. Then I have created a new element description

called PRV_PUB
[Image: SBI process]
This is just the important part:

<select name="PRV_PUB">
   <option>Seleccione el carácter de su publicación:</option>
   <option value="private">Private</option>
   <option value="public">Public</option>
</select>

5. I’ve added this new element to the SBI and APP page

The image below just shows the SBI page (the process for adding it to the APP page is pretty similar)

[Image: SBI form page in CDS Invenio]

6. Next I have created the role IP_UZ

Just go to the $CFG_SITE_URL/admin/webaccess/webaccessadmin.py URL and add a new role:

   allow email "miguelm@unizar.es"
   allow remote_ip "155.210."
   deny all
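To reason about who ends up inside the role, here is a rough model of what the three rules are meant to do. It is not how WebAccess evaluates them: I am assuming that the prefix "155.210." is meant to cover the whole 155.210.0.0/16 range, and `role_matches` is a hypothetical name.

```python
import ipaddress

# Assumption: the "155.210." prefix stands for the 155.210.0.0/16 network.
ALLOWED_NETWORK = ipaddress.ip_network("155.210.0.0/16")
ALLOWED_EMAILS = {"miguelm@unizar.es"}

def role_matches(email, remote_ip):
    """Mirror the three rules: allow by email, allow by IP range, deny the rest."""
    if email in ALLOWED_EMAILS:
        return True
    if ipaddress.ip_address(remote_ip) in ALLOWED_NETWORK:
        return True
    return False

print(role_matches("guest", "155.210.12.34"))         # inside the campus range
print(role_matches("guest", "10.0.0.1"))              # outside, unknown user
print(role_matches("miguelm@unizar.es", "10.0.0.1"))  # outside, allowed email
```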

7. Then I have connected this new role

… to the viewrestrdoc action with status=prv (prv is the status code I used in the Fulltext_Status function, refer to section 1).
[Image: authorization details in CDS Invenio]

Then I submit a new element, approve it and check if I can see the fulltext pdf from my computer (155.210.XX.YY). Great, I can.

Tip: You can check the status of recently submitted files using:

/soft/cds-invenio/bin/bibdocfile --get-info --recid 3277

You should see something like this:

3277::::total bibdocs attached=1
3277::::total size latest version=714.1 KB
3277::::total size all files=714.1 KB
3277:225:::docname=ART--2009-009
3277:225:::doctype=DEMOART_FILE
3277:225:::status=prv
3277:225:::basedir=/soft/cds-invenio/var/data/files/g0/225
3277:225:::creation date=2009-05-13 13:07:55
3277:225:::modification date=2009-05-13 13:08:40
3277:225:::total file attached=1
3277:225:::total size latest version=714.1 KB
3277:225:::total size all files=714.1 KB
3277:225:1:.pdf:fullpath=/soft/cds-invenio/var/data/files/g0/225/ART--2009-009.pdf;1
3277:225:1:.pdf:fullname=ART--2009-009.pdf
3277:225:1:.pdf:name=ART--2009-009
3277:225:1:.pdf:status=prv
3277:225:1:.pdf:checksum=abccd8b54af1c1fb10f4ad3a7e93151a
3277:225:1:.pdf:size=714.1 KB
3277:225:1:.pdf:creation time=2009-06-15 13:28:44
3277:225:1:.pdf:modification time=2009-05-13 13:07:55
3277:225:1:.pdf:encoding=None
3277:225:1:.pdf:url=http://zaguan.unizar.es/record/3277/files/ART--2009-009.pdf
3277:225:1:.pdf:description=None
3277:225:1:.pdf:comment=Texto completo
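If you need to check many records, the listing above is easy to parse. The sketch below assumes, from the sample output only, that every line has the shape recid:docid:version:format:key=value. The field names are my reading of the output, and `parse_get_info` and `bibdoc_statuses` are hypothetical helpers.

```python
def parse_get_info(text):
    """Parse get-info lines into (recid, docid, version, format, key, value)."""
    rows = []
    for line in text.splitlines():
        # split on the first four colons only: values may contain colons too
        parts = line.strip().split(":", 4)
        if len(parts) != 5 or "=" not in parts[4]:
            continue  # skip anything that does not look like a data line
        key, _, value = parts[4].partition("=")
        rows.append(tuple(parts[:4]) + (key, value))
    return rows

def bibdoc_statuses(text):
    """Map docid -> status, using the bibdoc-level lines (empty version field)."""
    return {docid: value
            for recid, docid, version, fmt, key, value in parse_get_info(text)
            if docid and not version and key == "status"}

SAMPLE = """\
3277::::total bibdocs attached=1
3277:225:::docname=ART--2009-009
3277:225:::status=prv
3277:225:::creation date=2009-05-13 13:07:55
3277:225:1:.pdf:status=prv
3277:225:1:.pdf:url=http://zaguan.unizar.es/record/3277/files/ART--2009-009.pdf
"""

statuses = bibdoc_statuses(SAMPLE)
print(statuses)
```

Here bibdoc 225 is reported with status prv, which is what Fulltext_Status should have set for a private fulltext.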

After that I use a proxy to connect to my repository and, without being logged in to Invenio, I try to access the fulltext. A “this file is restricted” text appears. Cool, it worked! :)

Going further: using EZProxy

At this point we know how to restrict the access to fulltexts to an iprange. If we want our users (who are validated against an LDAP system) to be able to access the fulltexts from their home (and not only from the university’s IPs) we can use something like EZProxy. From a high-level point of view, this software gives an internal IP to outside connections (only if the user is able to log in to EZProxy).

I will not explain here how to install/use/configure this software because there is plenty of documentation on their website. Let’s suppose we have it running already.

The steps I took to make CDS work with EZProxy were:

8. I slightly modified the Bibformat element which shows URLs

(in my case, called bfe_fulltext_light.py) so that it ends up being something like this (I just show the main_urls part; the lines that matter are marked with "EZProxy" comments. Please ignore the dx.doi.org part):

ezproxy_url = 'http://roble.unizar.es:9090/login?url=' # EZProxy: your URL to ezproxy
if main_urls:
        last_name = ""
        for descr, urls in main_urls.items():
            url_list = []
            # sort the (url, name, format) tuples by url
            urls.sort(key=lambda item: item[0])

            for url, name, format in urls:
                last_name = name
                if show_icons.lower() == 'yes':
                    file_icon = '<img src="%s/img/%s" alt="%s"/>' \
                                   % (CFG_SITE_URL, icon(url), _("Download fulltext"))
                else:
                    file_icon = ''
                # first of all, see if it is public, private or dx.doi.org link
                pub_prv = bfo.field('984__a')  # EZProxy
                if (url.find("dx.doi.org") != -1) or (pub_prv.find("private") != -1):
                    # EZProxy: private fulltexts (and DOI links) go through the proxy
                    url_list.append('<a '+style+' href="' + ezproxy_url + escape(url)+'">'+ \
                                    file_icon +'</a> ')

                else:
                    url_list.append('<a '+style+' href="' + escape(url)+'">'+ \
                                    file_icon +'</a> ')

            out += separator.join(url_list) + additional_str

Then run:

echo "DELETE FROM bibfmt WHERE format='HB'" | /soft/cds-invenio/bin/dbexec
sudo -u apache bibreformat -c "YOUR_COLLECTION_NAME"
sudo -u apache bibsched

Now your URLs will point to ezproxy_url instead of directly to the fulltext, which means users (who can log in to EZProxy) will be able to access fulltexts from any IP.
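The link-selection rule from step 8 can be isolated and tried on its own. This is a sketch that runs outside BibFormat: `fulltext_href` is a hypothetical helper, `html.escape` stands in for the `escape` function used by the element, and the EZProxy URL is the one from my setup.

```python
from html import escape

# EZProxy login URL taken from step 8; replace it with your own.
EZPROXY_URL = "http://roble.unizar.es:9090/login?url="

def fulltext_href(url, pub_prv, ezproxy_url=EZPROXY_URL):
    """Return the href for a fulltext link, going through EZProxy when needed."""
    needs_proxy = "dx.doi.org" in url or "private" in pub_prv
    return (ezproxy_url if needs_proxy else "") + escape(url)

pdf = "http://zaguan.unizar.es/record/3277/files/ART--2009-009.pdf"
private_link = fulltext_href(pdf, "private")
public_link = fulltext_href(pdf, "public")

print(private_link)
print(public_link)
```

Only the private record gets the EZProxy prefix; the public one keeps its direct URL.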

You can see a working example in Zaguan repository.

Thanks a lot to all the CDS Support Team and especially to Samuele Kaplun.