Tag Archives: Websubmit

CDS Invenio: Get Record number (recid, sysno) from reference number

The sbmSUBMISSIONS table holds a lot of information about the actions performed by websubmit (new record creation, modifications, etc.). It stores the id of the user who performed the websubmit action, the action type (SBI, MBI, APP), the reference or report number (e.g. BOOK-2010-005), and the date, time and status of the action.

This is a great piece of information. But, surprisingly (at least to me), the recid (also referred to as sysno) is NOT stored in that table. I guess there is a (good?) reason for this, but I am not aware of it.

So, what is the quickest way to get the recid (sysno) from a report number? This is it:

>>> from invenio.websubmit_functions.Get_Recid import get_existing_records_for_reportnumber
>>> recid = get_existing_records_for_reportnumber('TAZ-2010-078')
>>> print recid
[4404]

CDS Invenio: query database to know a tag value from a record

What is the quickest way to read the value of a tag of a record using CDS Invenio functions? Imagine you want to know the value of the 8564_u tag of the record with sysno=4403.

Easy! First, run python. Then:

>>> from invenio.dbquery import run_sql
>>> from invenio.websubmit_functions.Create_Modify_Interface_TAZ import Create_Modify_Interface_getfieldval_fromDBrec
>>> value =  Create_Modify_Interface_getfieldval_fromDBrec('8564_u',4403)
>>> print value
http://aneto.unizar.es/TAZ/CPS/2010/4403/TAZ-2010-077.pdf

There is only one small snag: the database is not always up to date, so this information is not absolutely consistent. If the record has been modified recently, the value might be stale.

CDS-Invenio: Change SBI process – not referred records, restricted fulltext access

These past days I have been talking a lot about CDS-Invenio, the websubmit module and Apache’s .htaccess files (part one and two).

In this blog you can also find some posts that show how to grant access to fulltexts using CDS-Invenio (refer to part one and part two). In those posts a new websubmit function was added to the approval pipeline, and new roles and permissions were defined in webaccess so that fulltext access would be allowed only from a given iprange.

The goal

Well, now we have a different need: we want a non-refereed submit process (that is, records do not need approval) so that the record and its metadata can be read no matter what the client IP is. But (and here comes the fun part) we want the fulltext file to be accessible only by:
- the submitter
- some privileged users (a subset of ldap-authenticated users and some file-authenticated users). Refer to this post for further details.

We also want to be able to “edit” (modify) those records so that access to the fulltext is opened to everyone. This modification will be done, if needed, some time after the record is submitted.

Define new doctype, create form page and SBI functions

First of all, let's define a new doctype called TAZ to which the mods will be applied.

Now we will make a new submit form (refer to your CDS Invenio manual) and a new submit (SBI) pipeline. Here are the functions I’ve used:

  • Create_Recid
  • Report_Number_Generation
  • Make_Dummy_MARC_XML_Record
  • Move_Files_To_Storage_TAZ
  • Make_Record
  • Insert_Record
  • Print_Success
  • Mail_Submitter

If you are familiar with CDS-Invenio you will notice that only one new function is involved: Move_Files_To_Storage_TAZ. It is pretty similar to the default Move_Files_To_Storage function.

First of all, let's remember what the Move_Files_To_Storage function does:
When the record is created its metadata is stored in a running directory like /soft/cds-invenio/var/data/submit/storage/running/TAZ/1263304104_5564/. Let's take a look at the contents of that directory. Remember that Move_Files_To_Storage has not been executed yet.

[root@aneto]# ls -l /soft/cds-invenio/var/data/submit/storage/running/TAZ/1263304104_5564/
total 176
-rw-r--r-- 1 apache apache   15 Jan 12 14:48 access
-rw-r--r-- 1 apache apache    3 Jan 12 14:48 act
-rw-r--r-- 1 apache apache    1 Jan 12 14:48 curpage
-rw-r--r-- 1 apache apache    4 Jan 12 14:48 DEMOTHE_TITLE
-rw-r--r-- 1 apache apache    3 Jan 12 14:48 doctype
-rw-r--r-- 1 apache apache  679 Jan 12 14:48 dummy_marcxml_rec
drwxr-xr-x 4 apache apache 4096 Jan 12 14:48 files
-rw-r--r-- 1 apache apache  532 Jan 12 14:48 function_log
-rw-r--r-- 1 apache apache    7 Jan 12 14:48 indir
-rw-r--r-- 1 apache apache   15 Jan 12 14:48 lastuploadedfile
-rw-r--r-- 1 apache apache    2 Jan 12 14:48 ln
-rw-r--r-- 1 apache apache   25 Jan 12 14:48 mainmenu
-rw-r--r-- 1 apache apache    1 Jan 12 14:48 mode
-rw-r--r-- 1 apache apache   15 Jan 12 14:48 PFC_MEM
-rw-r--r-- 1 apache apache   16 Jan 12 14:48 PFC_MEM_RENAMED
-rw-r--r-- 1 apache apache  695 Jan 12 14:48 recmysql
-rw-r--r-- 1 apache apache  205 Jan 12 14:48 rename_cmd
-rw-r--r-- 1 apache apache    4 Jan 12 14:48 SN
-rw-r--r-- 1 apache apache    1 Jan 12 14:48 startPg
-rw-r--r-- 1 apache apache    1 Jan 12 14:48 step
-rw-r--r-- 1 apache apache   17 Jan 12 14:48 SuE
-rw-r--r-- 1 apache apache   12 Jan 12 14:48 TAZ_RN

It contains all the submit-form values, as well as the fulltext files attached to the record (in the /soft/cds-invenio/var/data/submit/storage/running/TAZ/1263304104_5564/files directory).

Move_Files_To_Storage moves files to a directory like /soft/cds-invenio/var/data/files/g0/409/TAZ-2010-009.pdf;1. How is this done?

Move_Files_To_Storage creates a BibRecDocs object (this class is defined in bibdocfile.py). A BibRecDocs object holds BibDoc objects. As a result of this (more precisely, as a result of the execution of _make_base_dir, defined in bibdocfile.py) a new directory is created to store the fulltexts. The path is built from the CFG_WEBSUBMIT_FILEDIR variable (defined in config.py) by appending a group (g0) and a docid (409).
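The path construction described above can be sketched roughly as follows. This is not the real _make_base_dir: the helper name and the group size are assumptions of mine, chosen only so that docid 409 lands in g0 as in the example path.

```python
import os

# Assumed values: the base directory matches the paths shown in this post;
# the group size is a guess at how docids are bucketed into g0, g1, ...
CFG_WEBSUBMIT_FILEDIR = "/soft/cds-invenio/var/data/files"
ASSUMED_GROUP_LIMIT = 5000

def sketch_base_dir(docid, filedir=CFG_WEBSUBMIT_FILEDIR,
                    group_limit=ASSUMED_GROUP_LIMIT):
    """Return <filedir>/g<group>/<docid> for a given docid (hypothetical helper)."""
    group = "g%d" % (int(docid) // group_limit)
    return os.path.join(filedir, group, str(docid))

print(sketch_base_dir(409))
```

The fulltext itself (TAZ-2010-009.pdf;1 in the example) would then live inside that directory.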

This BibRecDocs object has consequences for how URLs to fulltexts are built and managed (hard to explain in a few lines, so I’ll skip this).

Well, in my case I decided to change this function so that files would not be copied to that directory, but to a new one instead: /t31/TAZ/SN/.

SN is a variable that holds the recid (record id).

I also decided not to create a BibRecDocs object for my TAZ fulltexts, because I wanted the fulltext URLs (stored in the marcxml 856u tag) to be handled as if the fulltext lived outside the CDS Invenio system (an external URL, in CDS terms).

This allows me to bypass the Python handler on which Invenio relies (invenio.webinterface_layout, configured in /soft/cds-invenio/etc/apache/invenio-apache-vhost.conf).

My Move_Files_To_Storage_TAZ is as follows:

from invenio.bibdocfile import BibRecDocs, decompose_file, InvenioWebSubmitFileError
import os
import re
from invenio.websubmit_icon_creator import create_icon, InvenioWebSubmitIconCreatorError
from invenio.websubmit_config import InvenioWebSubmitFunctionWarning
from invenio.websubmit_functions.Shared_Functions import get_dictionary_from_string, \
     createRelatedFormats
from invenio.errorlib import register_exception
 
def Move_Files_to_Storage_TAZ(parameters, curdir, form, user_info=None):
    """
    The function moves files received from the standard submission's form through
    file input element(s).
    Websubmit_engine built the following file organization in the directory curdir/files
 
                  curdir/files
                        |
      _______________________________________________________________________________
            |                                   |                          |
      ./file input 1 element's name      ./file input 2 element's name    ....
         |                                     |
      test1.pdf                             test2.pdf
 
 
    There is only one instance of all possible extension(pdf, gz...) in each part
    otherwise we may encount problems when renaming files.
    +parameters['rename']: if given, all the files in curdir/files are renamed.
     parameters['rename'] is of the form: <PA>elemfilename[re]</PA>* where re is
     an regexp to select(using re.sub) what part of the elem file has
     to be selected.e.g <PA>file:TEST_FILE_RN</PA>
    +parameters['documenttype']: if given, other formats are created.
     It has 2 possible values: - if "picture" icon in gif format is created
                               - if "fulltext" ps, gz .... formats are created
    +parameters['paths_and_suffixes']: directories to look into and corresponding
    suffix to add to every file inside. It must have the same structure as a
     python dictionnary of the following form
     {'FrenchAbstract':'french', 'EnglishAbstract':''}
     The keys are the file input element name from the form <=> directories in curdir/files
     The values associated are the suffixes which will be added to all the files
     in e.g. curdir/files/FrenchAbstract
    +parameters['iconsize'] need only if "icon" is selected in parameters['documenttype']
    """
    global sysno
    paths_and_suffixes = parameters['paths_and_suffixes']
    rename = parameters['rename']
    documenttype = parameters['documenttype']
    iconsize = parameters['iconsize']
 
    ## Create an instance of BibRecDocs for the current recid(sysno)
# we do not want this anymore
#    bibrecdocs = BibRecDocs(sysno)
 
    paths_and_suffixes = get_dictionary_from_string(paths_and_suffixes)
    ## Go through all the directory specified in the keys
    ## of parameters['paths_and_suffixes']
    for path in paths_and_suffixes.keys():
        ## Check if there is a directory for the current path
        if os.path.exists("%s/files/%s" % (curdir, path)):
            ## Go through all the files in curdir/files/path
            for current_file in os.listdir("%s/files/%s" % (curdir, path)):
                ## retrieve filename and extension
                ## Edited by Teresa and Miguel: we copy the TAZ files to /t31 and skip everything else
                dummy, filename, extension = decompose_file(current_file)
                if extension and extension[0] != ".":
                    extension = '.' + extension
                if len(paths_and_suffixes[path]) != 0:
                    extension = "_%s%s" % (paths_and_suffixes[path], extension)
                ## Build the new file name if rename parameter has been given
                if rename:
                    filename = re.sub('<PA>(?P<content>[^<]*)</PA>', \
                                      get_pa_tag_content, \
                                      parameters['rename'])
                if rename or len(paths_and_suffixes[path]) != 0:
                    ## Rename the file
                    try:
                        # Write the log rename_cmd
                        fd = open("%s/rename_cmd" % curdir, "a+")
                        fd.write("%s/files/%s/%s" % (curdir, path, current_file) + " to " +\
                                  "%s/files/%s/%s%s" % (curdir, path, filename, extension) + "\n\n")
                        ## Rename
                        os.rename("%s/files/%s/%s" % (curdir, path, current_file), \
                                  "%s/files/%s/%s%s" % (curdir, path, filename, extension))
                        fd.close()
                        ## Save the new name in a text file in curdir so that
                        ## the new filename can be used by templates to created the recmysl
                        fd = open("%s/%s_RENAMED" % (curdir, path), "w")
                        fd.write("%s%s" % (filename, extension))
                        fd.close()
 
                        fd = open("%s/SN" % (curdir) )
                        numeroreg = fd.read()
                        fd.close()
 
                        fd = open("%s/CENTRO" % (curdir) )
                        centro = fd.read()
                        fd.close()
 
                        fd = open("%s/DEMOTHE_DATE" % (curdir) )
                        year = fd.read()
                        fd.close()
 
                        ## Up to here everything is unchanged. Now we copy the file to /t31, which is what we want
                        destino = "/t31/TAZ/%s/%s/%s" % (centro, year, numeroreg)
                        os.system("mkdir /t31/TAZ/%s/" % (centro))
                        os.system("mkdir /t31/TAZ/%s/%s/" % (centro, year))
                        os.system("mkdir /t31/TAZ/%s/%s/%s/" % (centro, year, numeroreg))
                        if path == "PFC_MEM":
                            os.system("cp %s/files/%s/%s%s %s/%s%s" % (curdir, path, filename, extension, destino, filename, extension) )
                        if path == "PFC_ANE":
                            os.system("cp %s/files/%s/%s%s %s/%s_ANE%s" % (curdir, path, filename, extension, destino, filename, extension) )
                        ## The destination directory now exists and we know its path,
                        ## so we call the function that creates the .htaccess in it
                        if path == "PFC_MEM":
                            create_htaccess(curdir,destino)
 
                    except OSError, err:
                        msg = "Cannot rename the file.[%s]"
                        msg %= str(err)
                        raise InvenioWebSubmitFunctionWarning(msg) 
    return ""
 
def get_pa_tag_content(pa_content):
    """Get content for <PA>XXX</PA>.
    @param pa_content: MatchObject for <PA>(.*)</PA>.
    return: the content of the file possibly filtered by an regular expression
    if pa_content=file[re]:a_file => first line of file a_file matching re
    if pa_content=file*p[re]:a_file => all lines of file a_file, matching re,
    separated by - (dash) char.
    """
    pa_content = pa_content.groupdict()['content']
    sep = '-'
    out = ''
    if pa_content.startswith('file'):
        filename = ""
        regexp = ""
        if "[" in pa_content:
            split_index_start = pa_content.find("[")
            split_index_stop =  pa_content.rfind("]")
            regexp = pa_content[split_index_start+1:split_index_stop]
            filename = pa_content[split_index_stop+2:]## ]:
        else :
            filename = pa_content.split(":")[1]
        if os.path.exists(os.path.join(curdir, filename)):
            fp = open(os.path.join(curdir, filename), 'r')
            if pa_content[:5] == "file*":
                out = sep.join(map(lambda x: re.split(regexp, x.strip())[-1], fp.readlines()))
            else:
                out = re.split(regexp, fp.readline().strip())[-1]
            fp.close()
    return out

The relevant part is the block near the end of the try section: it reads SN (plus CENTRO and DEMOTHE_DATE) and copies the files to the desired location.
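As a side note, the copy step could also be written without shelling out to mkdir and cp. The following is only a sketch of that idea, not the function I actually run: copy_fulltext and its arguments are hypothetical names, and the demo works in a temporary directory rather than under /t31/TAZ.

```python
import os
import shutil
import tempfile

def copy_fulltext(curdir, path, stored_name, dest_root, centro, year, recid):
    """Copy curdir/files/<path>/<stored_name> into dest_root/<centro>/<year>/<recid>/."""
    # Values read from curdir files may end with a newline, so strip them.
    destino = os.path.join(dest_root, centro.strip(), year.strip(), recid.strip())
    if not os.path.isdir(destino):
        os.makedirs(destino)  # creates all missing levels at once
    target = os.path.join(destino, stored_name)
    shutil.copy(os.path.join(curdir, "files", path, stored_name), target)
    return target

# Demo in a throwaway directory instead of the real /t31/TAZ tree.
tmp = tempfile.mkdtemp()
curdir = os.path.join(tmp, "curdir")
os.makedirs(os.path.join(curdir, "files", "PFC_MEM"))
open(os.path.join(curdir, "files", "PFC_MEM", "TAZ-2010-077.pdf"), "w").close()
target = copy_fulltext(curdir, "PFC_MEM", "TAZ-2010-077.pdf",
                       os.path.join(tmp, "t31", "TAZ"), "CPS\n", "2010", "4403\n")
print(target)
```

Using os.makedirs and shutil.copy avoids problems with spaces or odd characters in file names, which os.system would pass to the shell unquoted.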

The APACHE configuration

In this post you will find the explanations to the following configuration:

/soft/cds-invenio/etc/apache/invenio-apache-vhost.conf:

AddDefaultCharset UTF-8
ServerSignature Off
ServerTokens Prod
NameVirtualHost 155.210.5.41:80
<Files *.pyc>
   deny from all
</Files>
<Files *~>
   deny from all
</Files>
<VirtualHost 155.210.5.41:80>
        ServerName aneto.unizar.es
        ServerAdmin teresa@unizar.es
        DocumentRoot /soft/cds-invenio/var/www
        ErrorLog "/soft/cds-invenio/var/log/apache/ldap-error_log"
        CustomLog "/soft/cds-invenio/var/log/apache/ldap-access_log" common
        LogLevel debug
        <Directory /soft/cds-invenio/var/www>
           Options FollowSymLinks MultiViews
        </Directory>
        Alias /TAZ/ "/t31/TAZ/"
        <Directory /t31/TAZ/>
           AllowOverride AuthConfig
           AuthType Basic
           AuthBasicProvider ldap file
           AuthzLDAPAuthoritative off
           AuthName "Aneto accediendo a PDF sin aprobar"
           AuthLDAPURL "ldap://ldapmail.unizar.es/dc=unizar,dc=es?uid?sub?(objectClass=person)"
        </Directory>
        DirectoryIndex index.en.html index.html index.php
        <LocationMatch "^(/+$|/index|/collection|/record|/author|/search|/browse|/youraccount|/youralerts|/yourbaskets|/yourmessages|/yourgroups|/submit|/getfile|/comments|/error|/oai2d|/rss|/help|/journal|/openurl|/stats|/ourcode)">
           SetHandler python-program
           PythonHandler invenio.webinterface_layout
           PythonDebug On
        </LocationMatch>
        <Directory /soft/cds-invenio/var/www>
           AddHandler python-program .py .cgi
           PythonHandler mod_python.publisher
           PythonDebug On
        </Directory>
</VirtualHost>

Just as an example, here is the .htaccess for recid 3481:

[root@aneto cdsadmin]# more /t31/TAZ/3481/.htaccess
# Generate your .htpasswd files using the following online service
# http://www.kxs.net/support/htaccess_pw.html
 
AuthUserFile /soft/cds-invenio/var/.htpasswd
Require user cdsadmin miguelm

The cdsadmin user will be present in ALL the .htaccess files. This user is authenticated using AuthUserFile.

The second user, miguelm (authenticated through the LDAP service), is the user who originally submitted the record (this name is part of the submitter email, usually referred to as SuE).

With this config, cdsadmin is allowed to download every fulltext and the submitter is allowed to download his own, but other users are not. What is more, if the MBI (modification) action is configured properly, these two users will be the only ones able to modify the record; refer to the Is_Original_Submitter function.

The .htaccess-creation function

You will also need a new function that is responsible for creating the proper .htaccess file for each record.

I will explain this function over the next few days; I’m still at the development stage ;)

def create_htaccess(curdir,destino):
    """Create the .htaccess for privileged users (cdsadmin, etc.),
    who are validated against the general .htpasswd file, and
    for the submitter, who is validated against LDAP.
    """
    # Open curdir/SuE and read its value
    fd = open("%s/SuE" % (curdir))
    SuE = fd.read()
    fd.close()
 
    # Here we have something like SuE = miguelm@unizar.es
    # Split on the @
    user = SuE.rsplit('@',1)
    usuario = user[0]
 
    # We now have 'usuario'. Build the .htaccess content
    htaccess = """AuthUserFile /soft/cds-invenio/var/.htpasswd
               Require user cdsadmin %s""" % (usuario)
 
    # Write the .htaccess where it belongs, that is, in 'destino'
    fd = open("%s/.htaccess" % (destino), "w")
    fd.write("%s" % (htaccess))
    fd.close()
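To make the result easier to check, the text of the .htaccess can be built by a small pure function before it is written to disk. This is a sketch only; build_htaccess and its parameters are names I made up for the example.

```python
def build_htaccess(submitter_email, privileged_users=("cdsadmin",),
                   passwd_file="/soft/cds-invenio/var/.htpasswd"):
    """Return the .htaccess text for one record (hypothetical helper)."""
    # The login is the part of the submitter email (SuE) before the @.
    usuario = submitter_email.strip().rsplit("@", 1)[0]
    users = list(privileged_users) + [usuario]
    return "AuthUserFile %s\nRequire user %s\n" % (passwd_file, " ".join(users))

print(build_htaccess("miguelm@unizar.es\n"))
```

For the submitter miguelm@unizar.es this yields the same two directives as the .htaccess shown for recid 3481.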

CDS-Invenio: Understanding WEBSUBMIT

A few posts back I talked about restricting access to fulltexts to an iprange. In those articles I gave some tips about how websubmit works. Now I want to make another mod to my CDS Invenio repository, so I needed a deeper understanding of the websubmit workflow.

Before we begin this journey through websubmit, it is a good idea to define some terminology I’ll be using:

Terminology

Recid: the registry number of a record. For instance, for the record http://zaguan.unizar.es/record/2000 the registry number is 2000. This number is stored in the marcxml 001 tag.

Report_Number: another way to identify a record. For the example above, this number is INPRO-2009-038 and it is stored in the marcxml 037 tag. It is also called reference or rn.

access: a randomly generated (I'm not so sure it is totally random) number which is created every time you begin an action. You can see it in the URL when, for instance, you submit a record. It looks like 1260870578_1753 and it is NOT stored in marcxml, but it is in the database (for instance in the id field of the sbmSUBMISSIONS table).

act: the action you are executing. This parameter can be seen in your browser’s URL. Invenio comes with several pre-made actions, like:

SBI for submit new record
APP for approve submitted record
MBI for modify existing (that is, already approved!!) record
SRV for changes in attached fulltext files.

doctype: the document type to which act is applied. This can be seen in the browser’s URL too. doctype refers to the string in brackets that you can see in your websubmit admin menu. For instance, [DEMOBOO].

indir: each action is directly connected to a system directory. For example, for the MBI action (modify existing record) the working directory is modify. This is the value of the indir parameter. Further details can be found in the following section.

Actions

You can see your defined actions in
http://www.yourrepositoryname.com/admin/websubmit/websubmitadmin.py/actionlist

Each action is connected to a working directory. For instance, SBI is attached to the running directory, which is located under $PATH_TO_CDS/var/data/submit/storage/running/

By default, CDS Invenio comes with an example of a refereed doctype, which executes the following functions. Each function is listed with its step and score.

  Function                     Step  Score
  Create_Recid                 1     10
  Report_Number_Generation     1     20
  Make_Dummy_MARC_XML_Record   1     30
  Move_Files_to_Storage        1     40
  Mail_Submitter               1     50
  Update_Approval_DB           1     60
  Send_Approval_Request        1     70
  Print_Success                1     80
  Move_to_Pending              1     90
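My reading of this table is that the functions of an action run ordered by step first and by score second. That ordering rule is an assumption drawn from the table, not something I have checked in the Invenio source; under it, the execution order can be sketched as:

```python
# (function, step, score) rows taken from the table above, deliberately shuffled
pipeline = [
    ("Print_Success", 1, 80),
    ("Create_Recid", 1, 10),
    ("Move_to_Pending", 1, 90),
    ("Make_Dummy_MARC_XML_Record", 1, 30),
    ("Report_Number_Generation", 1, 20),
    ("Send_Approval_Request", 1, 70),
    ("Move_Files_to_Storage", 1, 40),
    ("Update_Approval_DB", 1, 60),
    ("Mail_Submitter", 1, 50),
]

def execution_order(rows):
    """Sort by step, then by score, and return only the function names."""
    return [name for name, step, score in sorted(rows, key=lambda r: (r[1], r[2]))]

for name in execution_order(pipeline):
    print(name)
```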

The SUBMIT NEW RECORD (SBI) workflow

Suppose we enter our repository, log in, click “submit” and select a refereed doctype (in my case doctype=PFC). Then a form appears. What is really happening?

Well, websubmit_engine.py and websubmit_webinterface.py are working in the shadows.

Here is what really happens:
websubmit_engine creates a new access number.

websubmit_webinterface creates a new directory using several parameters. Let's see an example:

$BASE/$indir/$doctype/$access
$BASE=$PATH_TO_CDS/var/data/submit/storage
$indir=running
$doctype=PFC
$access=1260870578_1753
 
So, the system creates $PATH_TO_CDS/var/data/submit/storage/running/PFC/1260870578_1753
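The same composition, written as a tiny sketch (build_curdir is a hypothetical helper, and /soft/cds-invenio stands in for $PATH_TO_CDS as on my server):

```python
import os

def build_curdir(storage_dir, indir, doctype, access):
    """Compose $BASE/$indir/$doctype/$access."""
    return os.path.join(storage_dir, indir, doctype, access)

print(build_curdir("/soft/cds-invenio/var/data/submit/storage",
                   "running", "PFC", "1260870578_1753"))
```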

Then it copies into that directory (from now on called curdir) all of the parameters shown in your browser’s URL (indir, doctype, access, blablabla).

[Image: cds invenio websubmit]

It is important to note that none of the PFC-SBI functions has been executed yet!

Now the user fills in the submit form and, when done, pushes the “submit” button. This is the moment at which the PFC-SBI functions begin to run.

Let's see what happens before Move_to_Pending is executed:

All of the form’s fields are stored in curdir. If your form has a field called PFC_AUTHOR with value ‘Miguel Martín’, a file called PFC_AUTHOR is created in curdir and it contains the string ‘Miguel Martín’. Since Make_Dummy_MARC_XML_Record has also been executed, a file called dummy_marcxml_rec is created too (the Make_Dummy_MARC_XML_Record function takes the $doctype.tpl and $doctypeCREATE.tpl files into account and builds dummy_marcxml_rec from that information).
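The "one file per form field" idea is easy to picture with a few lines of Python. This is just an illustration of the layout, not what websubmit_engine really does internally; dump_form_fields is a made-up name and the demo writes into a temporary directory.

```python
import os
import tempfile

def dump_form_fields(curdir, form):
    """Write each form field to a file named after the field (illustrative only)."""
    for name, value in form.items():
        with open(os.path.join(curdir, name), "w", encoding="utf-8") as fd:
            fd.write(value)

curdir = tempfile.mkdtemp()
dump_form_fields(curdir, {"PFC_AUTHOR": "Miguel Martín", "doctype": "PFC"})

with open(os.path.join(curdir, "PFC_AUTHOR"), encoding="utf-8") as fd:
    author = fd.read()
print(author)
```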

Once this is done, a “Congratulations! blablabla” message appears to the user as a result of the execution of Print_Success. It seems that the whole submission process is over, but it really isn’t.

Something ‘unexpected’ happens:
The Move_to_Pending function moves your curdir/* files to the /var/data/submit/storage/pending/doctype/Report_Number directory! That is, in my case:
/var/data/submit/storage/pending/PFC/PFC-2009-005

Here we can take a detailed look at Move_To_Pending.py:

import os
 
from invenio.config import CFG_WEBSUBMIT_STORAGEDIR
from invenio.websubmit_config import InvenioWebSubmitFunctionError
 
def Move_to_Pending(parameters, curdir, form, user_info=None):
    global rn
    doctype = form['doctype']
    PENDIR = "%s/pending/%s" % (CFG_WEBSUBMIT_STORAGEDIR,doctype)
    if not os.path.exists(PENDIR):
        try:
            os.makedirs(PENDIR)
        except:
            raise InvenioWebSubmitFunctionError("Cannot create pending directory %s" % PENDIR)
    # Moves the files to the pending directory
    rn = rn.replace("/","-")
    namedir = rn
    FINALDIR = "%s/%s" % (PENDIR,namedir)
    os.rename(curdir,FINALDIR)
    return ""

As shown above, this function creates the pending directory if it does not exist; in my example, /var/data/submit/storage/pending/PFC/.

The rn variable holds the Report Number, that is, PFC-2009-005. So the function moves all the contents of curdir to the /var/data/submit/storage/pending/PFC/PFC-2009-005/ directory.
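To try the same logic outside Invenio, here is a standalone sketch that mimics Move_to_Pending in a temporary directory. The function name and arguments are mine; the real function takes its values from the form, from CFG_WEBSUBMIT_STORAGEDIR and from the global rn.

```python
import os
import tempfile

def move_to_pending(storage_dir, doctype, rn, curdir):
    """Rename curdir to <storage_dir>/pending/<doctype>/<rn> (standalone sketch)."""
    pendir = os.path.join(storage_dir, "pending", doctype)
    if not os.path.isdir(pendir):
        os.makedirs(pendir)
    # A report number may contain slashes, which are not valid in a directory name.
    finaldir = os.path.join(pendir, rn.replace("/", "-"))
    os.rename(curdir, finaldir)
    return finaldir

storage = tempfile.mkdtemp()
curdir = os.path.join(storage, "running", "PFC", "1260870578_1753")
os.makedirs(curdir)
finaldir = move_to_pending(storage, "PFC", "PFC-2009-005", curdir)
print(finaldir)
```

Note that os.rename moves the whole directory in one go, so after the call the old curdir no longer exists.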

[Image: cds invenio websubmit]

*** Edit: let's look at part of the MBI (modify record AFTER it is approved) pipeline.

APP (approve record)

When a record has been submitted and then approved, some information is stored in /var/data/submit/storage/done/PFC/$Report_Number and a new entry is created in the MySQL database, more precisely in the sbmSUBMISSIONS table.

This table stores a lot of information. Let's see what is stored for one record, selected by its Report_Number:

SELECT * FROM sbmSUBMISSIONS WHERE reference = 'PFC-2009-018';

[Image: submission database]

Let's go through the stored fields:

email: who performed the action? (It was me!)
doctype: the doctype of the document on which the action was performed (PFC).
action: the action performed on that doctype (SBI, MBI, APP, SRV).
status: the status of the task, in this case finished (it could also be pending).
id: the access (number) of the task: 1245755804_15279
reference: the report number or rn.
cd: the date of cd, that is, the creation date of /soft/cds-invenio/var/data/submit/storage/pending/PFC/PFC-2009-018/mainmenu
md: the date of md, that is, the creation date of the /soft/cds-invenio/var/data/submit/storage/pending/PFC/PFC-2009-018 directory

The fulltext document (in pdf or whatever format you use) is stored in /soft/cds-invenio/var/data/submit/storage/pending/PFC/PFC-2009-018/files/PFC_MEM/PFC-2009-018.pdf but a compressed version is also stored in /soft/cds-invenio/var/data/files/g0/405/PFC-2009-018.pdf;1

MBI (modify AFTER approval)

The functions that are run are listed below:

    Get_Report_Number
    Get_Recid
    Is_Original_Submitter
    Create_Modify_Interface
    Get_Report_Number
    Get_Recid
    Make_Modify_Record
    Insert_Modify_Record
    Print_Success_MBI
    Send_Modify_Mail
    Move_to_Done

Wow, a lot of functions! I will comment only on the ones that are most important and hardest to understand.

I’ll begin with Create_Modify_Interface:

This function reads the stored metadata of a record and creates an interface to modify the fields that the user selects. Where does Create_Modify_Interface read the stored values from? From the database? From the system directories? Well, it depends: the function goes into curdir and looks for a file named Create_Modify_Interface_Done.

If it exists, the record metadata is loaded from FILES (the curdir system directory) using Create_Modify_Interface_getfieldval_fromfile.

If it does not exist, the record metadata is loaded from the DATABASE, using Create_Modify_Interface_getfieldval_fromDBrec.
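The decision itself boils down to a file-existence check. The sketch below only mirrors that check; pick_metadata_source is a name I invented, and the marker file name is the one quoted in this post.

```python
import os
import tempfile

MARKER = "Create_Modify_Interface_Done"  # marker file name as quoted above

def pick_metadata_source(curdir):
    """Return 'file' if the marker exists in curdir, else 'database'."""
    if os.path.exists(os.path.join(curdir, MARKER)):
        return "file"      # values would come from ..._getfieldval_fromfile
    return "database"      # values would come from ..._getfieldval_fromDBrec

curdir = tempfile.mkdtemp()
first_visit = pick_metadata_source(curdir)
open(os.path.join(curdir, MARKER), "w").close()
later_visit = pick_metadata_source(curdir)
print(first_visit, later_visit)
```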

**** WILL CONTINUE WRITING THIS POST IN FUTURE DAYS ****

CDSInvenio: restrict access to some fulltexts to an iprange (II)

My last article about restricting access to some fulltexts to an iprange had a few gaps and mistakes.

It worked perfectly if the type of fulltext (always speaking in terms of ‘public’ or ‘private’) was only set in the SBI process (that is, set by the submitter).

A few more steps are needed if you want the REFEREE to be able to change this type.

First of all I will tell you about some of the restrictions CDS Invenio has in its APP and SBI pipelines. These tips will be very helpful for understanding the quirks of the websubmit module.

websubmit quirks: a quick guide

1. Get_Report_Number must ALWAYS be called *before* Get_Recid.

2. Get_Recid must be called before any other function that uses the sysno variable.

3. If you are wondering what the steps in APP mean, read the following lines:

  • Step one includes the functions that prepare the record and ends with a CaseEDS (which is something like an if-then-else).
  • Step two: functions to be executed if the record is approved.
  • Step three: functions to be executed if the record is rejected.

4. More tips about APP: Get_Recid relies on Move_from_Pending having already been run (otherwise, there’s no way for Get_Recid to discover the recid of a record that has never really been submitted).
If you want to know why, just read the following lines:

If you put Get_Recid before Move_from_Pending, Get_Recid will complain (it’ll spit out something like "the record could not be found in our database") because it can’t find the recid (which should be in a file called SN in the curdir directory, which in APP is only populated after a call to Move_from_Pending)…

5. Once Get_Recid is called (no matter in which step) the sysno variable is kept during the whole APP pipeline.
Well, just a small comment about this: the jumping back and forth between steps is implemented through client browser redirection in Javascript and not server side. Hence, each time the step changes (e.g. after the execution of CaseEDS) a new request is made to the server, which is more or less like running a new Python process, i.e. the global variables are lost. So (gee, sorry for telling you these things one at a time, but I’m still a bit new to WebSubmit) you should call Get_Report_Number and Get_Recid in every step if the other WebSubmit functions expect to find the report number and the recid/sysno as global variables…
Please note that in APP you should NOT call Get_Recid in step one (before Move_from_Pending). Refer to (4) for a deeper explanation.

6. Move_to_Done / Move_to_Pending must be called at the end of the process, no matter whether it is APP, SBI or another action. Move_to_Pending is meant to be used in e.g. the SBI action of a refereed submission, since the submitted record is moved to pending status for later approval/rejection. For a normal submission you would use Move_to_Done at the end of any action.
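The ordering rules in this list can be turned into a small sanity check for the functions of one step. This is only my own sketch of the rules as written here; check_step does not exist in Invenio, and the set of functions that need sysno is an assumption limited to Fulltext_Status.

```python
def check_step(functions, sysno_users=("Fulltext_Status",)):
    """Return a list of ordering problems for the functions of a single step."""
    problems = []
    pos = dict((name, i) for i, name in enumerate(functions))
    recid = pos.get("Get_Recid")

    if recid is not None:
        # Rule 1: Get_Report_Number always before Get_Recid
        if pos.get("Get_Report_Number", len(functions)) > recid:
            problems.append("Get_Report_Number must come before Get_Recid")
        # Rule 4: if Move_from_Pending is in this step, it must come first
        if pos.get("Move_from_Pending", -1) > recid:
            problems.append("Move_from_Pending must come before Get_Recid")

    # Rule 2: anything that uses sysno must run after Get_Recid
    for name in sysno_users:
        if name in pos and (recid is None or pos[name] < recid):
            problems.append("%s needs Get_Recid before it" % name)
    return problems

good = ["Get_Report_Number", "Move_from_Pending", "Get_Recid", "Fulltext_Status"]
bad = ["Get_Recid", "Get_Report_Number", "Move_from_Pending"]
print(check_step(good))
print(check_step(bad))
```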

Now, lets go for it

The idea:

To solve this I would propose… that you invent another variable (e.g. REFEREE_PRV_PUB) that will be filled in by the referee in the APP action, with three possible values: public, private and "" (i.e. nothing). Then you might extend your Fulltext_Status function to read this REFEREE_PRV_PUB variable first (from the filesystem, in curdir) and, if it is empty, to read PRV_PUB as filled in by the submitter.

So the final configuration might be that you call Fulltext_Status only in steps 2 and 3 of APP, after Get_Recid, which comes after Move_from_Pending, which comes after Get_Report_Number.

Easy, isn’t it? :-P

Changes in Fulltext_Status.py:

My code now looks like this (changes are outlined):

import os
import re
from invenio.errorlib import register_exception
from invenio.bibdocfile import BibRecDocs, decompose_file, InvenioWebSubmitFileError
 
def Fulltext_Status(parameters, curdir, form, user_info=None):
    """ Reads the form and sets status to prv, if submitter marks
        the record as private
    """
 
    global doctype,access,act,dir
    t=""
    bibrecdocs = BibRecDocs(sysno)
 
    # --- changed: begin ---
    # Due to the constraints in APP pipeline two variables
    # (REFEREE_PRV_PUB and PRV_PUB) must be used
    # to avoid the overwriting of the variables
    # caused by Move_From_Pending (in the APP pipeline)
    # --------------------------------------------------------
    # Get_Recid relies on Move_From_Pending to read sysno
    # Move_From_Pending must be called AFTER Get_Report_Number
    # if Get_Recid is called before Move_From_Pending (in APP) the system will spit an
    # error message like "That record could not be found in our database"
    # --- changed: end ---
 
    prv_pub = ""
    # --- changed: begin ---
    if form.has_key("REFEREE_PRV_PUB"):
       prv_pub = form['REFEREE_PRV_PUB']
 
    elif os.path.exists(os.path.join(curdir, 'REFEREE_PRV_PUB')):
       prv_pub = open(os.path.join(curdir, 'REFEREE_PRV_PUB')).read()
    # --- changed: end ---
 
    # if REFEREE_PRV_PUB is set and not equal to "" its value must be the used, instead
    # of the one used in PRV_PUB!!!
 
    elif form.has_key("PRV_PUB"):
       prv_pub = form['PRV_PUB']
 
    elif os.path.exists(os.path.join(curdir, 'PRV_PUB')):
       prv_pub = open(os.path.join(curdir, 'PRV_PUB')).read()
 
    else:
       prv_pub = ""
 
    if prv_pub  == 'private':
       # then status must be set to prv
       bibdocs = bibrecdocs.list_bibdocs()
       for bibdoc in bibdocs:
           bibdoc.set_status('prv')
 
    elif prv_pub == 'public':
      # then status must be set to ""
       bibdocs = bibrecdocs.list_bibdocs()
       for bibdoc in bibdocs:
           bibdoc.set_status('')
 
    return ""

Changes in ART.tpl

It should have two lines like:

PRV_PUB---<:PRV_PUB:>
REFEREE_PRV_PUB---<:REFEREE_PRV_PUB:>

Changes in ARTcreate.tpl

We’ll just modify the lines in charge of setting the 984__a field to ‘private’ or ‘public’.
If REFEREE_PRV_PUB has a value (not equal to “”) this value must override the one in PRV_PUB (the referee has the last word on the article’s type). This is done as follows:

984a::IFDEFP(REFEREE_PRV_PUB,,0)---<:REFEREE_PRV_PUB::REFEREE_PRV_PUB:>
984a::IFDEFP(REFEREE_PRV_PUB,,1)---<:PRV_PUB::PRV_PUB:>
END::DEFP()---
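If the template syntax looks cryptic, the rule it is meant to express ("the referee has the last word") fits in a couple of lines of Python. This is just a sketch of the rule as stated above, not of how the template engine evaluates IFDEFP; `effective_visibility` is a name I made up for the illustration.

```python
def effective_visibility(referee_choice, submitter_choice):
    """Pick the value that should end up in the 984 field.

    The referee's choice wins whenever it is non-empty; otherwise the
    submitter's choice is used.
    """
    referee_choice = (referee_choice or "").strip()
    submitter_choice = (submitter_choice or "").strip()
    return referee_choice or submitter_choice


# The referee left the field empty: the submitter's value is kept.
print(effective_visibility("", "private"))
# The referee made a choice: it overrides the submitter's value.
print(effective_visibility("public", "private"))
```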

And now, the SBI and APP pipelines

There is no need for Fulltext_Status to be called in SBI. So, delete it from there ;)
In APP, put Fulltext_Status ONLY in steps two and three. It should end up looking something like this:
[Image: new APP process]

CDSInvenio: restrict access to some fulltexts to an iprange

[NOTE: Only available in English, my apologies to Spanish speakers]

[Edit note: if you want to be able to change the type of fulltext (in terms of public/private) from the APP(roval) pipeline you should check the second part of this article]

Introduction

One of the most common issues when publishing scientific output (for instance, published articles) in an OAI repository is that fulltexts are usually under restrictive licensing. In other words, journals are not very happy with the idea of their fulltexts being public.

But this concept is totally against my idea of OAI repositories, in which fulltexts are supposed to be public and accessible to everyone. Most universities are engaged in a debate about how to solve this issue. Meanwhile, we IT people must provide a solution for the problem.

In our case the solution is simple: restrict access to copyrighted fulltexts to University staff only (students, professors, researchers…). Now you might be wondering: how do you do this with CDSInvenio?

What we want to achieve is this: in the submission form (and maybe in the approval form too) we show a select box which asks the submitter whether the fulltext is public or private. If the document is marked as private, the associated fulltext will only be available from a certain IP range (in my case I slightly modified the functionality using EZProxy so that our users can access the contents from home).

Restrict access to some fulltexts using CDSInvenio: step by step

1. I’ve defined the new Function

called Fulltext_Status.py.
Tip: make sure that the function name is equal to the file’s name without the trailing .py

1.5 I’ve modified the template associated to submission of ART (article) files

The associated files (in my case) are:

  • ART.tpl
  • ARTcreate.tpl

In ART.tpl I’ve added the following line:

PRV_PUB---<:PRV_PUB:>

In ARTcreate.tpl I modified the line associated to generation of 8564_u MARC tag like:

8564u::IFDEFP(DEMOART_FILE_RENAMED,,0)---<datafield tag="856" ind1="4" ind2=" "><subfield code="u">http://zaguan.unizar.es/record/<:SN::SN:>/files/<:DEMOART_FILE_RENAMED::DEMOART_FILE_RENAMED:></subfield><subfield code="z">Fulltext</subfield></datafield>

which basically means: write the URL of the file in the 8564_u tag if there is an attached file.

and also added a 984__a tag (9xx tags are used for administration purposes) which shows if the record is considered as private or public:

984a::DEFP---<:PRV_PUB::PRV_PUB:>

2. I’ve added the new function to the SBI process

Tip: the position of the function is quite important. It has to come *AFTER* the Move_Files_to_Storage call, because it is Move_Files_to_Storage that allocates the bibdocs and associates them with the record.

[Image: SBI process]

3. And to the APP process.

Tip 1: the position of the function in the approval process is also important. It has to be just *BEFORE* the Move_to_Done function because this should be the last function ever to be called (since it packs up everything and archives it).

Tip 2: Moreover, in the APP action, due to an architectural limitation of WebSubmit, functions executed in any step other than the first one will not have the form dictionary (Will improve the documentation on this).

[Image: APP process cds invenio]

In order to be able to read the parameter PRV_PUB in your function you should instead try to use the filesystem as in:

if form.has_key("PRV_PUB"):
    prv_pub = form['PRV_PUB']
elif os.path.exists(os.path.join(curdir, 'PRV_PUB')):
    prv_pub = open(os.path.join(curdir, 'PRV_PUB')).read()
else:
    prv_pub = ""

This way, if the form element is not there, you fall back on the filesystem where, if everything went correctly, the form element should have been stored in a file just before entering step 1.
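The same form-then-filesystem fallback is handy for any submission field, so it can be wrapped in a small helper. The sketch below is meant to run on its own with a current Python, outside Invenio: `read_field` is a hypothetical name, the form is a plain dictionary, and a temporary directory stands in for curdir. Stripping the trailing newline from the file content is my own assumption about how the file is written.

```python
import os
import tempfile


def read_field(form, curdir, name, default=""):
    """Return a submission field from the form, else from a file in curdir."""
    if name in form:
        return form[name]
    path = os.path.join(curdir, name)
    if os.path.exists(path):
        with open(path) as handle:
            # Assumption: the stored value may carry a trailing newline.
            return handle.read().strip()
    return default


# Simulate a step other than the first one: the form is empty,
# but the value was saved as a file in curdir earlier on.
curdir = tempfile.mkdtemp()
with open(os.path.join(curdir, "PRV_PUB"), "w") as handle:
    handle.write("private\n")

print(read_field({}, curdir, "PRV_PUB"))                     # value from the file
print(read_field({"PRV_PUB": "public"}, curdir, "PRV_PUB"))  # the form wins
print(repr(read_field({}, curdir, "REFEREE_PRV_PUB")))       # nothing stored
```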

4. Then I have created a new element description

called PRV_PUB
[Image: SBI process]
This is just the important part:

<select name="PRV_PUB">
   <option>Seleccione el carácter de su publicación:</option>
   <option value="private">Private</option>
   <option value="public">Public</option>
</select>

5. I’ve added this new element to the SBI and APP page

The image below just shows the SBI page (the process for adding it to the APP page is pretty similar)

[Image: SBI form page cds invenio]

6. Next I have created the role IP_UZ

Just go to the $CFG_SITE_URL/admin/webaccess/webaccessadmin.py URL and add a new role with this definition:

   allow email "miguelm@unizar.es"
   allow remote_ip "155.210."
   deny all
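To get a feeling for what such a definition does, here is a toy evaluator. Please take it as an illustration only: it is not WebAccess code, and it rests on two assumptions of mine, namely that rules are checked top to bottom with the first match deciding, and that the remote_ip value is compared as a plain string prefix. The names `RULES` and `is_allowed` are hypothetical.

```python
# Illustrative only: rules are (verdict, field, pattern) tuples checked in order.
RULES = [
    ("allow", "email", "miguelm@unizar.es"),
    ("allow", "remote_ip", "155.210."),
    ("deny", "all", None),
]


def is_allowed(user, rules=RULES):
    """Return True if the first matching rule says 'allow'."""
    for verdict, field, pattern in rules:
        if field == "all":
            matched = True
        elif field == "remote_ip":
            # Assumption: simple prefix comparison on the dotted address.
            matched = user.get("remote_ip", "").startswith(pattern)
        else:
            matched = user.get(field) == pattern
        if matched:
            return verdict == "allow"
    return False  # no rule matched at all


print(is_allowed({"email": "guest", "remote_ip": "155.210.1.2"}))
print(is_allowed({"email": "miguelm@unizar.es", "remote_ip": "10.0.0.1"}))
print(is_allowed({"email": "guest", "remote_ip": "80.1.2.3"}))
```

With these rules the first two visitors get in (one by IP, one by email) and the third one hits the final deny.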

7. Then I have connected this new role

… to the viewrestrdoc action with status=prv (prv is the status code I used in the Fulltext_Status function, refer to section 1).
[Image: authorization details cds invenio]

Then I submit a new element, approve it and check if I can see the fulltext pdf from my computer (155.210.XX.YY). Great, I can.

Tip: You can check the status of recently submitted files using:

/soft/cds-invenio/bin/bibdocfile --get-info --recid 3277

You should see something like this:

3277::::total bibdocs attached=1
3277::::total size latest version=714.1 KB
3277::::total size all files=714.1 KB
3277:225:::docname=ART--2009-009
3277:225:::doctype=DEMOART_FILE
3277:225:::status=prv
3277:225:::basedir=/soft/cds-invenio/var/data/files/g0/225
3277:225:::creation date=2009-05-13 13:07:55
3277:225:::modification date=2009-05-13 13:08:40
3277:225:::total file attached=1
3277:225:::total size latest version=714.1 KB
3277:225:::total size all files=714.1 KB
3277:225:1:.pdf:fullpath=/soft/cds-invenio/var/data/files/g0/225/ART--2009-009.pdf;1
3277:225:1:.pdf:fullname=ART--2009-009.pdf
3277:225:1:.pdf:name=ART--2009-009
3277:225:1:.pdf:status=prv
3277:225:1:.pdf:checksum=abccd8b54af1c1fb10f4ad3a7e93151a
3277:225:1:.pdf:size=714.1 KB
3277:225:1:.pdf:creation time=2009-06-15 13:28:44
3277:225:1:.pdf:modification time=2009-05-13 13:07:55
3277:225:1:.pdf:encoding=None
3277:225:1:.pdf:url=http://zaguan.unizar.es/record/3277/files/ART--2009-009.pdf
3277:225:1:.pdf:description=None
3277:225:1:.pdf:comment=Texto completo
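If you need to check many records, reading that output by eye gets tedious. The sketch below pulls the status of each file out of the text. It assumes the line layout recid:docid:version:format:key=value, which I am only inferring from the sample above, and `file_statuses` is a hypothetical helper, not part of bibdocfile.

```python
# A few lines copied from the output above, used as sample input.
SAMPLE = """\
3277::::total bibdocs attached=1
3277:225:::docname=ART--2009-009
3277:225:::status=prv
3277:225:1:.pdf:fullname=ART--2009-009.pdf
3277:225:1:.pdf:status=prv
3277:225:1:.pdf:url=http://zaguan.unizar.es/record/3277/files/ART--2009-009.pdf
"""


def file_statuses(report):
    """Map each file's fullname to its status."""
    files = {}
    for line in report.splitlines():
        # Split on the first four colons only, so URLs and dates stay intact.
        parts = line.split(":", 4)
        if len(parts) != 5 or "=" not in parts[4]:
            continue
        recid, docid, version, fmt, pair = parts
        if not version:
            continue  # record-level or document-level line, not a file
        key, value = pair.split("=", 1)
        files.setdefault((docid, version, fmt), {})[key] = value
    return dict((info.get("fullname"), info.get("status"))
                for info in files.values())


print(file_statuses(SAMPLE))
```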

After that I use a proxy to connect to my repository and, without being logged in to Invenio, I try to access the fulltext. A “this file is restricted” text appears. Cool, it worked! :)

Going further: using EZProxy

At this point we know how to restrict access to fulltexts to an IP range. If we want our users (who are validated against an LDAP system) to be able to access the fulltexts from home (and not only from the university’s IPs) we can use something like EZProxy. From a high-level point of view, this software gives an internal IP to outside connections (only if the user is able to log in to EZProxy).

I will not explain here how to install/use/configure this software because there is plenty of documentation on their website. Let’s suppose we have it running already.

The steps I took to make CDS work with EZProxy were:

8. I slightly modified the Bibformat element which shows URLs

(in my case, called bfe_fulltext_light.py) so that it ends up being something like this (I just show the main_urls part and mark the important lines with CHANGED comments. Please ignore the dx.doi.org part):

ezproxy_url = 'http://roble.unizar.es:9090/login?url=' # CHANGED: your URL to ezproxy
if main_urls:
        last_name = ""
        for descr, urls in main_urls.items():
            url_list = []
            urls.sort(lambda (url1, name1, format1), (url2, name2, format2): url1 < url2 and -1 or url1 > url2 and 1 or 0)

            for url, name, format in urls:
                last_name = name
                if show_icons.lower() == 'yes':
                    file_icon = '<img src="%s/img/%s" alt="%s"/>' \
                                   % (CFG_SITE_URL, icon(url), _("Download fulltext"))
                else:
                    file_icon = ''
                # first of all, see if it is public, private or dx.doi.org link
                # CHANGED: read the public/private flag and route private
                # fulltexts (and dx.doi.org links) through EZProxy
                pub_prv = bfo.field('984__a')
                if (url.find("dx.doi.org") != -1) or (pub_prv.find("private") != -1):
                    url_list.append('<a '+style+' href="' + ezproxy_url + escape(url)+'">'+ \
                                                    file_icon +'</a> ')

                else:
                    url_list.append('<a '+style+' href="' + escape(url)+'">'+ \
                                                    file_icon +'</a> ')

            out += separator.join(url_list) + additional_str

Then run:

echo "DELETE FROM bibfmt WHERE format='HB'" | /soft/cds-invenio/bin/dbexec
sudo -u apache bibreformat -c "YOUR_COLLECTION_NAME"
sudo -u apache bibsched

Now your URLs will be pointing to ezproxy_url instead of directly to the fulltext, which means users who can log in to EZProxy will be able to access fulltexts from any IP.
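Stripped of the BibFormat plumbing, the decision boils down to "private records and dx.doi.org links go through the proxy, everything else is linked directly". Here is a standalone sketch of just that decision, written for a current Python so it can be tried outside Invenio. `fulltext_href` is a hypothetical name, and `html.escape` stands in for whatever `escape` the format element imports.

```python
from html import escape

# Value taken from the snippet above; replace it with your own EZProxy URL.
EZPROXY_URL = "http://roble.unizar.es:9090/login?url="


def fulltext_href(url, visibility, ezproxy_url=EZPROXY_URL):
    """Return the href to use for a fulltext link.

    `visibility` is the content of the 984__a field ("private", "public"
    or an empty string).
    """
    needs_proxy = "dx.doi.org" in url or "private" in visibility
    target = escape(url, quote=True)
    if needs_proxy:
        return ezproxy_url + target
    return target


pdf = "http://zaguan.unizar.es/record/3277/files/ART--2009-009.pdf"
print(fulltext_href(pdf, "private"))  # goes through EZProxy
print(fulltext_href(pdf, "public"))   # direct link
```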

You can see a working example in Zaguan repository.

Thanks a lot to all the CDS Support Team and especially to Samuele Kaplun.