
Free CDS Invenio course available for download

Lately I can't stop uploading free courses, so you can't complain. The other day it was the one I gave to my colleagues at the IT Service on the Symfony framework.

Today's course is about CDS Invenio (free and open-source repository management software developed by CERN) and is aimed at library and IT staff. It covers the main modules, the workflow, actions, and common problems in administering a repository.

I hope you find it useful.

Without further ado, the link: Free CDS Invenio (repository) course on SlideShare

Vufind with Innovative Millennium and CDS-Invenio: importing records and configuring facets

A few days ago I explained how to install vufind.

Once the software is installed and some fundamental aspects have been checked (such as LDAP authentication), the next step is to import records, either from the library catalog or from OAI repositories.

Here are some conclusions from our first loading experiments.

Loading test records from Innovative Millennium (MARC format)

Suppose we have exported a file of MARC records (.mrc) from Millennium. If you have no record export at hand, you can use this page or this other one to obtain sample data.

My file is $VUFIND_HOME/import/400.mrc

I import the records with the following command:

/usr/local/vufind/import-marc.sh import/400.mrc

More information in the vufind wiki.

Loading test records from CDS-Invenio (MARCXML format)

Suppose we export a set of records in MARCXML format from a repository, for example this one.

I create a folder to store these exported .xml files:

mkdir $VUFIND_HOME/harvest/desdezaguan
mkdir $VUFIND_HOME/harvest/desdezaguan/marcxml
cd $VUFIND_HOME/harvest/desdezaguan/marcxml

I fetch the records. Note that the URL should be wrapped in quotes: unquoted, as in the run recorded below, the shell treats each & as a background operator and wget only receives the URL up to ?as=1.

wget http://zaguan.unizar.es/search?as=1&cc=Tesis&m1=a&p1=&f1=&op1=a&m2=a&p2=&f2=&op2=a&m3=a&p3=&f3=&action_search=Buscar&c=Tesis&c=&sf=&so=a&rm=&rg=160&sc=1&of=xm
 
Connecting to zaguan.unizar.es|127.0.0.1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `search?as=1'
 
    [  <=>                                                                                                ] 38,128       131K/s   in 0.3s
 
2010-09-30 14:11:26 (131 KB/s) - `search?as=1' saved [38128]
 
 
[1]+  Done                    wget http://zaguan.unizar.es/search?as=1

I rename the XML file:

 mv search\?as\=1 tesis0.xml
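The download and rename steps can also be sketched in Python, which sidesteps shell quoting altogether. This is only a sketch under assumptions: it keeps three parameters from the original URL (cc, rg, of), which I read as collection, number of records and output format, and it drops the rest. The function names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlretrieve

SEARCH_URL = "http://zaguan.unizar.es/search"


def build_export_url(collection, max_records, base=SEARCH_URL):
    """Build a search URL using a reduced (assumed sufficient) parameter set."""
    params = [
        ("cc", collection),   # collection, taken from the original URL
        ("rg", max_records),  # number of records requested (rg=160 in the post)
        ("of", "xm"),         # output format used in the original URL
    ]
    # urlencode joins the pairs with "&" and escapes special characters
    return base + "?" + urlencode(params)


def fetch_records(url, destination):
    """Download url directly into destination, so no rename is needed."""
    urlretrieve(url, destination)
    return destination


if __name__ == "__main__":
    # Only prints the URL; call fetch_records(url, "tesis0.xml") to download.
    print(build_export_url("Tesis", 160))
```

Saving straight to tesis0.xml makes the later mv step unnecessary.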

And we proceed with the import. The script is invoked with the directory containing the XML files as its parameter (in the run recorded below that directory was named desdezaguan-tesis):

[root@ harvest]# cd /usr/local/vufind/harvest; \
                     ./batch-import-marc.sh desdezaguan-tesis/marcxml/
 
Now Importing /usr/local/vufind/harvest/desdezaguan-tesis/marcxml//tesis0.xml ...
/usr/java/jre1.6.0_17/bin/java -Xms512m -Xmx512m -Dsolrmarc.solr.war.path=/usr/local/vufind/solr/jetty/webapps/solr.war -Dsolr.core.name=biblio -Dsolrmarc.path=/usr/local/vufind/import -Dsolr.path=/usr/local/vufind/solr -Dsolr.solr.home=/usr/local/vufind/solr -jar /usr/local/vufind/import/SolrMarc.jar /usr/local/vufind/import/import.properties /usr/local/vufind/harvest/desdezaguan-tesis/marcxml/tesis0.xml
 INFO [main] (MarcImporter.java:769) - Starting SolrMarc indexing.
 INFO [main] (Utils.java:189) - Opening file: /usr/local/vufind/import/import.properties
 INFO [main] (MarcHandler.java:325) - Attempting to open data file: /usr/local/vufind/harvest/desdezaguan-tesis/marcxml/tesis0.xml
 INFO [main] (MarcImporter.java:618) -  Updating to Solr index at /usr/local/vufind/solr
 INFO [main] (MarcImporter.java:634) -      Using Solr core biblio
 INFO [main] (SolrCoreLoader.java:102) - Using the data directory of: /usr/local/vufind/solr/biblio
 INFO [main] (SolrCoreLoader.java:104) - Using the multicore schema file at : /usr/local/vufind/solr/solr.xml
 INFO [main] (SolrCoreLoader.java:105) - Using the biblio core
 INFO [main] (MarcImporter.java:266) - Added record 1 read from file: 4841
 INFO [main] (MarcImporter.java:266) - Added record 2 read from file: 4840
 INFO [main] (MarcImporter.java:266) - Added record 3 read from file: 4823

....

 INFO [main] (MarcImporter.java:516) -  Adding 160 of 160 documents to index
 INFO [main] (MarcImporter.java:517) -  Deleting 0 documents from index
 INFO [main] (MarcImporter.java:391) - Calling commit
 INFO [main] (MarcImporter.java:402) - Done with the commit, closing Solr
 INFO [main] (MarcImporter.java:405) - Setting Solr closed flag
 INFO [main] (MarcImporter.java:431) - Connecting to solr server at URL: http://localhost:8080/solr/biblio/update
 INFO [main] (SolrUpdate.java:135) - <?xml version="1.0" encoding="UTF-8"?>
 INFO [main] (SolrUpdate.java:135) - <response>
 INFO [main] (SolrUpdate.java:135) - <lst name="responseHeader"><int name="status">0</int><int name="QTime">136</int></lst>
 INFO [main] (SolrUpdate.java:135) - </response>
 INFO [main] (MarcImporter.java:526) - Finished indexing in 0:01.00
 INFO [main] (MarcImporter.java:535) - Indexed 10 at a rate of about 8.0 per sec
 INFO [main] (MarcImporter.java:536) - Deleted 0 records
 INFO [Thread-2] (MarcImporter.java:465) - Starting Shutdown hook
 INFO [Thread-2] (MarcImporter.java:484) - Finished Shutdown hook

Running the script MOVES the file tesis0.xml and creates:

harvest/desdezaguan/marcxml/log and
harvest/desdezaguan/marcxml/processed

Let's see what each folder contains:

[root@ marcxml]# ls -l $VUFIND_HOME/harvest/desdezaguan/marcxml/log/
total 8
-rw-r--r-- 1 root root 3095 Sep 29 12:54 tesis0.xml.log
 
[root@olmo marcxml]# ls -l processed/
total 68
-rw-r--r-- 1 root root 59068 Sep 29 12:49 tesis0.xml (the original)
 
[root@olmo marcxml]# cd log/
[root@olmo log]# more tesis0.xml.log

*** NOTE: If the record identifier ALREADY EXISTS in vufind, the record is not duplicated; it is updated.

More information in the vufind wiki.

Configuring the display name of the Solr facets in vufind

The facets are described in the facets.ini file. Here is our file after modifying and customizing the names shown for the facets. The left-hand side is the 'logical name' of the Solr index field and the right-hand side is the display name (that is, what appears on the web page as facets).

* Note: accented characters work perfectly (thanks vufind guys!)

more $VUFIND_HOME/web/conf/facets.ini
 
; The order of display is as shown below
; The name of the index field is on the left
; The display name of the field is on the right
[Results]
institution        = Origen
building           = Localización
format             = Formato
 
; Use callnumber-first for LC call numbers, dewey-hundreds for Dewey Decimal:
callnumber-first   = "Call Number"
;dewey-hundreds     = "Call Number"
 
authorStr          = Autor
language           = Idioma
genre_facet        = Genero
era                = Era
geographic_facet   = Región
 
; Facets that will appear at the top of search results when the TopFacets
; recommendations module is used.  See the [TopRecommendations] section of
; searches.ini for more details.
[ResultsTop]
topic_facet        = "Suggested Topics"
 
; This section is reserved for special boolean facets.  These are displayed
; as checkboxes.  If the box is checked, the filter on the left side of the
; equal sign is applied.  If the box is not checked, the filter is not applied.
; The value on the right side of the equal sign is the text to display to the
; user.  It will be run through the translation code, so be sure to update the
; language files appropriately.
;
; Leave the section empty if you do not need checkbox facets.
;
; NOTE: Do not create CheckboxFacets using values that also exist in the
;       other facet sections above -- this will not work correctly.

The facet names end up looking like this:
(Screenshot: vufind facets example configuration)

Read more about facet configuration in vufind and Solr
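To double-check which display name each index field will get, the [Results] section of facets.ini can be read with a few lines of Python. This is just an illustrative sketch: vufind reads the file itself, the helper name read_facet_labels is hypothetical, and the sample text is a shortened copy of the file shown above.

```python
import configparser

SAMPLE_FACETS_INI = """\
; The name of the index field is on the left
; The display name of the field is on the right
[Results]
institution        = Origen
building           = Localización
format             = Formato
callnumber-first   = "Call Number"
;dewey-hundreds     = "Call Number"
authorStr          = Autor
"""


def read_facet_labels(ini_text, section="Results"):
    """Return {index_field: display_name} for one section of facets.ini."""
    parser = configparser.ConfigParser(interpolation=None)
    # configparser lowercases keys by default; keep authorStr as written
    parser.optionxform = str
    parser.read_string(ini_text)
    # Drop the optional double quotes around display names
    return {field: label.strip('"') for field, label in parser.items(section)}


if __name__ == "__main__":
    for field, label in read_facet_labels(SAMPLE_FACETS_INI).items():
        print(field, "->", label)
```

Lines starting with a semicolon are treated as comments, so the commented-out dewey-hundreds entry is skipped.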

Assigning values to the Solr facets in vufind

This is the next step we want to take. But first, let's look at how to query vufind's Solr engine.
At http://yoursite.com:8080/solr/biblio/admin/form.jsp on your web server there is a friendly interface that shows the XML responses the engine returns to queries, which helps us decide how we want to assign values to each field.

Solr query interface:
(Screenshot: vufind solr web interface)

Run a query and look at the XML it returns. Note the values the returned records have in each field. As a concrete example, let's look at the institution field, which by default is always assigned the static value 'MyInstitution' for any imported record:

<arr name="institution">
    <str>MyInstitution</str>
</arr>

This is due to the following line in $VUFIND_HOME/import/marc.properties, where that facet is assigned the static value 'MyInstitution':

institution = "MyInstitution"
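To see which values a facet field currently holds across the whole index, a facet query can be sent to Solr instead of browsing records one by one. A minimal sketch, assuming the core exposes the usual select handler next to the admin and update URLs seen above (the exact path is an assumption, and the helper name is hypothetical):

```python
from urllib.parse import urlencode

# Assumed location of the select handler for the biblio core
SELECT_URL = "http://yoursite.com:8080/solr/biblio/select"


def facet_query_url(field, base=SELECT_URL):
    """Build a URL that asks Solr for the value counts of one facet field."""
    params = [
        ("q", "*:*"),            # match every record
        ("rows", 0),             # we only want the facet counts, not the records
        ("facet", "true"),       # switch faceting on
        ("facet.field", field),  # the index field to count values for
    ]
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    print(facet_query_url("institution"))
```

Opening the printed URL in a browser should list each distinct institution value with its count, which makes it easy to confirm a mapping change after re-importing.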

Suppose we want the 'institution' facet to refer to the origin of the data. We will have several different origins: catalog and repository.

If the record comes from the repository (that is, it has 980a == 'TESIS'), we want this facet to store the string "Repositorio".
To do so we must edit the marc_local.properties file (this file overrides the default settings in marc.properties).

vi $VUFIND_HOME/import/marc_local.properties

And we add the following line:

# assign to the 'institution' facet the value of tag 980a according to the mapping defined in zaguan_map.properties
institution = 980a,zaguan_map.properties

And the content of $VUFIND_HOME/import/zaguan_map.properties is:

[root@ import]# more /usr/local/vufind/import/zaguan_map.properties
 
# If the value of the MARCXML tag is 'TESIS', assign the string 'Repositorio' to the facet
TESIS = Repositorio
TESIS = Repositorio

In the same way, suppose that if the record comes from the catalog (that is, it has 907a == '.b1XXX') this facet should store the string "Catálogo" (spelled Catalogo in the map file below). In $VUFIND_HOME/import/marc_local.properties we put the line:

# Take characters 1 and 2 of tag 907a and map them according to roble_map.properties.
institution = 907a[1-2],roble_map.properties

And in roble_map.properties:

[root@olmo import]# more /usr/local/vufind/import/roble_map.properties
b1 = Catalogo

Watch out for special characters such as the dot (.): they are interpreted as a regular expression and would have to be escaped!
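The two mappings above can be mimicked in a few lines of Python to check what value a given record would get. This is a sketch of the idea, not of how SolrMarc implements it: it assumes the [1-2] range is zero-based and inclusive (so '.b1XXX' yields 'b1'), and the record id used in the example is made up.

```python
def load_map(text):
    """Parse 'key = value' lines, skipping blank lines and # comments."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        mapping[key.strip()] = value.strip()
    return mapping


def map_value(raw, mapping, start=None, end=None):
    """Optionally cut raw to characters start..end (inclusive), then look it up.

    Returns None when the key has no entry in the map.
    """
    key = raw if start is None else raw[start:end + 1]
    return mapping.get(key)


# Contents copied from the two map files shown above
ZAGUAN_MAP = load_map("TESIS = Repositorio")
ROBLE_MAP = load_map("b1 = Catalogo")

if __name__ == "__main__":
    print(map_value("TESIS", ZAGUAN_MAP))            # repository record (980a)
    print(map_value(".b1234567", ROBLE_MAP, 1, 2))   # hypothetical catalog id (907a)
```

Taking only characters 1 and 2 also avoids having to put the leading dot in the map file.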

It is also useful to set the field that will act as the record identifier in vufind. In our case we want to use the value of tag 907a as the record identifier. Since the tag is repeatable, we must also add the first modifier.

So we add the following line to marc_local.properties:

id = 907a, first

Remember to restart vufind after these changes:

$VUFIND_HOME/vufind.sh restart

For now I leave you with the facets manual in the vufind wiki so you can keep reading ;)

More experiences coming soon!

CDS-Invenio: Understanding WEBSUBMIT

A few posts back I talked about restricting access to full texts to an IP range. In those articles I gave some tips about how websubmit works. Now I want to make another modification to my CDS Invenio repository, so I needed a deeper understanding of the websubmit workflow.

Before we begin this journey through websubmit, it is a good idea to define some terminology I'll be using:

Terminology

Recid: the record number. For instance, for record http://zaguan.unizar.es/record/2000 the recid is 2000. This number is stored in MARCXML tag 001.

Report_Number: another way to identify a record. For the example above, this number is INPRO-2009-038 and it is stored in MARCXML tag 037. It is also called reference or rn.

access: a generated number (I am not sure it is totally random) that is created every time you begin an action. You can see it in the URL when, for instance, you submit a record. It looks like this: 1260870578_1753. It is NOT stored in the MARCXML, but it is in the database (for instance in the id field of table sbmSUBMISSIONS).

act: the action you are executing. This parameter can be seen in your browser's URL. Invenio comes with several predefined actions, such as:

SBI: submit a new record
APP: approve a submitted record
MBI: modify an existing (that is, already approved!) record
SRV: change the attached full-text files

doctype: the document type to which act is applied. This can be seen in the browser's URL too. doctype refers to the string in brackets that you can see in your websubmit admin menu, for instance [DEMOBOO].

indir: each action is tied to a directory on the system. For example, for action MBI (modify existing record) the working directory is modify. This is the value of the indir parameter. Further details are given in the following section.
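As for the access value described above, its first part happens to decode cleanly as a Unix timestamp. The sketch below assumes exactly that, which the source does not confirm. It makes no claim about what the part after the underscore means, and split_access is a hypothetical helper.

```python
from datetime import datetime, timezone


def split_access(access):
    """Split an access value into (start time in UTC, suffix).

    Assumption: the part before the underscore is a Unix timestamp.
    """
    seconds, _, suffix = access.partition("_")
    started = datetime.fromtimestamp(int(seconds), tz=timezone.utc)
    return started, suffix


if __name__ == "__main__":
    started, suffix = split_access("1260870578_1753")
    print(started.isoformat(), suffix)
```

If the assumption holds, sorting submission directories by name also sorts them by start time.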

Actions

You can see your defined actions at
http://www.yourrepositoryname.com/admin/websubmit/websubmitadmin.py/actionlist

Each action is tied to a working directory. For instance, SBI is tied to the running directory, which is located at $PATH_TO_CDS/var/data/submit/storage/running/

By default, CDS Invenio comes with an example of a refereed doctype, which executes the functions below. Each function is listed with its step and score.

  Function                      Step   Score
  Create_Recid                  1      10
  Report_Number_Generation      1      20
  Make_Dummy_MARC_XML_Record    1      30
  Move_Files_to_Storage         1      40
  Mail_Submitter                1      50
  Update_Approval_DB            1      60
  Send_Approval_Request         1      70
  Print_Success                 1      80
  Move_to_Pending               1      90
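The step and score columns suggest an execution order. A small sketch, assuming (my reading, not something stated above) that functions run by ascending step and then ascending score:

```python
# (function name, step, score) as listed in the table above
SBI_FUNCTIONS = [
    ("Create_Recid", 1, 10),
    ("Report_Number_Generation", 1, 20),
    ("Make_Dummy_MARC_XML_Record", 1, 30),
    ("Move_Files_to_Storage", 1, 40),
    ("Mail_Submitter", 1, 50),
    ("Update_Approval_DB", 1, 60),
    ("Send_Approval_Request", 1, 70),
    ("Print_Success", 1, 80),
    ("Move_to_Pending", 1, 90),
]


def execution_order(functions):
    """Return function names sorted by (step, score)."""
    ordered = sorted(functions, key=lambda item: (item[1], item[2]))
    return [name for name, _step, _score in ordered]


if __name__ == "__main__":
    for position, name in enumerate(execution_order(SBI_FUNCTIONS), start=1):
        print(position, name)
```

Under that reading, Print_Success runs before Move_to_Pending, so the user sees the success message before the files are moved.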

The SUBMIT NEW RECORD (SBI) workflow

Suppose we enter our repository, log in, click "submit" and select a refereed doctype (in my case doctype=PFC). Then a form appears. What is really happening?

Well, websubmit_engine.py and websubmit_webinterface.py are working behind the scenes.

Here is what really happens:
websubmit_engine creates a new access number.

websubmit_webinterface creates a new directory using several parameters. Let's see an example:

$BASE/$indir/$doctype/$access
$BASE=$PATH_TO_CDS/var/data/submit/storage
$indir=running
$doctype=PFC
$access=1260870578_1753
 
So the system creates $PATH_TO_CDS/var/data/submit/storage/running/PFC/1260870578_1753
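The way the directory name is assembled from the four pieces can be sketched as follows. The function name is hypothetical and /path/to/cds stands in for $PATH_TO_CDS; this only shows the path arithmetic, not Invenio's own code.

```python
import posixpath


def build_curdir(storage_dir, indir, doctype, access):
    """Join base storage dir, action dir, doctype and access into curdir."""
    return posixpath.join(storage_dir, indir, doctype, access)


if __name__ == "__main__":
    print(build_curdir(
        "/path/to/cds/var/data/submit/storage",  # placeholder for $BASE
        "running",                               # indir for the SBI action
        "PFC",                                   # doctype
        "1260870578_1753",                       # access
    ))
```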

Then it copies to that directory (from now on called curdir) all of the parameters shown in your browser's URL (indir, doctype, access, and so on):

(Screenshot: cds invenio websubmit)

It is important to note that none of the PFC-SBI functions have been executed yet!

Now the user begins to fill in the submission form. Once it is filled in, the user presses the "submit" button. This is the moment at which the PFC-SBI functions begin to run.

Let's see what happens before Move_to_Pending is executed:

All of the form's fields are stored in curdir. If your form has a field called PFC_AUTHOR with value 'Miguel Martín', a file called PFC_AUTHOR is created in curdir containing the string 'Miguel Martín'. Since Make_Dummy_MARC_XML_Record has also been executed, a file called dummy_marcxml is created too (the Make_Dummy_MARC_XML_Record function takes the $doctype.tpl and $doctypeCREATE.tpl files into account and builds dummy_marcxml according to that information).
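The "one file per form field" idea is easy to reproduce. The sketch below is a stand-in with hypothetical helper names, run against a temporary directory rather than a real curdir:

```python
import os
import tempfile


def store_form_fields(curdir, fields):
    """Write each form field to a file named after the field."""
    os.makedirs(curdir, exist_ok=True)
    for name, value in fields.items():
        with open(os.path.join(curdir, name), "w", encoding="utf-8") as handle:
            handle.write(value)


def read_field(curdir, name):
    """Read back the value stored for one field."""
    with open(os.path.join(curdir, name), encoding="utf-8") as handle:
        return handle.read()


if __name__ == "__main__":
    demo_curdir = tempfile.mkdtemp()  # throwaway directory for the demo
    store_form_fields(demo_curdir, {"PFC_AUTHOR": "Miguel Martín"})
    print(os.listdir(demo_curdir), read_field(demo_curdir, "PFC_AUTHOR"))
```

This is why inspecting curdir with ls and more is often the quickest way to debug a submission.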

Once this is done, a "Congratulations! blablabla" message is shown to the user as a result of the execution of Print_Success. It seems that the whole submission process is over, but it really isn't.

Something 'unexpected' happens:
The Move_to_Pending function moves your curdir/* files to the /var/data/submit/storage/pending/doctype/Report_Number directory! That is, in my case:
/var/data/submit/storage/pending/PFC/PFC-2009-005

Here we can take a detailed look at Move_to_Pending.py:

import os

from invenio.config import CFG_WEBSUBMIT_STORAGEDIR
from invenio.websubmit_config import InvenioWebSubmitFunctionError

def Move_to_Pending(parameters, curdir, form, user_info=None):
    # rn (the report number) is a module-level global, set before this function runs
    global rn
    doctype = form['doctype']
    # Pending directory for this doctype, e.g. .../storage/pending/PFC
    PENDIR = "%s/pending/%s" % (CFG_WEBSUBMIT_STORAGEDIR,doctype)
    if not os.path.exists(PENDIR):
        try:
            os.makedirs(PENDIR)
        except:
            raise InvenioWebSubmitFunctionError("Cannot create pending directory %s" % PENDIR)
    # Moves the files to the pending directory
    # A "/" in the report number would create nested directories, so replace it
    rn = rn.replace("/","-")
    namedir = rn
    FINALDIR = "%s/%s" % (PENDIR,namedir)
    # Rename the whole curdir, so every file in it moves at once
    os.rename(curdir,FINALDIR)
    return ""

As shown above, this function creates the pending directory if it does not exist; in my example that is /var/data/submit/storage/pending/PFC/.

The rn variable is the Report Number, that is, PFC-2009-005. So the function moves all the contents of curdir to the /var/data/submit/storage/pending/PFC/PFC-2009-005/ directory.
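The same move can be tried outside Invenio. The sketch below is a standalone imitation with a hypothetical function name: it takes the storage directory and report number as arguments instead of using CFG_WEBSUBMIT_STORAGEDIR and the global rn, and the demo runs in a temporary directory.

```python
import os
import tempfile


def move_to_pending(storage_dir, doctype, report_number, curdir):
    """Rename curdir to <storage_dir>/pending/<doctype>/<report number>."""
    pending_dir = os.path.join(storage_dir, "pending", doctype)
    os.makedirs(pending_dir, exist_ok=True)
    # Same substitution as the original: "/" would create nested directories
    final_dir = os.path.join(pending_dir, report_number.replace("/", "-"))
    os.rename(curdir, final_dir)
    return final_dir


if __name__ == "__main__":
    storage = tempfile.mkdtemp()  # throwaway stand-in for the storage dir
    curdir = os.path.join(storage, "running", "PFC", "1260870578_1753")
    os.makedirs(curdir)
    print(move_to_pending(storage, "PFC", "PFC-2009-005", curdir))
```

Because the directory is renamed rather than copied, nothing is left behind under running/ once the function returns.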

(Screenshot: cds invenio websubmit)

*** Edit: let's look at part of the MBI (modify record AFTER it is approved) pipeline.

APP (approve record)

When a record has been submitted and then approved, some information is stored in /var/data/submit/storage/done/PFC/$Report_Number and a new entry is created in the MySQL database, more precisely in table sbmSUBMISSIONS.

This table stores a lot of information. Let's see what is stored for one record, the one whose Report_Number is used in the query below:

SELECT * FROM sbmSUBMISSIONS WHERE reference = 'PFC-2009-018';

(Screenshot: submission database)

Let's look at the stored fields:

email: who performed the action (it was me!).
doctype: the doctype of the document on which the action was performed (PFC).
action: the action performed on that doctype (SBI, MBI, APP, SRV).
status: the status of the task, in this case finished (it could also be pending).
id: the access number of the task: 1245755804_15279
reference: the report number (rn).
cd: the date of creation of /soft/cds-invenio/var/data/submit/storage/pending/PFC/PFC-2009-018/mainmenu
md: the date of creation of the /soft/cds-invenio/var/data/submit/storage/pending/PFC/PFC-2009-018 directory
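For scripting, the same lookup can be wrapped in a function that returns each row as a dictionary keyed by the fields just listed. This sketch uses an in-memory SQLite table as a stand-in for the real MySQL database. The demo row is made up apart from the id, reference, doctype and status values quoted above, and the function names are hypothetical.

```python
import sqlite3

COLUMNS = ["email", "doctype", "action", "status", "id", "reference", "cd", "md"]


def find_submissions(connection, reference):
    """Return all rows for one report number as a list of dicts."""
    cursor = connection.execute(
        "SELECT " + ", ".join(COLUMNS)
        + " FROM sbmSUBMISSIONS WHERE reference = ?",
        (reference,),  # bound parameter, so no quoting problems
    )
    return [dict(zip(COLUMNS, row)) for row in cursor.fetchall()]


def demo_connection():
    """Build a throwaway table with one placeholder row (not real data)."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE sbmSUBMISSIONS (" + ", ".join(c + " TEXT" for c in COLUMNS) + ")"
    )
    connection.execute(
        "INSERT INTO sbmSUBMISSIONS VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        ("user@example.org", "PFC", "SBI", "finished",
         "1245755804_15279", "PFC-2009-018", None, None),
    )
    return connection


if __name__ == "__main__":
    print(find_submissions(demo_connection(), "PFC-2009-018"))
```

Against the real database the query text stays the same; only the connection (and the parameter placeholder style of the MySQL driver) would change.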

The full-text document (PDF or whatever format you use) is stored in /soft/cds-invenio/var/data/submit/storage/pending/PFC/PFC-2009-018/files/PFC_MEM/PFC-2009-018.pdf but a compressed version is also stored in /soft/cds-invenio/var/data/files/g0/405/PFC-2009-018.pdf;1

MBI (modify AFTER approval)

The functions that are run are listed below:

    Get_Report_Number
    Get_Recid
    Is_Original_Submitter
    Create_Modify_Interface
    Get_Report_Number
    Get_Recid
    Make_Modify_Record
    Insert_Modify_Record
    Print_Success_MBI
    Send_Modify_Mail
    Move_to_Done

Wow, a lot of functions! I will comment only on the ones that are most important and hardest to understand.

I'll begin with Create_Modify_Interface:

This function reads the stored metadata of a record and creates an interface to modify the fields that the user selects. Where does Create_Modify_Interface read the stored values from? The database? The system directories? Well, it depends: this function goes into curdir and looks for a file named Create_Modify_Interface_Done.

If it exists, the record metadata is loaded from FILES (the curdir system directory) using Create_Modify_Interface_getfieldval_fromfile.

If it does not exist, the record metadata is loaded from the DATABASE, using Create_Modify_Interface_getfieldval_fromDBrec.
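The decision just described boils down to a marker-file check. A minimal sketch of that logic only: metadata_source is a hypothetical name, and it returns a label instead of calling the two Invenio helpers.

```python
import os
import tempfile

MARKER_FILE = "Create_Modify_Interface_Done"


def metadata_source(curdir):
    """Say where field values would be read from for this curdir."""
    if os.path.exists(os.path.join(curdir, MARKER_FILE)):
        return "files"     # would use Create_Modify_Interface_getfieldval_fromfile
    return "database"      # would use Create_Modify_Interface_getfieldval_fromDBrec


if __name__ == "__main__":
    demo_curdir = tempfile.mkdtemp()  # throwaway directory for the demo
    print(metadata_source(demo_curdir))
    open(os.path.join(demo_curdir, MARKER_FILE), "w").close()
    print(metadata_source(demo_curdir))
```

So the first time the interface is built the values come from the database, and once the marker file is present they come from the files in curdir.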

**** WILL CONTINUE WRITING THIS POST IN FUTURE DAYS ****