Introducing the MARCXML manipulation tool

If you have to import/export your MARCXML records in Invenio, tind.io offers this great online utility: https://tools.tind.io/xml/xml-manipulation/ that allows you to manipulate MARCXML.

[Screenshot: the MARCXML manipulation tool at tind.io]

Exporting MARC is useful not only during migrations, but also when you have to make changes to a large number of records in your Invenio installation. It is far better than performing those changes at the database level.

You can export records using the web interface or the command line. I prefer the latter, using BibExport:

First, edit this config file: /opt/invenio/etc/bibexport/marcxml.cfg

The MARCXML exporting method exports all the records matching a particular search query, zips them and moves them to the requested folder. The output of this exporting method is similar to what one would get by listing the records in MARCXML from the web search interface.

The default configuration is given below. This job would export all records from the Book collection into one XML file and all articles by the author “Polyakov, A M” into another.

[export_job]
export_method = marcxml
[export_criterias]
books = 980__a:BOOK
polyakov_articles = 980__a:ARTICLE and author:"Polyakov, A M"

The job is run with this command:

/opt/invenio/bin/bibexport -u admin -wmarcxml

Default folder for storing is:

/opt/invenio/var/www/export/marcxml
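
If you want to sanity-check what the job produced before re-importing it anywhere, here is a minimal sketch from the shell (the books.xml file name is an assumption based on the export_criterias key above; adjust it, and unzip first if your export is zipped):

# Check that the exported MARCXML is well-formed (xmllint ships with libxml2)
xmllint --noout /opt/invenio/var/www/export/marcxml/books.xml && echo "well-formed"
# Rough record count: lines containing an opening record element
grep -c '<record' /opt/invenio/var/www/export/marcxml/books.xml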

Drupal 7: install module translation [SOLVED]

To install a Drupal 7 module translation, that is, a .po file (for instance, the fullcalendar Spanish translation), go to Administration » Configuration » Regional and language (/admin/config/regional/translate).

Click the ‘Import’ tab and use the ‘Import translation’ form.


What if some string translations are still missing? No worries, just go to /admin/config/regional/translate/translate and add them (by hand).

Tip: use the Localization update module 😉
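
If you prefer the command line, this can be scripted with Drush (a hedged sketch; it assumes Drush is installed and that the l10n_update project's own Drush integration provides the l10n-update commands):

# Download and enable the Localization update module
drush dl l10n_update
drush en -y l10n_update
# Refresh available translation info, then download and import updated .po files
drush l10n-update-refresh
drush l10n-update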

VuFind: delete imported records

The VuFind admin panel (http://www.yourhost.com/vufind/Admin/Home) allows you to delete records by ID (it calls Records.php, and more precisely its deleteRecord method).

But if you want, for instance, to delete all the records from a bad import, you can do it directly from your system prompt using the util/deletes.php script:

The first parameter is the name of the import file and the second its format. If no format is supplied, ‘marc’ is assumed.

cd $VUFIND_HOME;
php util/deletes.php import/400.mrc marc
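
There is also a ‘flat’ mode that reads a plain text file with one record ID per line, which is handy when you have the IDs but no longer have the original MARC file (ids.txt is a hypothetical file name):

cd $VUFIND_HOME;
php util/deletes.php ids.txt flat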

In my case, an error showed up:

PHP Warning:  parse_ini_file(../web/conf/config.ini): failed to open stream: No such file or directory in /usr/local/vufind/util/deletes.php on line 48
Solr index is offline.

Mmmhh. Relative path issues.

I opened deletes.php and edited line 48 so that parse_ini_file reads /usr/local/vufind/web/conf/config.ini (the full path). But then another error showed up:

PHP Fatal error:  Call to a member function getData() on a non-object in /usr/local/vufind/util/deletes.php on line 85

What’s the problem now? Well, I am NOT USING MARC’s 001 tag as the identifier, as stated in my import/marc_local.properties (which overrides import/marc.properties). My ID is set to the record’s 907a value… UGH.

We will deal with this issue later (see below!).

New delete tool script

I first decided to write a little PHP program that lets me delete a list of identifiers. I called it util/BorraRegistros.php.

This PHP script is called WITHOUT parameters, so you’ll have to edit it to include the identifiers in the $lista_ids_registros array.

<?php
set_include_path('/usr/local/vufind/web/:/usr/local/vufind/web/sys/:/usr/local/lib/php/');
require_once 'Solr.php';
 
$configArray = parse_ini_file('/usr/local/vufind/web/conf/config.ini', true);
 
// Setup Solr Connection
$url = $configArray['Index']['url'];
$solr = new Solr($url);
if ($configArray['System']['debug']) {
    $solr->debug = true;
}
 
// ----------------------------------------------------------------
// This is the list of SOLR IDENTIFIERS to be deleted!!!
$lista_ids_registros = array('.b1000001x');
// ----------------------------------------------------------------
 
print "Interfaz de borrado de registros\nSe borraran los registros cuyos identificadores son:\n";
print_r($lista_ids_registros);
 
// Confirm deletion...
echo "¿Seguro de que deseas continuar? Escribe 'si' para continuar: ";
$handle = fopen ("php://stdin","r");
$line = fgets($handle);
if(trim($line) != 'si'){
    echo "Cancelado\n";
    exit;
}
echo "\n";
echo "Gracias, se va a proceder...\n";
 
// Delete each record identified by its value $lista_ids_registros
foreach ($lista_ids_registros as $id_registro){
   print "\nPreparando para borrar el registro '$id_registro'.............................";
   $solr->deleteRecord($id_registro);
   print "[ OK ]";
}
 
print "\nTerminando el borrado...";
// Now commit and optimize
$solr->commit();
$solr->optimize();
print "\n";
 
?>

More references on Solr’s delete by ID can be found at http://wiki.apache.org/solr/UpdateXmlMessages#A.22delete.22_by_ID_and_by_Query

This script could also be done using cURL, but I kinda prefer it this way.
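
For reference, here is how that cURL variant might look, posting XML update messages straight to Solr (a sketch only; the host, port and biblio core name are assumptions, so take the real URL from the [Index] section of config.ini):

# Delete one record by ID, then commit the change
curl 'http://localhost:8080/solr/biblio/update' -H 'Content-Type: text/xml' --data-binary '<delete><id>.b1000001x</id></delete>'
curl 'http://localhost:8080/solr/biblio/update' -H 'Content-Type: text/xml' --data-binary '<commit/>'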


Fixing util/deletes.php to handle records identified by MARC tags other than 001

Above we ran into problems when deleting recently imported records with util/deletes.php. The problem is that this script does not read import/marc_local.properties and therefore does not notice that our Solr records might not be identified by their MARC 001 tag.

For instance, in my import/marc_local.properties we find:

id = 907a, first

So I modified util/deletes.php so that it takes into account that my Solr records are identified by tag 907 (subfield ‘a’) and NOT by tag 001! Notice the changes around the getField call, marked with comments in the listing below.

<?php
/**
 *
 * Copyright (C) Villanova University 2007.
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License version 2,
 * as published by the Free Software Foundation.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
 *
 */
 
// Parse the command line parameters -- see if we are in "flat file" mode and
// find out what file we are reading in!
$filename = $argv[1];
$mode = isset($argv[2]) ? $argv[2] : 'marc';
 
// No filename specified?  Give usage guidelines:
if (empty($filename)) {
    echo "Delete records from VuFind's index.\n\n";
    echo "Usage: deletes.php [filename] [format]\n\n";
    echo "[filename] is the file containing records to delete.\n";
    echo "[format] is the format of the file -- it may be one of the following:\n";
    echo "\tflat - flat text format (deletes all IDs in newline-delimited file)\n";
    echo "\tmarc - binary MARC format (delete all record IDs from 001 fields)\n";
    echo "\tmarcxml - MARC-XML format (delete all record IDs from 001 fields)\n";
    echo '"marc" is used by default if no format is specified.' . "\n";
    die();
}
 
// File doesn't exist?
if (!file_exists($filename)) {
    die("Cannot find file: {$filename}\n");
}
 
require_once 'util.inc.php';        // set up util environment
require_once 'sys/Solr.php';
 
// Read Config file
//$configArray = parse_ini_file('../web/conf/config.ini', true);
$configArray = parse_ini_file('/usr/local/vufind/web/conf/config.ini', true);
// Setup Solr Connection
$url = $configArray['Index']['url'];
$solr = new Solr($url);
if ($configArray['System']['debug']) {
    $solr->debug = true;
}
 
// Count deleted records:
$i = 0;
 
// Flat file mode:
if ($mode == 'flat') {
    $ids = explode("\n", file_get_contents($filename));
    foreach($ids as $id) {
        $id = trim($id);
        if (!empty($id)) {
            $solr->deleteRecord($id);
            $i++;
        }
    }
// MARC file mode:
} else {
    // We need to load the MARC record differently if it's XML or binary:
    if ($mode == 'marcxml') {
        require_once 'File/MARCXML.php';
        $collection = new File_MARCXML($filename);
    } else {
        // this require refers to /usr/local/lib/php/File/MARC.php
        require_once 'File/MARC.php';
        $collection = new File_MARC($filename);
    }
 
    // Once the record is loaded, the rest of the logic is always the same:
    while ($record = $collection->next()) {
        // getField is defined in /usr/local/lib/php/File/MARC/Record.php
        // Comment this line
        // $idField = $record->getField('001');
        // Add the following two lines...
        $idField = $record->getField('907');
        $idField = $idField->getSubfield('a');
        $id = (string)$idField->getData();
        $solr->deleteRecord($id);
        $i++;
    }
}
 
// Commit and Optimize if necessary:
if ($i) {
    $solr->commit();
    $solr->optimize();
}
?>

Now we can run the new script (I saved it as util/deletes_by_907a.php) and it will work 🙂

clear; php $VUFIND_HOME/util/deletes_by_907a.php $VUFIND_HOME/import/400.mrc marc
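
To double-check that the records are really gone, you can query Solr directly (again, host, port and core name are assumptions; adjust them to your installation):

curl 'http://localhost:8080/solr/biblio/select?q=id%3A%22.b1000001x%22'
# numFound="0" in the response means the record is no longer in the index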

Thanks for reading, have fun!

CDS-Invenio: import records from WebOPAC PRO (Millennium)

One of the most common needs in today’s libraries is to move information from the catalogue into the repository. To do so, I’ve developed a bash script that calls a Python program which checks for new or modified records in the library catalogue and loads them into the CDS Invenio repository.

First we’ll take a look at the bash script:

#!/bin/bash
# author: miguelm[at]unizar[dot]es
# date: 2009-03-29
# comments:
#     -  This script loads into your repository some records from the
#         library's catalog.
#     -  Parameters: this script does not need any parameters at all
#     -  This script will create the following files:
#           1. reg_nuevosYYYYmmdd.xml
#           2. reg_modifYYYYmmdd.xml
#           3. logYYYYmmdd.log
 
clear
PATH=$PATH:$HOME/bin
ORACLE_HOME=/usr/lib/oracle/11.1/client64
LD_LIBRARY_PATH=/usr/lib/oracle/11.1/client64/lib
export ORACLE_HOME
export LD_LIBRARY_PATH
export PATH
 
ruta="/home/apache/"
 
sal=`cd "$ruta"`
hoy2=`/bin/date`
#echo "Actual date: " $hoy2
fechaini=`/bin/date --date "$hoy2 -60 days"`
#echo "Two-months-ago date: " $fechaini
fechainimyformat=`/bin/date --d "$fechaini $1 sec" "+%d/%m/%Y"`
 
fechainiformatfile=`/bin/date --d "$fechaini $1 sec" "+%Y%m%d"`
echo $fechainiformatfile
fichcreados=$ruta"importaFHdesdeRoble/reg_nuevos"$fechainiformatfile".xml"
# the name for the xml file that contains the NEW records
 
fichmodif=$ruta"importaFHdesdeRoble/reg_modif"$fechainiformatfile".xml"
# the name for the xml file that contains the MODIFIED records
 
fichlog=$ruta"importaFHdesdeRoble/log"$fechainiformatfile".log"
# the name for the log file that records the process.
 
echo "Begining the import process  (FH records) from catalogue to repository"
echo "------------------------------------------------------------------------"
echo "Begin date: " $fechainimyformat
echo "Output file for new records: "$fichcreados
echo "Output file for modified records: "$fichmodif
echo "Log file: "$fichlog
echo "------------------------------------------------------------------------"
echo ""
s2=`/bin/touch "$fichcreados"`
s2=`/bin/touch "$fichmodif"`
comm = "/usr/bin/python "$ruta" importaFHdesdeRoble/importaFH.py"
salida=
<br />`"$comm" "$fechainimyformat" "$fichcreados" "$fichmodif" &gt; $fichlog`
 
# Now that the records have been created, check that their syntax is correct:
# we use xmlmarclint for this purpose

echo "Checking file integrity..."

file="temp.tmp"
salida1=`/soft/cds-invenio/bin/xmlmarclint "$fichcreados" > "$file"`
if [ -s $file ]; then
   # the output file of the xmlmarclint command is not empty => there are errors...
   echo "Syntax of new-records XML....[FAIL]"
   mens="Syntax check failed for "$fichcreados
   echo "$mens" >> "$fichlog"
else
   echo "Syntax of new-records XML....[OK]"
   # if the XML file is NOT empty, load its data into the repository...
   if [ -s $fichcreados ]; then
       echo "Beginning to upload the new records into the repository......."
       salida1=`/soft/cds-invenio/bin/bibupload -i "$fichcreados" > "$fichlog"`
   else
        echo "There are no new records to load (the XML file is EMPTY)"
   fi

fi
alias rm='rm'
rm "$file"
 
# Similar behaviour for the modified records...
file="temp.tmp"
salida2=`/soft/cds-invenio/bin/xmlmarclint "$fichmodif" > "$file"`
if [ -s $file ]; then
   # the xmlmarclint error output file has content => there are errors
   echo "Syntax of the MODIFIED-records file (they already exist in Zaguan).....[FAIL]"
   mens2="Syntax check failed for "$fichmodif
   echo "$mens2" >> "$fichlog"
else
   echo "Syntax of the MODIFIED-records file (they already exist in Zaguan).....[OK]"
   if [ -s $fichmodif ]; then
        echo "Proceeding to modify these records.............."
        salida1=`/soft/cds-invenio/bin/bibupload -r "$fichmodif" > "$fichlog"`
   else
        echo "The MODIF. file "$fichmodif" is empty = no records to modify"
   fi

fi
alias rm='rm'
rm "$file"

alias rm='rm -i'
mail=`/bin/mail miguelm@unizar.es < "$fichlog"`
echo "Done"



Now, the Python program importaFH.py (which queries the catalogue’s Oracle DB):

#! /usr/bin/python
# This program queries the catalogue's Oracle DB
# to export the CREATED/MODIFIED records.
# Parameters:
#    This program should be called like:
#    python importaFH.py 01/01/2009 salida_crear.xml salida_modificar.xml
#        salida_crear     = the XML file that contains the NEW
#                             records (the ones that are NOT
#                             in your repository)
#        salida_modificar = the XML file that contains the
#                             MODIFIED records (they are already
#                             in your repository, so they have to be updated!)
 
import sys
import time
import os
os.environ["NLS_LANG"] = "AMERICAN_AMERICA.UTF8"
import cx_Oracle
import cgi
 
class RegistroFH(object):
 def __init__(self, controlfield001 = "", controlfield008 = "",
    datafield040a = "", datafield040d = "", datafield100a = "",
    datafield245a = "", datafield245c = "", datafield260a = "",
    datafield260b = "", datafield260c = "", datafield300a = "",
    datafield300b = "", datafield300c = "", datafield500a = [],
    datafield538a = [], datafield700a = [], datafield700e = [],
    datafield752a = "", datafield752d = "", datafield907a = ".",
    datafield907b = "", datafield907c = "", datafield945a = "",
    datafield945t = "", datafield945y = "", datafield945z = "",
    datafield980a = "", datafield998a = "", datafield998b = "",
    datafield998f = "", datafield998g = "", datafield8564u = [],
    datafield8564z = []):
  self.controlfield001 = controlfield001
  self.controlfield008 = controlfield008
  self.datafield040a = datafield040a
  self.datafield040d = datafield040d
  self.datafield100a = datafield100a
  self.datafield245a = datafield245a
  self.datafield245c = datafield245c
  self.datafield260a = datafield260a
  self.datafield260b = datafield260b
  self.datafield260c = datafield260c
  self.datafield300a = datafield300a
  self.datafield300b = datafield300b
  self.datafield300c = datafield300c
  self.datafield500a = datafield500a
  self.datafield538a = datafield538a
  self.datafield700a = datafield700a
  self.datafield700e = datafield700e
  self.datafield752a = datafield752a
  self.datafield752d = datafield752d
  self.datafield907a = datafield907a
  self.datafield907b = datafield907b
  self.datafield907c = datafield907c
  self.datafield945a = datafield945a
  self.datafield945t = datafield945t
  self.datafield945y = datafield945y
  self.datafield945z = datafield945z
  self.datafield980a = datafield980a
  self.datafield998a = datafield998a
  self.datafield998b = datafield998b
  self.datafield998f = datafield998f
  self.datafield998g = datafield998g
  self.datafield8564u = datafield8564u
  self.datafield8564z = datafield8564z
 
# Searches your repository for the record identified by rec_key
# Returns -1 if the record is not found,
#         or the 'tag 001 value' if the record exists
def existe_en_zaguan(rec_key):
  url_buscar = """http://zaguan.unizar.es/search?p="""
  url_buscar += """%s""" % (rec_key)
  url_buscar += """*&f=&action_search=Buscar&c=Repositorio+Digital+de+la+Universidad+de+Zaragoza&sf=&so=d&rm=&rg=10&sc=1&of=xm"""
  #print url_buscar
  import urllib
  f = urllib.urlopen(url_buscar)
  contenido = f.read()
  # look for the 001 controlfield in the MARCXML (of=xm) output
  cadena_a_buscar = '<controlfield tag="001">'
  position1 = contenido.find(cadena_a_buscar)
  if position1 == -1:
    #print "(existe_en_zaguan): %s DOES NOT EXIST in zaguan" % (rec_key)
    return -1
  else: # return the record's tag 001 value
    position1 = position1 + len(cadena_a_buscar)
    position2 = contenido.find("</controlfield>", position1)
    return contenido[position1:position2]
 
# Creates MARCXML for the record identified by rec_key and the repository's controlfield 001
# If tag001 == '' is passed, it will create MARCXML __WITHOUT TAG 001__
# If tag001 != '' it will create the MARCXML __WITH TAG 001__
# Note it will only create MARCXML if the record belongs to collection FH (bcode3=a) and is a
#    monograph (bib_lvl=m)
def marcxml_de_registro(rec_key, tag001):
  registro = make_fh_record(rec_key)
  registro.controlfield001 = tag001
  # now connect to the catalogue's Oracle DB....
  dsn = cx_Oracle.makedsn('155.210.5.40', 1521, 'IIIDB')
  connection = cx_Oracle.connect('III', 'III', dsn)
  cursor = connection.cursor()
  cursor.arraysize = 50
  stmt = """SELECT v.rec_key, v.MARC_TAG, v.INDICATOR1, v.INDICATOR2, v.REC_DATA FROM biblio2base b, locations2 l, var_fields2 v WHERE b.bcode3='a' AND b.bib_lvl='m' AND b.REC_KEY=l.REC_KEY AND b.REC_KEY=v.REC_KEY AND """
  stmt += """b.REC_KEY = '%s' """ % (rec_key)
  stmt += """ORDER BY v.rec_key, v.marc_tag"""
  cursor.execute(stmt)
  fila = cursor.fetchone()
  while (fila != None):
    # fills the record fields from each (rec_key, marc_tag, ind1, ind2, rec_data) row
    interpreta_valores(registro, fila[0], fila[1], fila[2], fila[3], fila[4])
    fila = cursor.fetchone()
  return crea_marcxml_fh(registro)
 
# Queries the DB and fills the file with the records in XML
# All the records must satisfy these conditions:
#      1. CREATION DATE > fecha_ini (and, logically, earlier than the current date)
#      2. They HAVE a URL pointing to ZAGUAN (856 like %zaguan%)
#      3. They belong to FONDO ANTIGUO (bcode3=a)
#      4. They are MONOGRAPHS (bib_lvl=m)
#
# Dumps the output to fich_salida_name
def consulta_creados(fecha_ini, fich_salida_name):
  dsn = cx_Oracle.makedsn('155.210.5.40', 1521, 'IIIDB')
  connection = cx_Oracle.connect('---------', '--------', dsn)
  #change the '---' with your user/pass values
  cursor = connection.cursor()
  cursor.arraysize = 50
  cadena_url = "zaguan"
  # convert the dd/mm/yyyy date into a DDMMYYYY string
  dia = fecha_ini[:2]
  mes = fecha_ini[3:5]
  anyo = fecha_ini[6:]
  cadena_fecha = dia+mes+anyo
  stmt = """ SELECT
                v2.rec_key, v2.MARC_TAG, v2.INDICATOR1, v2.INDICATOR2,
                v2.REC_DATA FROM biblio2base b, locations2 l, var_fields2 v1,
                var_fields2 v2 WHERE b.bcode3='a' AND b.bib_lvl='m' AND """
  # keep the date as a zero-padded string so days/months below 10 are not truncated
  stmt += """ b.created >= to_date('%s', 'DDMMYYYY' ) AND""" % (cadena_fecha)
  stmt += """  b.REC_KEY=l.REC_KEY AND
	       l.location LIKE '100%' AND
	       b.REC_KEY=v1.REC_KEY AND
	       v1.MARC_TAG LIKE '856' AND"""
  stmt += """  v1.REC_DATA LIKE '%s' AND """ %('%'+cadena_url+'%')
  stmt += """  v1.REC_KEY=v2.REC_KEY ORDER BY v2.rec_key, v2.marc_tag"""
  cursor.execute(stmt)
  fila = cursor.fetchone()
  out = ""
  while (fila != None):
    record1 = make_fh_record(fila[0]) # pass it the rec_key
    interpreta_valores(record1, fila[0], fila[1], fila[2], fila[3], fila[4])
    # consume every remaining row belonging to the same rec_key
    fila2 = cursor.fetchone()
    while (fila2 != None) and (fila[0] == fila2[0]):
      interpreta_valores(record1, fila2[0], fila2[1], fila2[2], fila2[3], fila2[4])
      fila2 = cursor.fetchone()
    # at this point either fila2 belongs to the next record or there are no more rows
    out += crea_marcxml_fh(record1)
    fila = fila2
  # at this point, fila is None
  escribe_a_fichero(fich_salida_name, out)
 
# Given an __ALREADY CREATED__ RegistroFH and the information
# from a query row (after a fetch),
# fills in the record's fields until it is complete.
def interpreta_valores(record, rec_key, marc_tag, indicator1, indicator2, rec_data):
    if marc_tag == '008':
        record.controlfield008 = rec_data
    elif marc_tag == '040':
        subfields = rec_data.split('|')
        for sbf in subfields: # for each of the subfields...
            if sbf != "":
                if sbf[0] == 'a': # a first character 'a' means this was a |a subfield
                    record.datafield040a = sbf[1:].replace('&', '&amp;')
                elif sbf[0] == 'd':
                    record.datafield040d = sbf[1:].replace('&', '&amp;')
    elif marc_tag == '100':
        # first split the possible subfields on |...
        subfields = rec_data.split('|')
        for sbf in subfields: # for each of the subfields...
            if sbf != "":
                if sbf[0] == 'a': # |a => author's name
                    record.datafield100a = sbf[1:].replace('&', '&amp;')
                elif sbf[0] == 'c':
                    record.datafield100a = sbf[1:].replace('&', '&amp;') + record.datafield100a
                elif sbf[0] == 'd': # dates related to the author's name
                    record.datafield100a = record.datafield100a + '(' + sbf[1:].replace('&', '&amp;') + ')'
    elif marc_tag == '110':
        subfields = rec_data.split('|')
        for sbf in subfields:
            if sbf != "":
                if sbf[0] == 'a':
                    record.datafield110a = sbf[1:].replace('&', '&amp;')
                elif sbf[0] == 'b':
                    record.datafield110a = record.datafield110a + sbf[1:].replace('&', '&amp;')
    elif marc_tag == '245':
        subfields = rec_data.split('|')
        for sbf in subfields:
            if sbf != "":
                if sbf[0] == 'a':
                    record.datafield245a = sbf[1:].replace('&', '&amp;')
                elif sbf[0] == 'b':
                    record.datafield245a = record.datafield245a + sbf[1:].replace('&', '&amp;')
                elif sbf[0] == 'c':
                    record.datafield245c = sbf[1:].replace('&', '&amp;')
    elif marc_tag == '260':
        subfields = rec_data.split('|')
        for sbf in subfields:
            if sbf != "":
                if sbf[0] == 'a':
                    record.datafield260a = sbf[1:].replace('&', '&amp;')
                elif sbf[0] == 'b':
                    record.datafield260b = sbf[1:].replace('&', '&amp;')
                elif sbf[0] == 'c':
                    record.datafield260c = sbf[1:].replace('&', '&amp;')
    elif marc_tag == '300':
        subfields = rec_data.split('|')
        for sbf in subfields:
            if sbf != "":
                if sbf[0] == 'a':
                    record.datafield300a = sbf[1:].replace('&', '&amp;')
                elif sbf[0] == 'b': # repeatable!
                    record.datafield300b = sbf[1:].replace('&', '&amp;')
                elif sbf[0] == 'c':
                    record.datafield300c = sbf[1:].replace('&', '&amp;')
    elif marc_tag == '500': # repeatable
        subfields = rec_data.split('|')
        for sbf in subfields:
            if sbf != "":
                if (sbf[0] == 'a') & (esta(sbf[1:], record.datafield500a) == -1):
                    record.datafield500a.append(sbf[1:].replace('&', '&amp;'))
    elif marc_tag == '700': # repeatable
        subfields = rec_data.split('|')
        for sbf in subfields:
            if sbf != "":
                if (sbf[0] == 'a') & (esta(sbf[1:], record.datafield700a) == -1):
                    record.datafield700a.append(sbf[1:].replace('&', '&amp;'))
                elif (sbf[0] == 'e') & (esta(sbf[1:], record.datafield700e) == -1):
                    record.datafield700e.append(sbf[1:].replace('&', '&amp;'))
    elif marc_tag == '752':
        subfields = rec_data.split('|')
        for sbf in subfields:
            if sbf != "":
                if sbf[0] == 'a':
                    record.datafield752a = sbf[1:].replace('&', '&amp;')
                elif sbf[0] == 'e':
                    record.datafield752e = sbf[1:].replace('&', '&amp;')
    elif marc_tag == '856': # repeatable
        subfields = rec_data.split('|')
        for sbf in subfields: # for each |u there should be a |z
            # make sure it is not the link to the LizardTech DjVu plugin...
            if (sbf != "") and (sbf.rfind('http://www.lizardtech.es') == -1) and (sbf.rfind('Para ver el documento necesita') == -1) and (sbf.rfind('img src=/screens/logo/djvu.png border') == -1):
                if sbf[0] == 'u':
                    if (esta(sbf[1:], record.datafield8564u) == -1):
                        record.datafield8564u.append(sbf[1:].replace('&', '&amp;'))
                        return
                if sbf[0] == 'z':
                    record.datafield8564z.append(sbf[1:].replace('&', '&amp;'))
 
# Returns -1 if 'cadena' is NOT in the strings 'array'
#           0 in other cases
def esta(cadena, array):
  i = 0
  encontrado = -1
  # linear search over the array
  while (encontrado == -1) & (i < len(array)):
    if array[i] == cadena:
      encontrado = 0
    i = i + 1
  return encontrado

# Generates in sys.argv[2] the MARCXML for the records created since the date
# given in sys.argv[1] (format dd/mm/yyyy)
def importar_los_creados():
    if len(sys.argv) != 3:
        print "Usage: %s <dd/mm/yyyy> <output.xml>" % sys.argv[0]
        return
    salida = "Getting FH records with a link to zaguan since date: '%(fecha)s' \nCreating a MARCXML file with them: '%(fich)s'" % { 'fecha': sys.argv[1], 'fich': sys.argv[2] }
    #os.environ["NLS_LANG"] = "SPANISH_SPAIN.AL32UTF8"

    print salida
    consulta_creados(sys.argv[1], sys.argv[2])
 
# Generates in sys.argv[2] the MARCXML for the records modified between the date given
# in sys.argv[1] and the current date.
# Date format: dd/mm/yyyy, with the month as a number
def importar_los_modificados():
  if len(sys.argv) != 4:
    print "Usage: %s <dd/mm/yyyy> <new.xml> <modified.xml>" % sys.argv[0]
    return
  salida = "Getting MODIFIED FH records with a link to zaguan since date: '%(fecha)s'" % { 'fecha': sys.argv[1] }
  print salida
  consulta_modificados(sys.argv[1], sys.argv[2], sys.argv[3])
 
# Queries the DB and fills the files with the records in XML
# The records must satisfy:
#      1. CREATION DATE > fecha_ini (and, logically, earlier than the current date)
#      2. They HAVE a URL pointing to ZAGUAN (856 like %zaguan%)
#      3. They belong to FONDO ANTIGUO (bcode3=a)
#      4. They are MONOGRAPHS (bib_lvl=m)
#
# Dumps the output to fich_salida_nuevos (records to load with bibupload -i)
#                 and fich_salida_modificados (records to load with bibupload -r)
def consulta_modificados(fecha_ini, fich_salida_nuevos, fich_salida_modificados):
  dsn = cx_Oracle.makedsn('155.210.5.40', 1521, 'IIIDB')
  connection = cx_Oracle.connect('III', 'III', dsn)
  cursor = connection.cursor()
  cursor.arraysize = 50
  cadena_url = "zaguan"
  # convert the dd/mm/yyyy date into a DDMMYYYY string
  dia = fecha_ini[:2]
  mes = fecha_ini[3:5]
  anyo = fecha_ini[6:]
  cadena_fecha = dia+mes+anyo
  stmt = """ SELECT v2.rec_key, v2.MARC_TAG, v2.INDICATOR1, v2.INDICATOR2, v2.REC_DATA FROM biblio2base b, locations2 l, var_fields2 v1, var_fields2 v2 WHERE b.bcode3='a' AND b.bib_lvl='m' AND """
  # keep the date as a zero-padded string so days/months below 10 are not truncated
  stmt += """ b.updated >= to_date('%s', 'DDMMYYYY' ) AND""" % (cadena_fecha)
  stmt += """  b.REC_KEY=l.REC_KEY AND
	       l.location LIKE '100%' AND
	       b.REC_KEY=v1.REC_KEY AND
	       v1.MARC_TAG LIKE '856' AND"""
  stmt += """  v1.REC_DATA LIKE '%s' AND """ %('%'+cadena_url+'%')
  stmt += """  v1.REC_KEY=v2.REC_KEY ORDER BY v2.rec_key, v2.marc_tag"""
  cursor.execute(stmt)
  fila = cursor.fetchone()
  out = ""
  out_mod = ""
  while (fila != None):
   record001tag = existe_en_zaguan(fila[0])
   if (record001tag == -1): # it does not exist in ZAGUAN => add it to the new-records file
       print "Processing record %s - does not exist in zaguan -" % (fila[0])
       record1 = make_fh_record(fila[0]) # pass it the rec_key
       interpreta_valores(record1, fila[0], fila[1], fila[2], fila[3], fila[4])
       # consume every remaining row belonging to the same rec_key
       fila2 = cursor.fetchone()
       while (fila2 != None) and (fila[0] == fila2[0]):
          interpreta_valores(record1, fila2[0], fila2[1], fila2[2], fila2[3], fila2[4])
          fila2 = cursor.fetchone()
       # at this point either fila2 belongs to the next record or there are no more rows
       out += crea_marcxml_fh(record1)
       global cuantosnew
       cuantosnew += 1
       fila = fila2
   else: # the record EXISTS in Zaguan => add it to the records-to-modify file
       print "Processing record %s - exists in zaguan as id001=%s -" % (fila[0], record001tag)
       record2 = make_fh_record(fila[0]) # pass it the rec_key
       interpreta_valores(record2, fila[0], fila[1], fila[2], fila[3], fila[4])
       # consume every remaining row belonging to the same rec_key
       fila2 = cursor.fetchone()
       while (fila2 != None) and (fila[0] == fila2[0]):
          interpreta_valores(record2, fila2[0], fila2[1], fila2[2], fila2[3], fila2[4])
          fila2 = cursor.fetchone()
       record2.controlfield001 = record001tag
       out_mod += crea_marcxml_fh(record2)
       global cuantosmod
       cuantosmod += 1
       fila = fila2
  # at this point, fila is None
  concatena_a_fichero(fich_salida_nuevos, out)
  escribe_a_fichero(fich_salida_modificados, out_mod)
 
# Creates an empty record, filling only field 907a with the rec_key
def make_fh_record(identificador):
    controlfield001 = ""
    controlfield008 = ""
    datafield040a = ""
    datafield040d = ""
    datafield100a = ""
    datafield245a = ""
    datafield245c = ""
    datafield260a = ""
    datafield260b = ""
    datafield260c = ""
    datafield300a = ""
    datafield300b = ""
    datafield300c = ""
    datafield500a = []
    datafield538a = []
    datafield700a = []
    datafield700e = []
    datafield752a = ""
    datafield752d = ""
    datafield907a = "." + identificador
    datafield907b = ""
    datafield907c = ""
    datafield945a = ""
    datafield945t = ""
    datafield945y = ""
    datafield945z = ""
    datafield980a = ""
    datafield998a = ""
    datafield998b = ""
    datafield998f = ""
    datafield998g = ""
    datafield8564u = []
    datafield8564z = []
    datafield538a.append("System requirements: PC, World Wide Web Browser and DJVU reader")
    datafield538a.append("Available electronically via Internet")
    record = RegistroFH(
    		 controlfield001,
                 controlfield008,
                 datafield040a,
		 datafield040d,
                 datafield100a,
                 datafield245a,
                 datafield245c,
                 datafield260a,
                 datafield260b,
                 datafield260c,
                 datafield300a,
                 datafield300b,
                 datafield300c,
                 datafield500a,
                 datafield538a,
                 datafield700a,
                 datafield700e,
                 datafield752a,
                 datafield752d,
                 datafield907a,
                 datafield907b,
                 datafield907c,
                 datafield945a,
                 datafield945t,
                 datafield945y,
                 datafield945z,
                 datafield980a,
                 datafield998a,
                 datafield998b,
                 datafield998f,
                 datafield998g,
                 datafield8564u,
		 datafield8564z
                 );
    return record
 
# Returns the MARCXML (filled with the project's own fields)
# for an __ALREADY CREATED__ record (of type RegistroFH).
# Some fields are repeatable.
def crea_marcxml_fh(registro):
    cadena = ""
    if len(registro.datafield8564u) == 0: return cadena
    # continue only if the record HAS a URL...
    cadena += '<record>\n'
    if registro.controlfield001 != "":
        cadena += '\t<controlfield tag="001">%(controlfield)s</controlfield>\n' % { 'controlfield' : registro.controlfield001 }
    if registro.controlfield008 != "":
        cadena += '\t<controlfield tag="008">%(controlfield)s</controlfield>\n' % { 'controlfield' : registro.controlfield008 }
    if registro.datafield040a != "":
        cadena += '\t<datafield tag="040" ind1=" " ind2=" ">\n\t\t<subfield code="a">%(datafield040a)s</subfield>\n' % { 'datafield040a' : registro.datafield040a }
        if registro.datafield040d != "":
            cadena += '\t\t<subfield code="d">%(datafield040d)s</subfield>\n' % { 'datafield040d' : registro.datafield040d }
        cadena += '\t</datafield>\n'
    if registro.datafield100a != "":
        cadena += '\t<datafield tag="100" ind1=" " ind2=" ">\n\t\t<subfield code="a">%(datafield100a)s</subfield>\n\t</datafield>\n' % { 'datafield100a' : registro.datafield100a }
    if registro.datafield245a != "":
        cadena += '\t<datafield tag="245" ind1=" " ind2=" ">\n\t\t<subfield code="a">%(datafield245a)s</subfield>\n\t</datafield>\n' % { 'datafield245a' : registro.datafield245a }
    if registro.datafield260a != "":
        cadena += '\t<datafield tag="260" ind1=" " ind2=" ">\n\t\t<subfield code="a">%(datafield260a)s</subfield>\n' % { 'datafield260a' : registro.datafield260a }
        if registro.datafield260b != "":
            cadena += '\t\t<subfield code="b">%(datafield260b)s</subfield>\n' % { 'datafield260b' : registro.datafield260b }
        if registro.datafield260c != "":
            cadena += '\t\t<subfield code="c">%(datafield260c)s</subfield>\n' % { 'datafield260c' : registro.datafield260c }
        cadena += '\t</datafield>\n'
    if registro.datafield300a != "":
        cadena += '\t<datafield tag="300" ind1=" " ind2=" ">\n\t\t<subfield code="a">%(datafield300a)s</subfield>\n\t</datafield>\n' % { 'datafield300a' : registro.datafield300a }
    # Notes (stored in the datafield500a string array)
    for contador in range(len(registro.datafield500a)):
        cadena += '\t<datafield tag="500" ind1=" " ind2=" ">\n\t\t<subfield code="a">%(comentario_i)s</subfield>\n\t</datafield>\n' % { 'comentario_i': registro.datafield500a[contador] }
    for contador in range(len(registro.datafield538a)):
        cadena += '\t<datafield tag="538" ind1=" " ind2=" ">\n\t\t<subfield code="a">Texto completo %(etiq538)s</subfield>\n\t</datafield>\n' % { 'etiq538': registro.datafield538a[contador] }
    if (len(registro.datafield700a) > 0) and (len(registro.datafield700a) == len(registro.datafield700e)):
        for i in range(len(registro.datafield700a)):
            cadena += '\t<datafield tag="700" ind1=" " ind2=" ">\n\t\t<subfield code="a">%(700a_i)s</subfield>\n\t\t<subfield code="e">%(700e_i)s</subfield>\n\t</datafield>\n' % { '700a_i': registro.datafield700a[i], '700e_i': registro.datafield700e[i] }
    if registro.datafield752a != "":
        cadena += '\t<datafield tag="752" ind1=" " ind2=" ">\n\t\t<subfield code="a">%(datafield752a)s</subfield>\n' % { 'datafield752a' : registro.datafield752a }
        if registro.datafield752d != "":
            cadena += '\t\t<subfield code="d">%(datafield752d)s</subfield>\n' % { 'datafield752d' : registro.datafield752d }
        cadena += '\t</datafield>\n'
    for contador in range(len(registro.datafield8564u)):
        try:
            cadena += '\t<datafield tag="856" ind1="4" ind2=" ">\n\t\t<subfield code="z">%(texto)s</subfield>\n\t\t<subfield code="u">%(url)s</subfield>\n\t</datafield>\n' % { 'url': registro.datafield8564u[contador], 'texto' : registro.datafield8564z[contador] }
        except:
            # no matching |z text: fall back to a generic label
            cadena += '\t<datafield tag="856" ind1="4" ind2=" ">\n\t\t<subfield code="z">Texto completo</subfield>\n\t\t<subfield code="u">%(url)s</subfield>\n\t</datafield>\n' % { 'url': registro.datafield8564u[contador] }
    if registro.datafield907a != "":
        cadena += '\t<datafield tag="907" ind1=" " ind2=" ">\n\t\t<subfield code="a">%(datafield907a)s</subfield>\n' % { 'datafield907a' : registro.datafield907a }
        if registro.datafield907b != "":
            cadena += '\t\t<subfield code="b">%(datafield907b)s</subfield>\n' % { 'datafield907b' : registro.datafield907b }
        if registro.datafield907c != "":
            cadena += '\t\t<subfield code="c">%(datafield907c)s</subfield>\n' % { 'datafield907c' : registro.datafield907c }
        cadena += '\t</datafield>\n'
    if registro.datafield998a != "":
        cadena += '\t<datafield tag="998" ind1=" " ind2=" ">\n\t\t<subfield code="a">%(datafield998a)s</subfield>\n' % { 'datafield998a' : registro.datafield998a }
        if registro.datafield998b != "":
            cadena += '\t\t<subfield code="b">%(datafield998b)s</subfield>\n' % { 'datafield998b' : registro.datafield998b }
        if registro.datafield998f != "":
            cadena += '\t\t<subfield code="f">%(datafield998f)s</subfield>\n' % { 'datafield998f' : registro.datafield998f }
        if registro.datafield998g != "":
            cadena += '\t\t<subfield code="g">%(datafield998g)s</subfield>\n' % { 'datafield998g' : registro.datafield998g }
        cadena += '\t</datafield>\n'

    if len(registro.datafield8564u) > 0:
        # there must be at least one URL for the XML to be generated...
        cadena += '\t<datafield tag="980" ind1=" " ind2=" ">\n\t\t<subfield code="a">FH</subfield>\n\t</datafield>\n</record>\n'
    return cadena
 
# Writes the given content to the file <filename>.
# If the file exists, it is deleted and created anew. __DOES NOT APPEND__
def escribe_a_fichero(filename, contenidofichero):
  # Open the file 'filename' for writing
  file_handle = open(filename, 'w')
  # Write contenidofichero to the file 'filename'
  file_handle.write(contenidofichero)
  # close the file 'filename'
  file_handle.close()

# Appends the given content to the file <filename>.
# If the file does not exist, it is created. __APPENDS__
def concatena_a_fichero(filename, contenidofichero):
  # Open the file 'filename' for appending
  file_handle = open(filename, 'a')
  # Write contenidofichero to the file 'filename'
  file_handle.write(contenidofichero)
  # close the file 'filename'
  file_handle.close()
 
# Reports the results of the query process
def informa_resultados():
  print "Total records processed: %d, of which" %(cuantosnew + cuantosmod)
  print "NEW: %d" %(cuantosnew)
  print "MODIFIED: %d" %(cuantosmod)

# MAIN -------------------------------------------------
cuantosmod = 0
cuantosnew = 0
importar_los_modificados()
informa_resultados()


The last thing that remains is to add a cron task that calls the bash script. My crontab looks like this:

sudo -u apache crontab -l

# Added by Miguel -------------------
SHELL=/bin/bash
PATH=$PATH:$HOME/bin
ORACLE_HOME=/usr/lib/oracle/11.1/client64
LD_LIBRARY_PATH=/usr/lib/oracle/11.1/client64/lib
# ----------------------------------------------
#Mins  Hours  Days   Months  Day-of-week
# Run on the 15th of each month at 02:30
30 02 15 * * /home/apache/importaFHdesdeRoble/importaFH.sh > /home/apache/importaFHdesdeRoble/salidacron.txt


Please make sure that the apache user's home directory has permissions like these:

ls -l /home/apache
total 3132
-rw-r--r-- 1 apache apache 3137967 Feb 16 13:18 fh_conzaguan.xml
-rw-r--r-- 1 apache apache 32783 Feb 19 09:36 garciala.xml
drwxr-xr-x 7 apache apache 4096 Jun 15 02:30 importaFHdesdeRoble
-rw-r--r-- 1 apache apache 2909 Feb 23 14:37 tbzn.xml


And the importaFHdesdeRoble folder looks like this:

[root@zaguan cdsadmin]# ls -l /home/apache/importaFHdesdeRoble/
total 504
-rw-r--r-- 1 apache apache 27588 Mar 30 13:55 importaFH.py
-rwxr-xr-x 1 apache apache 3516 Apr 2 07:53 importaFH.sh
drwxr-xr-x 2 root root 4096 Apr 16 11:35 log_hechos
drwxr-xr-x 2 apache apache 4096 Mar 30 13:55 logs
drwxr-xr-x 2 apache apache 4096 Mar 30 13:55 salidas_xml

You can download the full code here:
[1] Bash script (called importaFH.sh)
[2] Python program (called importaFH.py)