Invenio: print records in MARCXML format using the Invenio API

Sometimes it is useful to get the MARCXML output of some records.

This example prints the records with recids 18007 to 18200 (both included) in MARCXML format using the Invenio API:

from invenio.search_engine import print_record

salida = ''

# range() excludes its upper bound, so use 18201 to include recid 18200
for recid in range(18007, 18201):
    #print "Record %s" % recid
    salida += print_record(recid, format='xm')

print salida
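
If you want to feed that output to an XML tool, it helps to wrap the concatenated records in a single root element. Here is a minimal sketch of that idea. It does not import Invenio: `fake_print_record` is a hypothetical stand-in for `print_record(recid, format='xm')`, and I am assuming each call returns one `<record>` element as a string.

```python
import xml.etree.ElementTree as ET


def records_to_collection(recids, fetch_record):
    """Join per-record MARCXML strings under one <collection> root."""
    parts = ['<collection xmlns="http://www.loc.gov/MARC21/slim">']
    for recid in recids:
        parts.append(fetch_record(recid))
    parts.append('</collection>')
    return '\n'.join(parts)


def fake_print_record(recid):
    # Hypothetical stand-in for invenio.search_engine.print_record
    return '<record><controlfield tag="001">%d</controlfield></record>' % recid


xml_doc = records_to_collection(range(18007, 18010), fake_print_record)

# Parsing the result is a cheap well-formedness check
root = ET.fromstring(xml_doc)
print(len(root))
```

On a real installation you would pass a small function that calls `print_record` instead of the fake one.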

Introducing MARCXML manipulation tool

If you have to import or export MARCXML records in Invenio, tind.io offers a great online utility for manipulating MARCXML: https://tools.tind.io/xml/xml-manipulation/

Exporting MARC is useful not only during migrations, but also when you have to change a lot of records in your Invenio system. It is way better than making those changes at the database level.

You can export records using the web interface or the command line. I prefer the second method, using BibExport:

First, edit this config file: /opt/invenio/etc/bibexport/marcxml.cfg

The MARCXML export method exports all the records matching a particular search query, zips them and moves them to the requested folder. The output is similar to what you would get by listing the records in MARCXML from the web search interface.

The default configuration is shown below. With it, the job exports all records from the Book collection into one XML file and all articles by the author “Polyakov, A M” into another.

[export_job]
export_method = marcxml
[export_criterias]
books = 980__a:BOOK
polyakov_articles = 980__a:ARTICLE and author:"Polyakov, A M"
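
Before running the job, I like to double-check which queries the file really defines. This is a small sketch using the Python 3 standard library `configparser` (nothing from Invenio, and I am not claiming BibExport parses the file this way). `SAMPLE_CFG` is just the configuration above pasted into a string; on a real system you would read the file instead.

```python
import configparser

SAMPLE_CFG = '''
[export_job]
export_method = marcxml
[export_criterias]
books = 980__a:BOOK
polyakov_articles = 980__a:ARTICLE and author:"Polyakov, A M"
'''


def list_export_criteria(cfg_text):
    """Return {criteria name: search query} from the [export_criterias] section."""
    # interpolation=None keeps any '%' in a query from being treated specially
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return dict(parser.items('export_criterias'))


criteria = list_export_criteria(SAMPLE_CFG)
for name, query in sorted(criteria.items()):
    print("%s: %s" % (name, query))
```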

The job is run with this command:

/opt/invenio/bin/bibexport -u admin -wmarcxml

The default output folder is:

/opt/invenio/var/www/export/marcxml

CDS Invenio: batch delete records or an interval of records (from the Python interpreter)

Some time ago I came up with this little hack to give Invenio the ability to delete a record from the command line.

If you need to delete a lot of records (e.g. on your testing/development server), you can add this other hack to bibeditcli.py:

Delete several records from invenio: the dirty way

This works, but it is not necessarily the way to go. There is another way to achieve the same result (records deleted) that does not overload BibSched with a task for each record. We’ll go over that one later, though.

First things first: let’s go the dirrrrrty way:

def cli_delete_interval(recid_inicio, recid_fin):
    """
    Delete records from recid_inicio to recid_fin, both included
    You'd better make sure...
    """
    try:
        recid_inicio = int(recid_inicio)
    except ValueError:
        print "ERROR: First record ID must be an integer, not %s." % recid_inicio
        sys.exit(1)
    try:
        recid_fin = int(recid_fin)
    except ValueError:
        print "ERROR: End record ID must be integer, not %s." %recid_fin
        sys.exit(1)
 
    if recid_inicio > recid_fin:
        print "ERROR: First record ID must not be greater than last record ID."
        sys.exit(1)

    # range() excludes its upper bound; add 1 so recid_fin is deleted too
    for recid in range(recid_inicio, recid_fin + 1):
        (record, junk) = get_record(CFG_SITE_LANG, recid, 0, "false")
        add_field(recid, 0, record, "980", "", "", "c", "DELETED")
        save_temp_record(record, 0, "%s.tmp" % get_file_path(recid))
        save_xml_record(recid)
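
The part that is easy to get wrong here is the interval itself, because `range()` stops one short of its second argument. As a sketch of that check on its own (no Invenio needed; `inclusive_recids` is a name I made up for illustration), you can validate the two IDs and build the list before touching any record:

```python
def inclusive_recids(first, last):
    """Return the record IDs from first to last, both included."""
    first, last = int(first), int(last)  # raises ValueError on non-integer input
    if first > last:
        raise ValueError("first record ID %d is greater than last record ID %d"
                         % (first, last))
    # range() excludes its upper bound, hence the + 1
    return list(range(first, last + 1))


recids = inclusive_recids("5125", 5128)
print(recids)
```

Raising `ValueError` instead of calling `sys.exit(1)` is a design choice for this sketch: it keeps your interpreter session alive if you mistype an ID.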

This is how you call the new function from Python.
First, navigate to $PATH_TO_INVENIO/lib/python and start the Python interpreter:

[miguel@mydevinvenioinstance ~]# cd /soft/cds-invenio/lib/
[miguel@mydevinvenioinstance lib]# python
[GCC 4.1.2 20080704 (Red Hat 4.1.2-50)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>

And then, just…

>>> import invenio
>>> from invenio.bibeditcli import cli_delete_interval
>>> # the following line will delete records from ID=5125 to ID=7899 .... 
>>> # BE CAREFUL! GREAT POWER COMES WITH GREAT RESPONSIBILITY
>>> 
>>> cli_delete_interval(5125,7899)

Delete several records from Invenio: the not-so-dirty way

If you take a look at the cli_delete_interval we just came up with, or run it over a big interval, you will see that it generates a whole lot of tmp files and sends a lot of tasks to BibSched (one for every record). Not efficient. Not nice.

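This code is better: just one tmp file and one single task sent to BibSched. The line that removes the tmp file is commented out in the code, so delete the file yourself once the task has run.
Please notice the # EDIT HERE! marker near the top of the function: tmpfile must point to a path you can read and write.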

def cli_delete_interval(recid_inicio, recid_fin):
    """
    By: Miguel Martin 20120130 
    Goal:
      Delete records from recid_inicio to recid_fin, both included
      Creates just a tmp file and a task (just one) is sent to bibsched
    """
 
    from invenio.bibrecord import record_xml_output
    from invenio.bibtask import task_low_level_submission
 
    # EDIT HERE! FILEPATH MUST BE READABLE/WRITABLE! ######
    tmpfile = "/home/miguelm/tmp/borrado.xml" 
    # #####################################################
 
    try:
        recid_inicio = int(recid_inicio)
    except ValueError:
        print "ERROR: First record ID must be an integer, not %s." % recid_inicio
        sys.exit(1)
    try:
        recid_fin = int(recid_fin)
    except ValueError:
        print "ERROR: End record ID must be integer, not %s." %recid_fin
        sys.exit(1)
 
    if recid_inicio > recid_fin:
        print "ERROR: First record ID must not be greater than last record ID."
        sys.exit(1)

    fd = open(tmpfile, "w")
    # range() excludes its upper bound; add 1 so recid_fin is deleted too
    for recid in range(recid_inicio, recid_fin + 1):
        (record, junk) = get_record(CFG_SITE_LANG, recid, 0, "false")
        add_field(recid, 0, record, "980", "", "", "c", "DELETED")
        fd.write(record_xml_output(record))
 
    fd.close()
    task_low_level_submission('bibupload', 'bibedit', '-P', '5', '-r', '%s' % tmpfile)
    #os.system("rm %s" % tmpfile)
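
To get a feeling for what ends up in that tmp file, here is a rough sketch that builds similar MARCXML with the Python 3 standard library only. It is a simplification and an assumption on my side: the real function writes each complete record plus the extra 980 field, while this sketch keeps only the record ID (controlfield 001) and the 980 $c DELETED marker. The function names are made up for illustration.

```python
import xml.etree.ElementTree as ET


def deletion_stub(recid):
    """Build a <record> holding only the ID and the 980 $c DELETED marker."""
    record = ET.Element('record')
    control = ET.SubElement(record, 'controlfield', tag='001')
    control.text = str(recid)
    field = ET.SubElement(record, 'datafield', tag='980', ind1=' ', ind2=' ')
    subfield = ET.SubElement(field, 'subfield', code='c')
    subfield.text = 'DELETED'
    return record


def deletion_collection(first, last):
    """Serialize stubs for recids first..last (both included) under one root."""
    root = ET.Element('collection')
    for recid in range(first, last + 1):
        root.append(deletion_stub(recid))
    return ET.tostring(root, encoding='unicode')


xml_text = deletion_collection(5125, 5127)
print(xml_text)
```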

Cheers!

— update 20130628 —

Delete all the records

If you want to wipe out all the existing bibliographic content of your site, for example to start uploading the documents from scratch again, you can launch:

       $ /opt/invenio/bin/dbexec < /opt/invenio/src/invenio-0.90/modules/miscutil/sql/tabbibclean.sql
       $ rm -rf /opt/invenio/var/data/files/*
       $ /opt/invenio/bin/webcoll
       $ /opt/invenio/bin/bibindex --reindex

Invenio 1 API: list all restricted records and (restricted/public) collections

Let’s take a quick look at some of the functions provided by the Invenio 1 search_engine API:

Get all the collection names…

>>> import invenio.search_engine
>>> collection_reclist_cache = invenio.search_engine.CollectionRecListDataCacher()
>>> print collection_reclist_cache.cache.keys()

Get all the restricted collections (the ones which have access restrictions configured via webaccess-viewrestrcoll authorizations):

>>> import invenio.search_engine
>>> restricted_collection_cache = invenio.search_engine.RestrictedCollectionDataCacher()
>>> print restricted_collection_cache.cache

Get all the records which belong to restricted collections

>>> import invenio.search_engine
>>> restricted_collection_cache = invenio.search_engine.RestrictedCollectionDataCacher()
>>> for collection in restricted_collection_cache.cache:
...      print "Collection: '%s'" % collection
...      print "Records: '%s'" % repr(invenio.search_engine.get_collection_reclist(collection))
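
The loop above prints the records collection by collection. If what you want is one flat list of restricted record IDs, the remaining step is a set union. Here is a sketch with plain Python data, so it runs without Invenio: `sample_restricted` and `sample_reclists` are hypothetical stand-ins for the restricted collection cache and for what `get_collection_reclist` returns, and I am assuming those can be iterated like ordinary collections of names and integers.

```python
def restricted_recids(restricted_collections, reclist_by_collection):
    """Return the sorted union of record IDs over all restricted collections."""
    recids = set()
    for name in restricted_collections:
        # A record that sits in two restricted collections is counted once
        recids |= set(reclist_by_collection.get(name, ()))
    return sorted(recids)


# Hypothetical data standing in for the Invenio caches
sample_reclists = {'Theses': [1, 2, 3], 'Drafts': [3, 4], 'Books': [5, 6]}
sample_restricted = ['Theses', 'Drafts']

all_restricted = restricted_recids(sample_restricted, sample_reclists)
print(all_restricted)
```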

CDS-Invenio: List DELETED RECORDS from collection name – python script

Some time ago I posted some useful CDS Invenio MySQL queries. A few days ago some people wrote asking whether I could put this into a function. Here you are 😉

List all deleted records which belong(ed) to collection ‘colname’ (tested with Invenio 0.99.x)

def listDeletedFromCollname(colname, coltag='980__c'):
  from invenio.dbquery import run_sql
  # Pass coltag and colname as query parameters instead of formatting them
  # into the SQL string, so the database driver takes care of the quoting.
  query = """ SELECT DISTINCT bibrec_bib98x.id_bibrec
              FROM bib98x, bibrec_bib98x
              WHERE bib98x.tag = %s AND bib98x.value = %s AND bib98x.id = bibrec_bib98x.id_bibxxx
                    AND bibrec_bib98x.id_bibrec IN (SELECT DISTINCT bibrec_bib98x.id_bibrec
                                      FROM bibrec_bib98x, bib98x
                                      WHERE bibrec_bib98x.id_bibxxx = bib98x.id AND
                                            bib98x.tag = '980__c' AND
                                            bib98x.value = 'DELETED') """
  res = run_sql(query, (coltag, colname))
  # Each result row is a tuple; the record ID is its first column
  return [row[0] for row in res]

And it is used like:

# list deleted records from a collection defined by the query '980__c:FH'
a = listDeletedFromCollname('FH')
print a
 
# if I have my collection defined in another way, for instance using this query: '980__a:TAZ'
a = listDeletedFromCollname('TAZ','980__a')
print a

In a similar fashion we can list the ones that have NOT been deleted, as in:

def listNotDeletedFromCollname(colname, coltag='980__c'):
  from invenio.dbquery import run_sql
  # Same query as before, but with NOT IN to keep the records that do not
  # carry the 980__c:DELETED marker. Values are passed as query parameters.
  query = """ SELECT DISTINCT bibrec_bib98x.id_bibrec
              FROM bib98x, bibrec_bib98x
              WHERE bib98x.tag = %s AND bib98x.value = %s AND bib98x.id = bibrec_bib98x.id_bibxxx
                    AND bibrec_bib98x.id_bibrec NOT IN (SELECT DISTINCT bibrec_bib98x.id_bibrec
                                      FROM bibrec_bib98x, bib98x
                                      WHERE bibrec_bib98x.id_bibxxx = bib98x.id AND
                                            bib98x.tag = '980__c' AND
                                            bib98x.value = 'DELETED') """
  res = run_sql(query, (coltag, colname))
  # Return the list built here (the name deletedlist is not defined in this function)
  return [row[0] for row in res]
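
If you want to play with the logic of these two queries without an Invenio database at hand, here is a toy version on an in-memory SQLite database. Everything in it is an assumption for illustration: the two tables are cut down to the columns the query touches, the sample rows are invented, and SQLite uses `?` placeholders where MySQL uses `%s`.

```python
import sqlite3

QUERY = """
SELECT DISTINCT rel.id_bibrec
FROM bib98x AS val JOIN bibrec_bib98x AS rel ON val.id = rel.id_bibxxx
WHERE val.tag = ? AND val.value = ?
  AND rel.id_bibrec {negate} IN (
      SELECT r2.id_bibrec
      FROM bib98x AS v2 JOIN bibrec_bib98x AS r2 ON v2.id = r2.id_bibxxx
      WHERE v2.tag = '980__c' AND v2.value = 'DELETED')
ORDER BY rel.id_bibrec
"""


def list_from_collname(conn, colname, coltag='980__c', deleted=True):
    """List record IDs of a collection, deleted ones or the rest."""
    # Only the IN / NOT IN keyword is formatted in; values go in as parameters
    sql = QUERY.format(negate='' if deleted else 'NOT')
    return [row[0] for row in conn.execute(sql, (coltag, colname))]


conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE bib98x (id INTEGER PRIMARY KEY, tag TEXT, value TEXT);
CREATE TABLE bibrec_bib98x (id_bibrec INTEGER, id_bibxxx INTEGER);
INSERT INTO bib98x VALUES (1, '980__a', 'TAZ');
INSERT INTO bib98x VALUES (2, '980__c', 'DELETED');
INSERT INTO bibrec_bib98x VALUES (10, 1);
INSERT INTO bibrec_bib98x VALUES (11, 1);
INSERT INTO bibrec_bib98x VALUES (11, 2);
INSERT INTO bibrec_bib98x VALUES (12, 2);
""")

deleted = list_from_collname(conn, 'TAZ', '980__a', deleted=True)
alive = list_from_collname(conn, 'TAZ', '980__a', deleted=False)
print(deleted, alive)
```

In the sample data records 10 and 11 belong to TAZ and records 11 and 12 carry the DELETED marker, so only record 11 is both in the collection and deleted.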

Hope it’s useful 🙂