Invenio: print records in marcxml format using invenio API

Sometimes it is useful to get the MARCXML output of a set of records.

This example shows how to print the records with recids 18007 to 18200 in MARCXML format using the Invenio API:

from invenio.search_engine import print_record

salida = ''

# range() excludes its upper bound, so 18201 is needed to include recid 18200
for recid in range(18007, 18201):
    #print "Record %s" % recid
    salida += print_record(recid, format='xm')

print salida
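
Note that several <record> elements concatenated one after another do not form a well-formed XML document on their own, because there is no single root element. The following is a minimal sketch of wrapping them in a MARCXML <collection> root. It is written for Python 3 and is independent of Invenio. It assumes each fragment is a bare <record> string, which is what print_record seems to return for format='xm'; the helper name wrap_in_collection is made up for this illustration.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"


def wrap_in_collection(record_fragments):
    """Join bare <record> fragments under one <collection> root (hypothetical helper)."""
    body = "\n".join(fragment.strip() for fragment in record_fragments)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<collection xmlns="%s">\n%s\n</collection>\n' % (MARC_NS, body)
    )


if __name__ == "__main__":
    # Stand-in fragments; in Invenio these would come from print_record(recid, format='xm')
    fragments = [
        '<record><controlfield tag="001">18007</controlfield></record>',
        '<record><controlfield tag="001">18008</controlfield></record>',
    ]
    document = wrap_in_collection(fragments)
    # Parsing the result confirms that it is well-formed
    root = ET.fromstring(document.encode("utf-8"))
    print("records in collection:", len(root))
```

The result can be written to a file and opened with any XML tool.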

Introducing MARCXML manipulation tool

If you need to import or export MARCXML records in Invenio, tind.io offers a handy online utility for manipulating MARCXML: https://tools.tind.io/xml/xml-manipulation/

Exporting MARC is useful not only during migrations, but also when you need to change a lot of records in your Invenio system. It is much better than making those changes at the database level.

You can export records using the web interface or the command line. I prefer the second method, using BibExport:

First, edit this config file: /opt/invenio/etc/bibexport/marcxml.cfg

The MARCXML exporting method exports all the records matching a particular search query, zips them and moves them to the requested folder. The output of this exporting method is similar to what one would get by listing the records in MARCXML from the web search interface.

The default configuration is given below. This job would export all records from the Book collection into one XML file and all articles with the author “Polyakov, A M” into another.

[export_job]
export_method = marcxml
[export_criterias]
books = 980__a:BOOK
polyakov_articles = 980__a:ARTICLE and author:"Polyakov, A M"
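
If you maintain many export criteria, the file can also be generated instead of edited by hand. This is only a sketch, assuming the file is plain INI syntax as the listing above suggests. The function name build_export_config and the extra "theses" criterion are hypothetical and not part of BibExport.

```python
import configparser
import io


def build_export_config(criteria):
    """Return the text of a marcxml.cfg-style file for the given name -> query mapping."""
    # interpolation=None keeps any '%' in a search query literal
    config = configparser.ConfigParser(interpolation=None)
    config.optionxform = str  # keep criterion names exactly as given
    config["export_job"] = {"export_method": "marcxml"}
    config["export_criterias"] = criteria
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(build_export_config({
        "books": "980__a:BOOK",
        "polyakov_articles": '980__a:ARTICLE and author:"Polyakov, A M"',
        "theses": "980__a:THESIS",  # hypothetical extra criterion
    }))
```

Each key under [export_criterias] becomes one line of the form name = query, as in the default file.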

The job is run with this command:

/opt/invenio/bin/bibexport -u admin -wmarcxml

The default folder for the exported files is:

/opt/invenio/var/www/export/marcxml

OJS (Open Journal Systems) OAI export MARCXML: Removing empty lines

In my previous post I explained how to show author email using OJS.

I just noticed that some records were showing empty tags, like:

<datafield tag="653" ind1=" " ind2=" " >
  <subfield code="a" ></subfield>
</datafield>

You can change OAIMetadataFormat_MARC21.inc.php to avoid this. More precisely, look for the formatElement function and change it to the following (notice the added if ($v != "") check that wraps the output inside the foreach loop):

/**
 * Format XML for single MARC21 element.
 * @param $tag string
 * @param $ind1 string
 * @param $ind2 string
 * @param $code string
 * @param $value mixed
 */
function formatElement($tag, $ind1, $ind2, $code, $value) {
      if (!is_array($value)) {
              $value = array($value);
      }
      $response = '';
      foreach ($value as $v) {
         if ($v != ""){
             $response .= "\t<datafield tag=\"$tag\" ind1=\"$ind1\" ind2=\"$ind2\">\n" .
                 "\t\t<subfield code=\"$code\">" . OAIUtils::prepOutput($v) . "</subfield>\n" .
                 "\t</datafield>\n";
         }
      }
      return $response;
}

If you have any other custom formatElement functions, you should change them too!
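
If you cannot patch OJS itself, the same cleanup can be done on the harvesting side. Below is a minimal Python 3 sketch, not part of OJS, that drops empty subfields and then any datafield left without subfields. The function names are made up for this example, and it assumes the input is a single <record> element, with or without the MARC21 slim namespace.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
# Serialize MARC elements without an "ns0:" prefix when the namespace is present
ET.register_namespace("", MARC_NS)


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def strip_empty_datafields(record_xml):
    """Remove empty subfields, then datafields that have no subfields left."""
    root = ET.fromstring(record_xml)
    for datafield in list(root):
        if local_name(datafield.tag) != "datafield":
            continue
        for subfield in list(datafield):
            is_subfield = local_name(subfield.tag) == "subfield"
            if is_subfield and not (subfield.text or "").strip():
                datafield.remove(subfield)
        if len(datafield) == 0:
            root.remove(datafield)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        '<record>'
        '<datafield tag="245" ind1="0" ind2="0"><subfield code="a">A title</subfield></datafield>'
        '<datafield tag="653" ind1=" " ind2=" "><subfield code="a"></subfield></datafield>'
        '</record>'
    )
    print(strip_empty_datafields(sample))
```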

OJS (Open Journal Systems) OAI export marcxml hacking: adding author email

Let’s see how to change the output of the OAI MARCXML plugin for Open Journal Systems.

OJS prerequisites and considerations

For this tutorial I’ll assume you have set up a journal called ‘tropelias’ in your OJS and that you have the oaiMetadataFormats plugin installed and running.

The default OJS OAI base URL for that journal should then be:
http://zaguan.unizar.es/ojs/index.php/tropelias/oai?verb=

You can use the usual OAI-PMH verbs. For instance, let’s see the default output for http://zaguan.unizar.es/ojs/index.php/tropelias/oai?verb=ListRecords&metadataPrefix=marcxml

It should look something like this:

<record xsi:schemaLocation="http://www.loc.gov/MARC21/slim http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd" >
<leader> cam 3u </leader>
<controlfield tag="008" >"110226 2011 eng "</controlfield>
<datafield tag="042" ind1=" " ind2=" " >
<subfield code="a" >dc</subfield>
</datafield>
<datafield tag="245" ind1="0" ind2="0" >
<subfield code="a" >Apolo y Dionisos en tres obras de Thomas Mann: Muerte en Venecia, La montaña mágica, Mario y el mago.</subfield>
</datafield>
<datafield tag="720" ind1=" " ind2=" " >
<subfield code="a" >Alfonso Matute, Nuria; Universidad de Zaragoza</subfield>
</datafield>
<datafield tag="653" ind1=" " ind2=" " >
<subfield code="a" ></subfield>
</datafield>
<datafield tag="520" ind1=" " ind2=" " >
<subfield code="a" ><p><em>Este art&iacute;culo pretende realizar un acercamiento a algunos temas omnipresentes en la literatura de Thomas Mann. Estos motivos se relacionan con la influencia de Nietzsche sobre el autor, y se apoyan en su concepci&oacute;n de lo apol&iacute;neo y lo dionisiaco, que en Thomas Mann se manifiesta en una lucha continua entre el caos y la contenci&oacute;n; lucha que se resuelve de forma inevitable con el triunfo del caos. A trav&eacute;s del estudio de elementos como el tiempo, los mundos irracionales, el h&eacute;roe decadente, la muerte, la enfermedad, las culturas opuestas, o la androginia y el homoerotismo, se ha intentado buscar esta conexi&oacute;n en tres obras de Thomas Mann: </em>Muerte en Venecia<em>, </em>La monta&ntilde;a m&aacute;gica <em>y </em>Mario y el mago<em>.</em></p> <p><em>Dieser Artikel versucht eine Ann&auml;herung an einige allgegenw&auml;rtige Aspekte in Thomas Mann&rsquo;s Literatur. Themen, die auf den sind Einflluss zur&uuml;ckgehen, den Nieztsche auf den Autor ausge&uuml;bt hat. Es wird vorwiegend der Gegensatz zwischen dem &ldquo;Apolinischen&rdquo; und dem &ldquo;Dionisischen&rdquo; behandelt, jenen Konzepten, die sich in Thomas Mann&rsquo;s Werken durch den ewigen Kampf zwischen M&auml;&szlig;igung und Chaos offenbaren. Dabei l&ouml;st sich der Kampf immer unvermeidlich mit dem Sieg des Chaos. Durch die Betrachtung dieser Elemente bzw. der Zeit, der unvern&uuml;nftigen Welten, dem dekadenten Held, der Opposition verschiedener Kulturen oder der Androgynie und homoerotischen Verwandtschaften, wird eine thematische Verbindung in den folgenden drei Werken von Thomas Mann gesucht: </em>Der Tod in Venedig<em>, </em>Der Zauberberg <em>und </em>Mario und der Zauberer<em>.</em></p></subfield>
</datafield>
<datafield tag="260" ind1=" " ind2=" " >
<subfield code="b" >Universidad de Zaragoza</subfield>
</datafield>
<datafield tag="720" ind1=" " ind2=" " >
<subfield code="a" ></subfield>
</datafield>
<datafield tag="260" ind1=" " ind2=" " >
<subfield code="c" >2011-02-26 00:00:00</subfield>
</datafield>
<datafield tag="655" ind1=" " ind2="7" >
<subfield code="a" ></subfield>
</datafield>
<datafield tag="856" ind1=" " ind2=" " >
<subfield code="q" >application/pdf</subfield>
</datafield>
<datafield tag="856" ind1="4" ind2="0" >
<subfield code="u" >http://zaguan.unizar.es/ojs/index.php/tropelias/article/view/1</subfield>
</datafield>
<datafield tag="786" ind1="0" ind2=" " >
<subfield code="n" >Tropelías : Revista de Teoría de la Literatura y Literatura Comparada; ##issue.no## 15-17 (2004)</subfield>
</datafield>
<datafield tag="546" ind1=" " ind2=" " >
<subfield code="a" >es</subfield>
</datafield>
<datafield tag="787" ind1="0" ind2=" " >
<subfield code="n" ></subfield>
</datafield>
<datafield tag="500" ind1=" " ind2=" " >
<subfield code="a" ></subfield>
</datafield>
<datafield tag="500" ind1=" " ind2=" " >
<subfield code="a" ></subfield>
</datafield>
<datafield tag="500" ind1=" " ind2=" " >
<subfield code="a" ></subfield>
</datafield>
<datafield tag="540" ind1=" " ind2=" " >
<subfield code="a" ></subfield>
</datafield>
</record>
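
For scripted harvesting, a request URL like the one used above can be assembled from the base URL, the verb and its arguments. This is a small Python 3 sketch; build_oai_url is a name invented for this example, not an OJS or Invenio function.

```python
from urllib.parse import urlencode


def build_oai_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL: base URL + verb + any extra arguments."""
    query = {"verb": verb}
    query.update(arguments)
    return base_url + "?" + urlencode(query)


if __name__ == "__main__":
    url = build_oai_url(
        "http://zaguan.unizar.es/ojs/index.php/tropelias/oai",
        "ListRecords",
        metadataPrefix="marcxml",
    )
    print(url)
    # The response could then be fetched with urllib.request.urlopen(url)
```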

Hacking the output: step 1

We want to change these lines:

<datafield tag="720" ind1=" " ind2=" " >
  <subfield code="a" >Alfonso Matute, Nuria; Universidad de Zaragoza</subfield>
</datafield>

To:

<datafield tag="100" ind1=" " ind2=" " >
  <subfield code="a" >Alfonso Matute, Nuria; Universidad de Zaragoza</subfield>
</datafield>

This is quite easy. Follow these steps:

cd $OJS_HOME;
vi ./plugins/oaiMetadataFormats/marcxml/OAIMetadataFormat_MARC21.inc.php

Change this line:

$this->formatElement('720', ' ', ' ', 'a', $creators) .

To:

$this->formatElement('100', ' ', ' ', 'a', $creators) .

Save and exit. It should now be working 🙂

Hacking the output: step 2

Now that we know which file to edit and have done the previous test, let’s imagine we want to change from

<datafield tag="720" ind1=" " ind2=" " >
  <subfield code="a" >Alfonso Matute, Nuria; Universidad de Zaragoza</subfield>
</datafield>

To:

<datafield tag="100" ind1=" " ind2=" " >
  <subfield code="a" >Alfonso Matute, Nuria; Universidad de Zaragoza</subfield>
  <subfield code="a" >tropelias@unizar.es</subfield>
</datafield>

in order to show the author’s email.

Follow these steps:

(1) Edit OAIMetadataFormat_MARC21.inc.php:

cd $OJS_HOME;
vi ./plugins/oaiMetadataFormats/marcxml/OAIMetadataFormat_MARC21.inc.php

(2) Delete that file’s contents and paste the following code:

<?php
 
/**
 * @file plugins/oaiMetadataFormats/marcxml/OAIMetadataFormat_MARC21.inc.php
 *
 * Copyright (c) 2003-2010 John Willinsky
 ****** Modified by Miguel Martín González (miguelm[at]unizar[dot]es)
 ****** to add authors email to output
 * Distributed under the GNU GPL v2. For full terms see the file docs/COPYING.
 *
 * @class OAIMetadataFormat_MARC21
 * @ingroup oai_format
 * @see OAI
 *
 * @brief OAI metadata format class -- MARC21 (MARCXML).
 */
 
// $Id$
 
class OAIMetadataFormat_MARC21 extends OAIMetadataFormat {
        /**
         * @see OAIMetadataFormat#toXml
         */
        function toXml(&$record, $format = null) {
 
                // Changed! Debugging aid: comment out the next line to avoid displaying errors on the web  ----------------------------------------
                ini_set('display_errors',true);
                // ---------------------------------------------------------------------------------------------------------
 
                $article =& $record->getData('article');
                $issue =& $record->getData('issue');
                $journal =& $record->getData('journal');
                $section =& $record->getData('section');
                $galleys =& $record->getData('galleys');
 
                // Format creators
                $creators = array();
                // Changed! Let's make an array to store the emails ---------------------------
                $emails = array();
                // --------------------------------------------------------------
                $authors = $article->getAuthors();
                for ($i = 0, $num = count($authors); $i < $num; $i++) {
                        $authorName = $authors[$i]->getFullName(true);
                        $affiliation = $authors[$i]->getLocalizedAffiliation();
                        // Changed! Let's fetch the author email and store it in our emails array ------------------------
                        $emails[] = $authors[$i]->getEmail();
                        // ------------------------------------------------------------------------------------------------------
                        if (!empty($affiliation)) {
                                $authorName .= '; ' . $affiliation;
                        }
                        $creators[] = $authorName;
                }
 
                $subjects = array_merge_recursive(
                        $this->stripAssocArray((array) $article->getDiscipline(null)),
                        $this->stripAssocArray((array) $article->getSubject(null)),
                        $this->stripAssocArray((array) $article->getSubjectClass(null))
                );
                $subject = isset($subjects[$journal->getPrimaryLocale()])?$subjects[$journal->getPrimaryLocale()]:'';
                $publisher = $journal->getLocalizedTitle(); // Default
                $publisherInstitution = $journal->getSetting('publisherInstitution');
                if (!empty($publisherInstitution)) {
                        $publisher = $publisherInstitution;
                }
 
                // Format
                $format = array();
                foreach ($galleys as $galley) {
                        $format[] = $galley->getFileType();
                }
 
                // Sources contains journal title, issue ID, and pages
                $source = $journal->getLocalizedTitle() . '; ' . $issue->getIssueIdentification();
                $pages = $article->getPages();
 
                // Relation
                $relation = array();
                foreach ($article->getSuppFiles() as $suppFile) {
                        $record->relation[] = Request::url($journal->getPath(), 'article', 'download', array($article->getId(), $suppFile->getFileId()));
                }
 
                // Coverage
                $coverage = array(
                        $article->getLocalizedCoverageGeo(),
                        $article->getLocalizedCoverageChron(),
                        $article->getLocalizedCoverageSample()
                );
               $response = "<record\n" .
                        "\txmlns=\"http://www.loc.gov/MARC21/slim\"\n" .
                        "\txmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"\n" .
                        "\txsi:schemaLocation=\"http://www.loc.gov/MARC21/slim\n" .
                        "\thttp://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd\">\n" .
                        "\t<leader>     cam         3u     </leader>\n" .
                        "\t<controlfield tag=\"008\">\"" . date('ymd Y', strtotime($issue->getDatePublished())) . "                        eng  \"</controlfield>\n" .
                        $this->formatElement('042', ' ', ' ', 'a', 'dc') .
                        $this->formatElement('245', '0', '0', 'a', $article->getTitle($journal->getPrimaryLocale())) .
                        // Changed! Let's call a new function to format this complex output ----------------------------------------------------------
                        $this->formatElementMiguel('100', ' ', ' ', 'a', 'b', $creators, $emails) .
                        //  -----------------------------------------------------------------------------------------------------------------------------
                        $this->formatElement('653', ' ', ' ', 'a', $subject) .
                        $this->formatElement('520', ' ', ' ', 'a', $article->getLocalizedAbstract()) .
                        $this->formatElement('260', ' ', ' ', 'b', $publisher) .
                        $this->formatElement('720', ' ', ' ', 'a', strip_tags($article->getLocalizedSponsor())) .
                        $this->formatElement('260', ' ', ' ', 'c', $issue->getDatePublished()) .
                        $this->formatElement('655', ' ', '7', 'a', $section->getLocalizedIdentifyType()) .
                        $this->formatElement('856', ' ', ' ', 'q', $format) .
                        $this->formatElement('856', '4', '0', 'u', Request::url($journal->getPath(), 'article', 'view', array($article->getBestArticleId()))) .
                        $this->formatElement('786', '0', ' ', 'n', $source) .
 
                        $this->formatElement('546', ' ', ' ', 'a', $article->getLanguage()) .
                        $this->formatElement('787', '0', ' ', 'n', $record->relation) .
                        $this->formatElement('500', ' ', ' ', 'a', $coverage) .
                        $this->formatElement('540', ' ', ' ', 'a', strip_tags($journal->getLocalizedSetting('copyrightNotice'))) .
                        "</record>\n";
 
                return $response;
        }
        /**
         * Format XML for single MARC21 element.
         * @param $tag string
         * @param $ind1 string
         * @param $ind2 string
         * @param $code string
         * @param $value mixed
         */
        function formatElement($tag, $ind1, $ind2, $code, $value) {
                if (!is_array($value)) {
                        $value = array($value);
                }
                $response = '';
                foreach ($value as $v) {
                        $response .= "\t<datafield tag=\"$tag\" ind1=\"$ind1\" ind2=\"$ind2\">\n" .
                                "\t\t<subfield code=\"$code\">" . OAIUtils::prepOutput($v) . "</subfield>\n" .
                                "\t</datafield>\n";
                }
                return $response;
        }
 
         // Changed! This function is new! 
        /**
         * Format XML for complex MARC21 element (by Miguel Martin)
         * @param $tag string
         * @param $ind1 string
         * @param $ind2 string
         * @param $code string
         * @param $code2 string
         * @param $value mixed
         * @param $value2 mixed
         */
        function formatElementMiguel($tag, $ind1, $ind2, $code, $code2, $value, $value2) {
                if (!is_array($value)) {
                        $value = array($value);
                }
                if (!is_array($value2)){
                        $value2 = array($value2);
                }
 
                // Check that both arrays have the same length; if not, fall back to the plain output
                if (count($value) != count($value2)) {
                    return $this->formatElement($tag, $ind1, $ind2, $code, $value);
                }
 
                // both arrays have the same number of elements, so we can safely proceed
                $response = '';
                $i = 0;
                foreach ($value as $v) {
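                        // Note: both subfields are emitted with $code, as in the sample output above; $code2 is currently unused
                        $response .= "\t<datafield tag=\"$tag\" ind1=\"$ind1\" ind2=\"$ind2\">\n" .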
                                "\t\t<subfield code=\"$code\">" . OAIUtils::prepOutput($v) . "</subfield>\n" .
                                "\t\t<subfield code=\"$code\">" . OAIUtils::prepOutput($value2[$i]) . "</subfield>\n" .
                                "\t</datafield>\n";
                        $i++;
                }
                return $response;
        }
 
}
 
?>

Save and exit. Should be working like a charm 🙂
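
To make the pairing logic of formatElementMiguel easier to follow, here is the same idea restated as a small Python 3 sketch. It is an illustration only and not OJS code: the function names are made up, and xml.sax.saxutils.escape stands in for OAIUtils::prepOutput. Unlike the PHP above, which writes both subfields with the same code, this sketch uses the second code for the second value; pass the same code twice to reproduce the sample output shown earlier.

```python
from xml.sax.saxutils import escape


def _as_list(value):
    """Accept a single value or a list, like the is_array() checks in the PHP."""
    return list(value) if isinstance(value, (list, tuple)) else [value]


def format_paired_element(tag, ind1, ind2, code1, code2, values1, values2):
    """Emit one datafield per value, pairing values1[i] with values2[i].

    If the two lists differ in length they cannot be paired safely, so only
    the first subfield is written (the same fallback as the PHP version).
    """
    values1, values2 = _as_list(values1), _as_list(values2)
    paired = len(values1) == len(values2)
    lines = []
    for index, first in enumerate(values1):
        lines.append('\t<datafield tag="%s" ind1="%s" ind2="%s">' % (tag, ind1, ind2))
        lines.append('\t\t<subfield code="%s">%s</subfield>' % (code1, escape(first)))
        if paired:
            second = values2[index]
            lines.append('\t\t<subfield code="%s">%s</subfield>' % (code2, escape(second)))
        lines.append('\t</datafield>')
    return "\n".join(lines) + "\n" if lines else ""


if __name__ == "__main__":
    print(format_paired_element(
        "100", " ", " ", "a", "a",
        ["Alfonso Matute, Nuria; Universidad de Zaragoza"],
        ["tropelias@unizar.es"],
    ))
```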

Changing oai_dc output (OAI_PMH DublinCore)

Every OAI-PMH compliant repository can output records in several formats. To find out which formats your repository supports, use ?verb=ListMetadataFormats (refer to the OAI-PMH verbs for a deeper explanation). For instance, for my repository:

http://zaguan.unizar.es/oai2d?verb=ListMetadataFormats

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
 <responseDate>2009-05-27T05:47:01Z</responseDate>
 <request verb="ListMetadataFormats">http://zaguan.unizar.es/oai2d/</request>
 <listMetadataFormats>
   <metadataFormat>
    <metadataPrefix>oai_dc</metadataPrefix>
    <schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>
    <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
   </metadataFormat>
   <metadataFormat>
    <metadataPrefix>marcxml</metadataPrefix>
    <schema>http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd</schema>
    <metadataNamespace>http://www.loc.gov/MARC21/slim</metadataNamespace>
   </metadataFormat>
 </listMetadataFormats>
</OAI-PMH>

But this output is *NOT* CDS Invenio’s default. To change it, edit [PATH_TO_CDS_INVENIO]/cds-invenio/lib/python/invenio/oai_repository.py. I changed just a few lines in order to update the OAI_DC format from 1.1 to 2.0:

if flag:
        out = out + "   <metadataFormat>\n"
        out = out + "    <metadataPrefix>oai_dc</metadataPrefix>\n"
        # out = out + "    <schema>http://www.openarchives.org/OAI/1.1/dc.xsd</schema>\n"
        # modified by miguel
        out = out + "    <schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>\n"
        # out = out + "    <metadataNamespace>http://purl.org/dc/elements/1.1/</metadataNamespace>\n"
        # modified by miguel
        out = out + "    <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>\n"
        out = out + "   </metadataFormat>\n"
        out = out + "   <metadataFormat>\n"
        out = out + "    <metadataPrefix>marcxml</metadataPrefix>\n"
        out = out + "    <schema>http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd</schema>\n"
        out = out + "    <metadataNamespace>http://www.loc.gov/MARC21/slim</metadataNamespace>\n"
        out = out + "   </metadataFormat>\n"
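
After a change like this, it is worth checking what the repository actually advertises. The following Python 3 sketch extracts the metadata prefixes and schemas from a ListMetadataFormats response. The function name is invented for the example, and the response text is assumed to have been fetched already.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def list_metadata_formats(response_xml):
    """Return (metadataPrefix, schema) pairs found in a ListMetadataFormats response."""
    root = ET.fromstring(response_xml)
    formats = []
    # iter() matches on namespace + element name, whatever the root element is
    for fmt in root.iter("{%s}metadataFormat" % OAI_NS):
        prefix = fmt.findtext("{%s}metadataPrefix" % OAI_NS)
        schema = fmt.findtext("{%s}schema" % OAI_NS)
        formats.append((prefix, schema))
    return formats


if __name__ == "__main__":
    response = (
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
        '<ListMetadataFormats><metadataFormat>'
        '<metadataPrefix>oai_dc</metadataPrefix>'
        '<schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>'
        '</metadataFormat></ListMetadataFormats></OAI-PMH>'
    )
    for prefix, schema in list_metadata_formats(response):
        print(prefix, schema)
```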

Now the repository says it is OAI/2.0/oai_dc compliant. So, let’s change a few things in the default oai_dc output. To do so, you must edit [PATH_TO_CDS_INVENIO]/cds-invenio/etc/bibformat/format_templates/OAI_DC.xsl. Be very careful when editing this file: if there is a typo or any other syntax mistake, the oai_dc output will STOP working. I recommend making a backup of the file before changing it. You can use an XML validator to catch such syntax errors.
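
A quick well-formedness check before deploying the template can be sketched in Python 3 as follows. This only verifies that the file parses as XML. It does not check that the stylesheet is valid XSLT or that it produces the output you expect. The function name is made up for this example.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, "OK") if the text parses as XML, else (False, error description)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        line, column = error.position
        return False, "line %d, column %d: %s" % (line, column, error)
    return True, "OK"


if __name__ == "__main__":
    template = (
        '<xsl:stylesheet version="1.0" '
        'xmlns:xsl="http://www.w3.org/1999/XSL/Transform">'
        '<xsl:template match="/"/>'
        '</xsl:stylesheet>'
    )
    print(check_well_formed(template))
    # Dropping the closing tag simulates a typical editing mistake
    print(check_well_formed(template.replace("</xsl:stylesheet>", "")))
```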

In my case I added the following features:

  • Added the dc:format tag: In our repository all files are PDFs, except for the ones in the FH collection. So the code to generate this would be:
                    <xsl:for-each select="datafield[@tag=980]">
                            <dc:format>
                                <xsl:choose>
                                   <xsl:when test="subfield[@code='a']='FH'">
                                       <xsl:text>image/x.djvu</xsl:text>
                                   </xsl:when>
                                   <xsl:otherwise>
                                       <xsl:text>application/pdf</xsl:text>
                                   </xsl:otherwise>
                                </xsl:choose>
                            </dc:format>
                    </xsl:for-each>

  • Added the dc:language tag: We store the full text’s language in two different ways. For some records we store it in MARC tag 008 (more precisely, the three characters beginning at position 36), and for the rest it is stored in MARC tag 041a. So the code is (if both the 041 and 008 tags are set, the value from 041 is taken):
    <!-- language: added by miguel -->
    <xsl:variable name="controlField008" select="controlfield[@tag=008]"/>
    <xsl:variable name="dataField041a" select="datafield[@tag=041]/subfield[@code='a']"/>
                    <dc:language>
                           <xsl:choose>
                               <xsl:when test="$dataField041a">
                                    <xsl:value-of select="$dataField041a"/>
                               </xsl:when>
                               <xsl:otherwise>
                                    <xsl:value-of select="substring($controlField008,36,3)"/>
                                    <!-- <xsl:text>eng</xsl:text> -->
                               </xsl:otherwise>
                           </xsl:choose>
                    </dc:language>
    <!-- end of language -->

  • Added the dc:type tag:
    <!-- type: kind of document. Added by miguel -->
    <!-- dc:type uses controlled vocabulary: http://dublincore.org/documents/dcmi-type-vocabulary/ -->
    <!-- in our repository everything is "text" (the recommendation for scanned images that contain text is
         to mark them as "text" and NOT "image") -->
                    <xsl:for-each select="datafield[@tag=980]">
                          <dc:type>
                               <xsl:text>text</xsl:text>
                          </dc:type>
                    </xsl:for-each>
    <!-- end of type -->


    Want a deeper understanding of dc:type? You should take a look at this manual.
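
For readers less familiar with XSLT, the three rules above can be restated in Python 3. This is only a sketch for checking the logic against sample records, not something Invenio runs. The function name is made up, and the record is assumed to be a <record> element without an XML namespace.

```python
import xml.etree.ElementTree as ET


def dublin_core_fields(record_xml):
    """Apply the dc:format, dc:language and dc:type rules to one MARCXML record."""
    record = ET.fromstring(record_xml)
    formats, types = [], []
    # One dc:format and one dc:type per 980 datafield, as in the xsl:for-each loops
    for datafield in record.findall("datafield[@tag='980']"):
        codes_a = [(s.text or "") for s in datafield.findall("subfield[@code='a']")]
        formats.append("image/x.djvu" if "FH" in codes_a else "application/pdf")
        types.append("text")
    # dc:language: prefer 041a; otherwise take 3 characters of 008 starting at position 36
    language_041a = record.findtext("datafield[@tag='041']/subfield[@code='a']")
    if language_041a is not None:
        language = language_041a
    else:
        field_008 = record.findtext("controlfield[@tag='008']", default="")
        # XSLT substring($s, 36, 3) is 1-based, so it maps to the slice [35:38]
        language = field_008[35:38]
    return {"format": formats, "language": language, "type": types}


if __name__ == "__main__":
    sample = (
        '<record>'
        '<controlfield tag="008">%s</controlfield>'
        '<datafield tag="980" ind1=" " ind2=" "><subfield code="a">FH</subfield></datafield>'
        '</record>' % ("0" * 35 + "spa" + "  ")
    )
    print(dublin_core_fields(sample))
```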

After making these changes you MUST run

sudo -u apache inveniocfg --update-all; /sbin/service httpd restart

for the changes to take effect.

You can download the full OAI_DC.xsl here

Just one more thing: you can download several XSL stylesheets for converting MARCXML to other formats from http://www.loc.gov/standards/marcxml/

References:
Especially helpful for working on this (for XSL beginners) is MARC21slim2OAIDC.xsl

You should also check http://purl.org/dc/elements/1.1/