Tuesday, October 11, 2011

What tools help you manipulate metadata?

What tools have you found helpful in manipulating your metadata?  (For example, people have suggested MARCedit, ImageMagick, and Oxygen.)  

 Are there barriers to using them effectively, such as cost, scripting skills, or technical support?

Let's pool our information about what works and identify potential hurdles to be overcome.


  1. the XC Metadata Services Toolkit looks like it has great potential for an inventorying/transforming tool (Jennifer Bowen has demoed it at ALA sessions)

  2. I like using MARCedit, Oxygen, Notepad, Kernow, and anything that works even if it is a spreadsheet. I think the barriers are cost in terms of licenses, getting technical support (especially if it's an open source product like Kernow) or just not getting the type of support you need for your particular project. Another barrier is skills, either scripting or programming.

  3. MARCedit and Oxygen are the big ones for me, and while it takes a while to read documentation and get conversant with those tools, they don't require a lot of technical knowledge. (A bit more of a learning curve to write custom Visual Basic scripts to work with MARCEdit.) I recently had luck using Excel to edit a batch of 008 fields in thousands of MARC records - it has some regex-like functions and in this case it was good for seeing patterns and verifying successful strategies. Its usefulness, however, was contingent on the system librarian being able to take a comma-separated file and use it to replace the tags in the ILS.

    As to cost - Oxygen seemed like a great bargain and still does (for academic users!), but when we got it, the fact that there'd be an annual upgrade cost wasn't exactly clear - so it's really more like a subscription. Compared to the "free/open source" alternatives, it was a necessity back then - and I assume, still is?
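For readers who would rather script the kind of batch 008 edit described in comment 3, here is a minimal Python sketch. It is not the commenter's Excel workflow: the record ids, the sample 008 values, and the choice of the place-of-publication code (character positions 15-17) as the thing to fix are all assumptions made for illustration. It patches a fixed position with a regular expression and writes only the changed rows to a comma-separated file that could be handed to a systems librarian.

```python
import csv
import io
import re


def make_008(place):
    """Build a 40-character sample 008 field (hypothetical test data)."""
    # 00-05 date entered, 06 date type, 07-10 date 1, 11-14 date 2,
    # 15-17 place, 18-34 material-specific, 35-37 language, 38, 39.
    return "110101" + "s" + "2011" + "    " + place.ljust(3) + " " * 17 + "eng" + " " + "d"


def fix_place(field_008, old, new):
    """Replace the place code at positions 15-17 if it equals `old`."""
    if len(field_008) != 40 or len(old) != 3 or len(new) != 3:
        raise ValueError("008 must be 40 characters; place codes must be 3")
    pattern = re.compile(r"^(.{15})" + re.escape(old) + r"(.{22})$")
    return pattern.sub(lambda m: m.group(1) + new + m.group(2), field_008)


def changes_to_csv(rows, old, new):
    """Return CSV text listing only the records whose 008 changed."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["record_id", "old_008", "new_008"])
    for record_id, field in rows:
        fixed = fix_place(field, old, new)
        if fixed != field:
            writer.writerow([record_id, field, fixed])
    return out.getvalue()


# Hypothetical batch: one record is already correct, one has an unknown place.
SAMPLE = [("rec001", make_008("nyu")), ("rec002", make_008("xx"))]
REPORT = changes_to_csv(SAMPLE, "xx ", "nyu")
```

Keeping the old and new values side by side in the CSV makes it easy to eyeball the pattern before anything is loaded back into the ILS.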

  4. I use oXygen more than anything else, but also use a few MarcEdit features (mostly transforms and z39.50) and Excel (and convert the Excel spreadsheets to XML with oXygen).

    There is really not much of a learning curve with basic MarcEdit tools or Excel. oXygen is intimidating at first, and I'm sure I don't fully utilize its functionality, but it gets better with familiarity. oXygen is cheap, MarcEdit is free, and everyone should have some Excel-type program, so it's not the programs that are expensive. For me, the expense is time and the XSLT learning curve.

    I'm not a programmer and am essentially self-taught in XSLT and still learning. A few LIS courses that included database design and programming with SQL and PHP have proved very valuable for XSLT scripting because of the theory and logic they taught. Having some practical knowledge of HTML has also been helpful. And I have been very fortunate to work in libraries that have patiently given me the time and opportunity to learn.

    I find half the fun in what I do is thinking outside the box to find ways to get over or around the hurdles.

  5. This may be tangential to the intent of your question, but just in case it is useful... OCLC Research uses a variety of things in a variety of languages on our 32-node compute cluster. But we are likely to mostly move to a Hadoop, Pig and Hive based set of tools, with likely some User Defined Functions particular to the kinds of things we want to do. This is because much of our processing runs against the entirety of WorldCat, which these days is well north of 200 million records.

  6. I use oXygen to a certain extent, but certainly not as much as I used to. We use XForms to edit EAD finding aids (with EADitor) and are developing XForms applications to edit numismatic metadata (see Numishare), but the American Numismatic Society collection database is still in Filemaker and exported to XML when records are saved. We hope to migrate this to the Numishare/XForms codebase eventually. Eventually all of our XML metadata standards, no matter how complex, will be edited with XForms applications. No more manual editing of anything...

  7. I'm using a mixture of stuff, including the standard fare of Excel and Notepad++ (which allows me to do regex stuff), MarcEdit (wonderful tool), as well as some PHP + XSLT, and a touch of Perl.
    I'm looking more and more to use Google Refine, which allows you to create new columns based on results from webservices, (which is cool), and continues on from GridWorks I think. Probably overkill, but I can definitely see some uses.

  8. Google Refine is great for spreadsheet-like data -- I mostly work with DSpace metadata csv files and it's really good at splitting multi-value columns, doing consolidation of almost-identical values etc.
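The two operations mentioned in comment 8 (splitting multi-value columns and consolidating almost-identical values) can be sketched in plain Python. This is not Google Refine's code: the `||` separator is assumed to be the multi-value delimiter in the CSV, and the `fingerprint` function is a simplified key-collision approach (lowercase, strip punctuation, sort unique tokens) written only to illustrate the idea.

```python
import re
from collections import Counter, defaultdict


def split_values(cell, sep="||"):
    """Split one multi-value CSV cell into trimmed, non-empty values."""
    return [v.strip() for v in cell.split(sep) if v.strip()]


def fingerprint(value):
    """Key that ignores case, punctuation, and word order."""
    cleaned = re.sub(r"[^\w\s]", "", value.strip().lower())
    return " ".join(sorted(set(cleaned.split())))


def consolidate(values):
    """Map each variant to the most frequent spelling in its cluster."""
    clusters = defaultdict(Counter)
    for value in values:
        clusters[fingerprint(value)][value] += 1
    mapping = {}
    for counter in clusters.values():
        preferred = counter.most_common(1)[0][0]
        for variant in counter:
            mapping[variant] = preferred
    return mapping


# Hypothetical subject values taken from several rows of a metadata CSV.
CELLS = ["Maps || Local history", "maps.||Maps", "History, Local || Maps"]
VALUES = [v for cell in CELLS for v in split_values(cell)]
MAPPING = consolidate(VALUES)
```

A human should still review each cluster before applying the mapping, since values that share a fingerprint are not always the same heading.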

  9. I use MARCEdit, Oxygen and have dabbled with Google Refine. Google Refine is good at cleaning data and also reconciling URIs from vocabularies like LCSH, LCNAF and GeoNames.

  10. MarcEdit is a good open source tool for manipulating MARC metadata, but MARC Report/MARC Global are better quality/easier to use/better supported - at a small cost for what you get (in my opinion). Details are available from www.marcofquality.com.
    I use these extensively for global edits, to derive management information and for transformation of MARC ISO2709 to XML formats (principally DC and RDF).

    Another brilliant tool is EditPad Pro - a powerful text editor (http://www.editpadpro.com/).

  11. Notepad++ for straight up editing tool. Useful advanced features and elegant interface. Spent a little time with Oxygen and couldn't really determine an advantage in using it over my existing workflow.

    Macro Express Pro + Excel csv files = magic.
    Use ME Pro for batch editing across a series of metadata records. Have custom macros for: global find and replace across folder of xml; insert/edit/delete sections of xml based on content of elements; crosswalk MARCXML into selected data elements of other standards; create csv look-up tables and leverage existing csv data dictionaries to autopopulate portions of xml.

    Macro Express Pro is a bit cumbersome but is a good match for my skill set. I have a feeling many people are doing similar things with XSLT, it's just not something I've made the time to learn yet. Would love to attend a workshop on XSLT for Librarians.

    This is a valuable thread. Already learned a lot. Should make for a valuable workshop.
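One of the macros described in comment 11, using a CSV look-up table to autopopulate parts of an XML record, might look like this in Python. It is a rough sketch only and not how Macro Express Pro works: the `language` element, the `label` attribute, and the `code,label` column names are all hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical look-up table and record; real data would come from files.
LOOKUP_CSV = "code,label\neng,English\nfre,French\n"
RECORD_XML = "<record><language>eng</language><language>zzz</language></record>"


def load_lookup(csv_text):
    """Read a code -> label dictionary from CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["code"]: row["label"] for row in reader}


def autopopulate(xml_text, lookup):
    """Add a label attribute to each language element with a known code."""
    root = ET.fromstring(xml_text)
    for element in root.iter("language"):
        label = lookup.get((element.text or "").strip())
        # Leave unknown codes and existing labels untouched.
        if label and element.get("label") is None:
            element.set("label", label)
    return ET.tostring(root, encoding="unicode")


RESULT = autopopulate(RECORD_XML, load_lookup(LOOKUP_CSV))
```

Running the same function over every file in a folder would give the batch behaviour the comment describes; codes missing from the table are simply skipped so they can be reviewed by hand.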

  12. This comment has been removed by the author.

  13. At Yale, we are using a new in-house tool that we call LadyBird. It is only partly a metadata creation and editing tool; its other part is a router of digital objects and metadata: it makes bags (using Bagit) and ships those bags to various repositories and ships just the metadata to various indexes.

    It is in production in Yale's Manuscripts and Archives Dept., the Beinecke Library, and the Visual Resources Collection. It is not yet ready to do production work with complex digital objects such as ebooks, eserials, digitized archival collections as collections.

    We think of LadyBird as being at version 0.3. The problem we are trying to solve by developing LadyBird may or may not be common to others. We have many digitization projects in many units and several repositories/interfaces. We want LadyBird to be a common tool and connection between these many production units and many repositories/interfaces.

    Submitted by
    Matthew Beacom

  14. I think MARCEdit is a great tool, but I wish it worked better on my Mac. It tends to crash with larger data sets, and some of the plug-ins don't work. :-(

    We use some in-house-built tools based on perl libraries and MARC4J. If you have some java skills, MARC4J is a great library to work with.

    Thanks to whoever recommended MarcReport; I hadn't seen this before. I'm always surprised by how few tools there are for us to view and edit MARC records. They've been around long enough.

  15. Thank you to Suzanne for posting my earlier comment about LadyBird. LadyBird began because a programmer and a metadata specialist talked to each other. Here's a description of the LadyBird case.

    There are multiple library units and digital projects at Yale that are creating digital assets, now primarily simple images of objects such as a slide, a photograph, a mss page, a book plate, a map, a poster, etc. Additionally, in these units and projects complex digital objects are being created such as representations of books, serials, a video archive, an audio archive as well as the collecting of born digital assets. Without centralized digital collections support, each unit with various configurations of IT support has created its own tools, processes, and policies to manage their digital assets, preserve them, and present them on the Web. The units vary widely: for example, from a Visual Resources Collection that digitizes mostly reproductions of images of visual culture from patron request, to the Beinecke Rare Book and Manuscript Library that digitizes archival objects and pages from rare books based on patron request, to grant funded projects involving the digitization of hundreds of audiotapes involving staffing across multiple units.

    -- Matthew Beacom

  16. Here is a statement about the problem that LadyBird tries to solve.

    The Problem:

    Although a digital asset management system (DAMS) was instituted at the University level and the Library is a participant, the DAMS is only one repository/interface at Yale, and it does not support the library's full range of metadata needs. Additionally, the Library's many projects each need the tools, personnel, and processes to route their digital objects to the DAMS. Content owners in the Library do not always have control over when their digital collections are updated on the Web. Content owners who hire student workers do not have robust enough tools to support different levels of permissions to the digital objects and their metadata records. In summary, multiple sources of digitized materials, multiple tools used, multiple workflows, multiple staffing configurations, multiple metadata schemes, and multiple repositories/end-user interfaces result in a Gordian Knot of complexity that stifles understanding, production throughput, development, usability, and, in a word, success.

    --Matthew Beacom

  17. One last bit on LadyBird.

    The Solution:

    Cut the Gordian Knot by building a single, universal tool that supports detailed user permissions and the adding, editing, and deleting of digitized assets with the creation of metadata records that are to be included in the DAMS and other systems, such as a preservation repository. The tool also facilitates the smooth transition from creating the metadata record for the digital asset to web presentation. That's LadyBird. LadyBird gives Yale a simplified workflow from multiple sources of digitized materials to multiple repositories/interfaces by offering one tool to link those sources and repositories/interfaces.

    The LadyBird facts:

    * Programmed in Microsoft .NET/C#, versions 2.0 and 4.0
    * Microsoft SQL Server 2008 R2 for database
    * 517,098 digital objects currently in the system

    Matthew Beacom

  18. At KU Medical Center, I use:
    regex in Aptana Studio (Eclipse). Regex, Nokogiri, and XSLT with Ruby. MarcEdit. Google Spreadsheets, Excel. As someone else mentioned, I would use MarcEdit more if I could get it to work well on a Mac.

    I'd like to use Google Refine and XC Toolkit in the future once I find time to learn them.


  19. I am trying to collect info on embedding metadata in document headers, and specifically in batch embedding from a spreadsheet to thousands of documents at a time.

    I'm especially interested in whether anyone has had success with ImageMagick tie-ins, because that looks like it has a good community around using it.

    Meanwhile, I've used Oxygen XML Editor before. It was good for setting up XML schemas, and then it did a little "steering" or guiding when I wrote XML to conform to those schemas or to conform to published online schemas, like Dublin Core or MODS. It's conceptually hard for me to picture it used on a daily basis for metadata creation or manipulation in a library setting. The interface is closer to black letters on white, and probably someone using it would have to not have a mental barrier to tech. I liked the interface. I never tried to do any batch or automated manipulations in it, but I'm probably going to experiment and check out Oxygen XML Developer (surely this is what was recommended to you, since Editor is very basic).

  20. With our moving to Ex Libris Alma in the coming months, the combination of Primo/Alma (both being in the cloud), along with working fully in a Unix environment, will provide us with a level of XML integration we could only dream of a couple of years ago. Using Macs as our primary front-end OS, the ability to process and move data seamlessly between systems will give us a good deal of processing leverage. Within the Mac environment, we use Aperture, Automator, Applescript, Oxygen, Coda, and other high-end tools for data and digital object manipulation.

  21. Two weeks ago in the UK the 'Discovery' initiative (http://discovery.ac.uk) held a small workshop on tools and technologies for interacting with bibliographic data, and I've just published a blog post that tries to capture some of the range of tools discussed: http://blog.discovery.ac.uk/2011/10/19/emerging-bibliographic-tools-and-technologies/

    I'm also in the process of writing a more formal 'guide' that captures how each technology was used by a range of projects funded by the Discovery initiative.

  22. MarcEdit, XSL, Excel, an in-house-designed web-based SQL metadata tool, even regular expression search/replace in Word. Anything that works. I often use different tools in succession on the same file. For any particular desired result, different tools may be best suited for different steps in the process. Broader experience (lots of tools) is often more useful than expertise in a single tool.

  23. I have worked on many metadata projects at smaller institutions which did not have the budget to purchase special metadata creation/clean-up tools nor the IT support to install open source options. I have found the combination of Notepad++ and Excel (or OpenOffice Calc) to be very effective. The ability to record macros gives librarians without major coding skills or access to the server the ability to easily manipulate metadata in bulk.

  24. Take a look at Recollection from LC/NDIIPP + Zepheira for some kinds of manipulation of relatively small sets of MODS or Excel data: http://recollection.zepheira.com/

    "A free platform for generating and customizing views (interactive maps, timelines, facets, tag clouds) that allow users to experience your digital collections."

    You can leave your munged data and views up for the world, export your munged data back for local use, and if desired delete your data from Recollection.

  25. A follow-up on the eXtensible Catalog (XC) Metadata Services Toolkit (MST), which someone already mentioned: The MST provides an open-source platform for managing and processing large batches of metadata. MST services can clean up (normalize) data in batches of records, transform metadata from one schema to another, FRBRize metadata by transforming it to the XC Schema, and aggregate multiple records that represent the same resource (we’re developing this last one as we speak). Most existing MST services work with MARC (MARCXML) data and one with Dublin Core, but services can be easily developed for other schemas as well, using the MST’s Service Writer’s Toolkit. We really hope that other libraries will begin to develop and share their own services for the MST.

    Dave Lindahl is planning to attend the session at DLF, as well as CurateCamp, so he will be able to answer any questions about this software.

    Jennifer Bowen
    University of Rochester

  26. This comment has been removed by a blog administrator.

  27. As part of open government initiatives, the National Archives has begun to share applications developed in-house on GitHub, a social coding platform. GitHub is a service used by software developers, including many open source projects, to share and collaborate on software development.

    NARA currently has two applications on GitHub, the "File Analyzer and Metadata Harvester" and the "Video Frame Analyzer."

    View the NARAtions blog post for more information and links to the GitHub repository: http://blogs.archives.gov/online-public-access/?p=6270

  28. MARC::Record (Perl module) is excellent to use. Also popular: MARC4J (Java), RubyMARC and pyMARC.

    These open source libraries should be available on GitHub or cpan.org.

    Barrier: programming language ability


  29. I wasn't sure if these tools made it on to your list?



    Also, here's another example of a project (in Australia) where a team did a tool review (in 2008). It's possibly useful for context, and maybe for someone to try to update it.