Chemstation Concepts Explained: File Structure (Note: This only applies to revs of chemstation after B.02.01(?) all the way up to current revs of Openlab Chemstation.) I thought I would make a series of blog posts centered around different Chemstation topics.
- Hi Malcolm, To read ChemStation.d format you first have to convert them to MassHunter.d format with Agilent's own conversion tool. Your Agilent contact should be able to tell you where to get it. We don't have an API to read ChemStation.d.
- Read in CSV files from openchrom (assuming we have a batch). The_csv_files_from_openchrom files(pattern = '.csv$'). The_chrom_data convert all cols to numeric. AsNumeric d) modifyList(d,.
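The batch-read idea in the last snippet (collect every CSV export in a folder, then coerce all columns to numbers) can be sketched as follows. This is a Python sketch rather than the R the snippet appears to come from, and the function names and the choice to store non-numeric cells as `None` are assumptions made for illustration.

```python
import csv
from pathlib import Path


def read_numeric_csv(path):
    """Read one CSV file and convert every cell to float where possible."""
    with open(path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    converted = []
    for row in rows:
        numeric_row = {}
        for name, value in row.items():
            try:
                numeric_row[name] = float(value)
            except (TypeError, ValueError):
                numeric_row[name] = None  # assumption: mark non-numeric cells as missing
        converted.append(numeric_row)
    return converted


def read_batch(folder):
    """Read every *.csv file in `folder`, keyed by file name."""
    return {p.name: read_numeric_csv(p) for p in sorted(Path(folder).glob("*.csv"))}
```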
Convert Chemstation D Files Online
After raw data have been captured by mass spectrometers in biological LC-MS/MS experiments, they must be converted from vendor-specific binary files to open-format files for manipulation by most software. This protocol details the use of ProteoWizard software for this conversion, taking format features, coding options, and vendor particularities into account. This protocol will aid researchers in preparing their data for analysis by database search engines and other bioinformatics tools.
Tandem mass spectrometry data sets are captured to binary files or databases by the software controlling the instruments. The ProteoWizard library and tools are designed to extract data from these proprietary formats for export in community standard formats or for direct access via its API (Kessner et al., 2008). It is distinctive for its support of instruments from many instrument vendors (Chambers et al., 2012).
The three protocols included here are intended to complement each other. Basic Protocol 1 is intended for first-time users of ProteoWizard who feel most comfortable with graphical user interfaces, while Alternate Protocol 1 will assist researchers who are comfortable in a command-line environment. Both protocols address the process of conversion from instrument vendor-specific raw data formats (Table 1) to mzML (Martens et al., 2011), mzXML (Pedrioli et al., 2004), mz5 (Wilhelm et al., 2012), or MGF (http://www.matrixscience.com/help/data_file_help.html) format. Basic Protocol 2 is intended to assist researchers who need to convey their data to search engines requiring simpler text formats.
Table 1. Data formats supported by msConvert
|Source||Format|
|AB SCIEX||WIFF; T2D (with DataExplorer)|
|Agilent||MassHunter (.d directories)|
|Bruker||FID; .d directories; XMASS XML|
|ISB Seattle Proteome Center||mzXML|
|Steen & Steen Laboratory||mz5|
BASIC PROTOCOL 1: TRANSCODING MS DATA FROM RAW FORMAT VIA MSCONVERT GUI
LC-MS/MS instruments collect data into binary files that are not easily read or manipulated. Most instrument vendors, however, have provided software libraries that allow access to data within these files on Microsoft Windows operating systems. ProteoWizard includes the MSConvert GUI tool to enable easy translation of these raw data to a variety of formats. Most users will opt to perform some level of data filtering at this step, such as reporting peak lists rather than peak profiles in the resulting files.
Researchers will need to install ProteoWizard on their computers prior to attempting this protocol, which was developed using version 3.0.5471. The software is available from http://proteowizard.sourceforge.net/. Conversion from native vendor formats requires the use of Microsoft Windows. The GUI version of msConvert demonstrated here can only be run under Microsoft Windows. While many of the software libraries for data access employ 32-bit code, one may deploy ProteoWizard on 64-bit versions of Microsoft operating systems. If converting from one open format to another (e.g. mzXML to mzML) one may use the command line version described below on Microsoft Windows, Linux or Macintosh platforms.
From the ProteoWizard folder, execute MSConvertGUI. The graphical user interface should appear (see Figure 1).
Figure 1. The main MSConvert graphical user interface
In the top-left corner click the Browse button to bring up a source selection dialog. This may take a few seconds to load depending on the size of the initial directory.
Using the left panel of the dialog (see Figure 2), navigate to the directory containing the source (raw files or directories) you wish to convert. Currently msConvert supports the formats listed in Table 1. The software automatically scans the selected directory for any files or folders which can be read as sources. It is possible to show only sources of a certain type by filtering from the drop down menu at the bottom of the Open Data Source dialog. Select multiple files as you would in Windows Explorer by holding down the Ctrl key while clicking, holding down shift while clicking the ends of a range of files, or simply dragging a box over the files you want to select.
Figure 2. The data source selection dialog
Once all sources have been selected click “Open.” If only one source was selected it will appear in the text box beside the browse button and must be added to the list by clicking “Add” right underneath. If multiple sources were selected they will be added to the list automatically.
Once at least one source has been added to the list the “Output Directory” box will automatically be filled in if it had been left blank. If output files are to be separate from the original source location, click the browse button and navigate to the desired output folder.
Select the output format for the converted files. By default files will be written using the format’s standard extension, however the utilized extension can be changed using the text box next to the format selection box. This will not change the contents of the output file but will simply change the extension used. This feature can be useful for keeping track of filters used in multiple conversions of the same set of sources within the same output folder.
MSConvertGUI is capable of using a number of different compression methods when converting files. The default options should be sufficient for normal use, however if a specific type of compression or writing method is required it can be enabled using the checkboxes located below the format selection box.
The GUI for MSConvert allows users to apply a subset of available filters while converting files (see Table 2). These can be accessed by selecting the desired filter from the drop down box in the filter area. Once selected a number of options will appear. Fill in the options as desired and click the button below the filter area labeled “Add” to add it to the list. Filters can be removed from the list simply by selecting the unwanted filter and clicking “Remove.” For example, to employ peak picking of all MS levels, one can select “Peak Picking” from the drop down box and then click “Add.”
Table 2. Data filters available in msConvert. The following data filters enable users to control which data appear in translated data files and how they are recorded there. Filters marked with a star (*) are available in the GUI. All other filters are available through the command line version of MSConvert (described in Alternate Protocol 1 below). To view a full list of options and filters for the command line version of MSConvert, load the program with the argument “--help”.
|Filter||Description|
|index <index_value_set>||Selects spectra by index, where the index is the 0-based numerical order in which the spectrum appears in the input.|
|*msLevel <mslevels>||Selects only spectra with the indicated MS levels.|
|*chargeState <charge_states>||Keeps spectra that match the listed charge states. Use 0 to include spectra with no charge state at all.|
|precursorRecalculation||Recalculates the precursor m/z and charge for MS2 spectra based on the MS1 data. Only works on Orbitrap and FT data.|
|precursorRefine||Recalculates the precursor m/z and charge for MS2 spectra based on the MS1 data. Works on Orbitrap, FT, and TOF data.|
|*peakPicking <prefer_vendor>, <ms_levels>||Performs centroiding on spectra with the selected MS levels. If used with other filters this must be set first.|
|*scanNumber <scan_numbers>||Selects spectra by scan number.|
|scanEvent <scan_event_set>||Selects spectra by scan event.|
|*scanTime <scan_time_range>||Selects spectra within a given time range.|
|sortByScanTime||Reorders spectra, sorting them by ascending scan start time.|
|stripIT||Rejects ion trap data spectra with MS level 1.|
|metadataFixer||Adds or replaces a spectrum's TIC/BPI metadata, usually after peak picking, where the change from profile to centroided data may make the TIC and BPI values inconsistent with the revised scan data.|
|titleMaker <format_string>||Adds or replaces spectrum titles according to the format string given.|
|*threshold <type>, <threshold>, <orientation>, [<mslevels>]||Keeps data whose values meet various threshold criteria. Details for the different criteria can be found in the GUI tooltips.|
|*mzWindow <mzrange>||Keeps m/z-intensity pairs whose m/z values fall within the specified range.|
|mzPrecursors <precursor_mz_list>||Retains spectra with specific precursor m/z values.|
|defaultArrayLength <peak_count_range>||Keeps only spectra with peak counts within a given range.|
|*zeroSamples <mode>, [<MS_levels>]||Deals with zero values in spectra, either removing them or adding them where they are missing.|
|mzPresent <tolerance>, <type>, <threshold>, <orientation>, <mz_list>, [<include_or_exclude>]||Keeps data whose values meet various threshold criteria. Contains a wider array of options than the “threshold” filter.|
|MS2Denoise [<peaks_in_window>, [<window_width_Da>, [multicharge_fragment_relaxation]]]||Noise peak removal for spectra with precursor ions.|
|MS2Deisotope [<hi_res>, [<mz_tolerance>]]||Deisotopes MS level 2 spectra using the Markey method.|
|*ETDFilter [<removePrecursor>, [<removeChargeReduced>, [<removeNeutralLoss>, [<blanketRemoval>, [<matchingTolerance>]]]]]||Filters ETD MSn spectrum data points, removing unreacted precursors, charge-reduced precursors, and neutral losses.|
|chargeStatePredictor [<overrideExistingCharge>, [<maxMultipleCharge>, [<minMultipleCharge>, [<singleChargeFractionTIC>, [<algorithmMakeMS2>]]]]]||Predicts MSn spectrum precursors to be singly or multiply charged depending on the ratio of intensity above and below the precursor m/z.|
|*activation <precursor_activation_type>||Keeps only spectra whose precursors have the specified activation type.|
|analyzer <analyzer>||Keeps only spectra with the indicated mass analyzer type.|
|polarity <polarity>||Keeps only spectra with scans of the selected polarity.|
Once all options have been filled click “Start.” A new window will appear with a list of files to be converted and current progress. Details on the conversion of the currently selected file will be shown in the text box at the bottom of the new menu. Once all files have finished converting it is safe to close the progress window and use the resulting files.
ALTERNATE PROTOCOL 1: TRANSCODING MS DATA FROM RAW FORMAT VIA MSCONVERT
Command line tools may be more convenient for researchers who want to streamline raw file management by creating batch scripts. This alternative protocol illustrates how to accomplish this with the fully-featured “msConvert” executable included in ProteoWizard.
The same necessities found in Basic Protocol 1 apply to this protocol, as well. Notably, some attempts to run command line ProteoWizard conversion under WINE (http://www.winehq.org) emulation have been successful, enabling users to convert vendor files within Linux environments.
Locate the path to the ProteoWizard executable software. This can either be added to the “PATH” variable for command line operations, or one can simply write the explicit path to the msConvert binary, e.g. “C:\Program Files (x86)\ProteoWizard\ProteoWizard 3.0.5471\msconvert.exe”.
For each filter to be included, add text structured like the following: --filter “peakPicking true 1-”. The first piece signals that a new data filter is being added. The section in double quotes supplies the name of the filter (taken from Table 2) and the configuration options for it.
Specify the output format options, such as specifying the overall format by one of the following flags: “--mzML”, “--mzXML”, “--mz5”, “--mgf”. If subsequent software allows it, specify “zip” encoding for data within the files through use of the “-z” option. The full list of these options can be found by running the msConvert executable without any parameters.
When a complex set of parameters needs to be applied to a conversion, enumerate the options in a text file, invoking those options with the -c option, as in “msconvert data.RAW -c config.txt”, where config.txt contains options like this:
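A configuration file of this kind might be generated as in the sketch below. The `name=value` line syntax (for example `mzML=true`, `zlib=true`, `filter="..."`) is an assumption based on the example shown in msConvert's own usage text, and the helper name `write_config` is hypothetical; check the output of the “--help” argument for your installed version before relying on it.

```python
from pathlib import Path


def write_config(path, output_format="mzML", zlib=True, filters=()):
    """Write an msconvert-style config file: one long option name per line.

    Assumption: each line mirrors a long command-line option without the
    leading dashes, as in the example printed by msconvert's usage text.
    """
    lines = [f"{output_format}=true"]
    if zlib:
        lines.append("zlib=true")
    for spec in filters:
        # Each filter is written exactly as it would appear after --filter.
        lines.append(f'filter="{spec}"')
    Path(path).write_text("\n".join(lines) + "\n")
    return lines
```

Calling `write_config("config.txt", filters=["peakPicking true 1-"])` would produce a file that can then be passed as in the command quoted above, “msconvert data.RAW -c config.txt”.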
Specify which files are to be converted. Both absolute and relative paths are permitted, and wildcard characters such as ‘*’ and ‘?’ can be used to process multiple files in a single pass.
Combine these elements into a single command line, such as the following: "C:\Program Files (x86)\ProteoWizard\ProteoWizard 3.0.5471\msconvert.exe" --filter "peakPicking true 1-" --mz5 -z *.raw
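For batch scripting, the steps of this protocol can be assembled programmatically. The sketch below only builds the argument list; the install path is an assumption about where ProteoWizard 3.0.5471 lives on a given machine, and `build_command` is a hypothetical helper, not part of ProteoWizard.

```python
import glob

# Assumption: default install location for the version used in this protocol.
MSCONVERT = r"C:\Program Files (x86)\ProteoWizard\ProteoWizard 3.0.5471\msconvert.exe"


def build_command(inputs, output_format="mz5", zlib=True,
                  filters=("peakPicking true 1-",), exe=MSCONVERT):
    """Return the msconvert argument list for the given inputs and options."""
    cmd = [exe]
    for spec in filters:
        # Each filter becomes its own --filter argument; peakPicking should
        # come first when combined with other filters (see Table 2).
        cmd += ["--filter", spec]
    cmd.append(f"--{output_format}")  # one of mzML, mzXML, mz5, mgf
    if zlib:
        cmd.append("-z")
    cmd += list(inputs)
    return cmd


# Wildcards are expanded here rather than left to the shell.
command = build_command(sorted(glob.glob("*.raw")))
# To run it: import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list avoids quoting problems with the spaces in the install path and in the filter specification.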
BASIC PROTOCOL 2: CONVERTING mzML DATA TO SIMPLE TEXT FORMATS FOR SEARCH ENGINES
Once data have been translated to an open format, an additional step may be necessary to prepare data for handling by some database search engines. In particular, older versions of Sequest (Eng et al., 1994) and of Spectrum Mill (http://www.chem.agilent.com/en-US/products-services/Software-Informatics/Spectrum-Mill/Pages/default.aspx) may need MS/MS scans to be presented as individual text files in DTA or PKL format. This protocol details the process for this conversion step.
This protocol depends upon mzXML2Search, a tool available as part of the Trans-Proteomic Pipeline (http://sourceforge.net/projects/sashimi/). The step employing mzXML2Search may be executed under Windows or Linux, as the XML-based formats do not require the instrument vendor libraries. Despite the name of the software, mzXML2Search may be applied to both mzML files and mzXML files.
Determine whether PKL, MGF, or DTA files are required by the search engine to be employed. Correspondingly, replace “[option]” in the command line below by “-pkl”, “-mgf”, or “-dta”. Note: MGF format can be exported directly by msConvert via Basic Protocol 1.
Users of DTA and PKL file formats should keep in mind that exporting each spectrum as a separate text file can take up considerable space on a file system. Newer branches of classic database search algorithms (such as Comet, from the University of Washington (Eng et al., 2013)) can work directly with mzML format or concatenated files rather than requiring the creation of thousands of text files.
Next, determine which source format is available; the software is applicable to mzML or mzXML. In the command line below, “[format]” should be replaced by “*.mzML” or “*.mzXML”.
Execute mzXML2Search with this command line: “mzXML2Search [option] [format]”.
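The substitution of “[option]” and “[format]” described in this protocol can be sketched as a small command builder. Only the three flags and two input formats named above are accepted; the helper name is hypothetical, and the sketch assumes mzXML2Search is on the PATH.

```python
from pathlib import Path

# Flags named in this protocol for the three supported text formats.
FORMAT_FLAGS = {"pkl": "-pkl", "mgf": "-mgf", "dta": "-dta"}
INPUT_SUFFIXES = {".mzml", ".mzxml"}


def mzxml2search_command(output_format, inputs):
    """Build the mzXML2Search argument list, validating both substitutions."""
    key = output_format.lower()
    if key not in FORMAT_FLAGS:
        raise ValueError(f"unsupported output format: {output_format}")
    for name in inputs:
        # Works for explicit file names and for patterns such as *.mzML.
        if Path(name).suffix.lower() not in INPUT_SUFFIXES:
            raise ValueError(f"not an mzML or mzXML source: {name}")
    return ["mzXML2Search", FORMAT_FLAGS[key], *inputs]
```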
GUIDELINES FOR UNDERSTANDING RESULTS
In most cases, file transcoding is a relatively rapid step. Frequently these conversions can be conducted in less than one minute for each LC-MS/MS experiment. The run time of conversion, however, may be lengthened by some of the available filters, particularly calculation-heavy options.
In some cases, raw data may have become garbled. This is generally true of “mis-injections”, which produce raw file stubs. Garbles may also occur as a file is transmitted from computer to computer. For some types of errors, msConvert is able to return an error message that indicates the nature of the fault. When converting a large set of raw data, it is generally advisable to determine whether or not the output files are equal in number to the input files.
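The check suggested above, comparing the number of outputs with the number of inputs, can go one step further and report which sources have no converted counterpart. This is a minimal sketch; the suffixes and the assumption that msConvert names each output after its source are illustrative defaults.

```python
from pathlib import Path


def missing_outputs(input_dir, output_dir, in_suffix=".raw", out_suffix=".mzML"):
    """Return the base names of sources that have no converted file."""
    inputs = {p.stem for p in Path(input_dir).iterdir()
              if p.suffix.lower() == in_suffix.lower()}
    outputs = {p.stem for p in Path(output_dir).iterdir()
               if p.suffix.lower() == out_suffix.lower()}
    # Assumption: outputs keep the source base name and only change extension.
    return sorted(inputs - outputs)
```

An empty list means every source produced an output; any names returned are candidates for mis-injections or transfer damage and are worth re-running individually to see the error message.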
In most cases, one can expect that the files generated by conversion will be larger than the raw data used as sources. In the case of XML-based formats, this is generally due to the explicit labeling of each element in the file; where a vendor-format binary file would store a double-precision floating point number for a peptide m/z value, for example, the XML file might store the number plus a label for what that number represents. The use of zip compression and efficient database formats can greatly reduce expected file sizes.
Best Practices by Instrument Vendor
The ProteoWizard project has attempted to support all biological mass spectrometer vendors equally, but the completeness of implementation for each vendor can vary. Support for Thermo and Agilent instruments is quite comprehensive. Bruker support is quite robust, but producing peak profiles rather than peak lists is currently problematic for some instrument models. AB SCIEX produces two challenges; the software library does not allow msConvert to infer precursor charges, and only the basic peak centroider is available for peak-listing. Researchers may instead opt to use the free AB SCIEX MS Data Converter (see Internet Resources), which can address both these challenges. Waters instruments have posed long-standing challenges for data access. At present, the data access layer implemented in ProteoWizard cannot offer access to Waters peak-listing, precursor charge state inference, or spectral summation functions. A project nearing completion in the Tabb Laboratory should soon address these challenges by the implementation of open-source alternatives to this functionality in ProteoWizard. Many users may find that export to database search engines is feasible through the use of the Waters Protein Lynx Global Server.
Metadata Content among File Formats
Though files in open formats are often required for the vast majority of downstream operations, it is important to archive vendor data files. Typically, vendor files contain extensive metadata about instrument operating parameters. For example, Thermo RAW files contain several hundred parameters (voltages, pressures, flow rates, etc.) that are not typically extracted. Notably, the ProteoWizard programs ThermoRawMetaDump and msAccess are able to export these parameters. These metadata may be used in conjunction with the mass spectra themselves to evaluate instrument performance more comprehensively.
Importantly, the mzML data standard was designed to fully hold nearly any piece of metadata. However, the data access APIs used by msConvert are not able to access all of it for all of the vendor files. Consequently, mzML is not a complete transformation of a vendor file. Beyond mzML, formats described in Basic Protocol 2 contain almost no metadata and should not be considered archival formats.
Options for File Size Reduction
As noted above, conversion from a vendor format to an open format may lead to an increase in file size. ProteoWizard provides several filters which may be used to reduce file size. These options may alter your data, thus impacting downstream processing, and should be used carefully. The most common size-reduction technique employed is centroiding (peak-picking). High resolution MS data often sample each peak multiple times. Consequently, a given peak will be represented by several data points. This sampling provides detailed information about peak shape and can be useful for de-convolving overlapping peaks. However, these data can consume considerable space. Centroiding operations attempt to determine the center of a peak and to then compress the data by including only a single measurement for each peak, instead of several data points. Centroiding should be performed carefully as it may inadvertently shift the inferred m/z of a peak. In addition, it may impact quantification. As discussed above, the “peakPicking” option of msConvert gives you the option of using either vendor algorithms for this purpose (when available) or the ProteoWizard internal function. Centroiding is a lossy operation.
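To make the idea of centroiding concrete, the sketch below collapses each run of consecutive non-zero profile points into a single point at the intensity-weighted mean m/z. It is a deliberately naive illustration, not the vendor algorithms or the ProteoWizard internal function, and it does not separate overlapping peaks.

```python
def _summarize(mz_run, intensity_run):
    """Intensity-weighted mean m/z and apex intensity of one profile peak."""
    total = sum(intensity_run)
    center = sum(m * i for m, i in zip(mz_run, intensity_run)) / total
    return (center, max(intensity_run))


def centroid(mz, intensity):
    """Replace each run of non-zero profile points with a single centroid."""
    peaks = []
    mz_run, intensity_run = [], []
    for m, i in zip(mz, intensity):
        if i > 0:
            mz_run.append(m)
            intensity_run.append(i)
        elif intensity_run:
            # A zero ends the current peak: emit one point for the whole run.
            peaks.append(_summarize(mz_run, intensity_run))
            mz_run, intensity_run = [], []
    if intensity_run:
        peaks.append(_summarize(mz_run, intensity_run))
    return peaks
```

Five profile points describing one symmetric peak become one output point, which is where the space saving comes from; an asymmetric or overlapping peak would be assigned a shifted center, which is the risk noted above.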
In addition to centroiding, other space savings may come from writing files using 32-bit numbers and zlib compression for m/z and intensity values. Though most software can handle zlib compressed data, some software packages lack this ability. Consequently, one should test his or her pipeline with zlib compressed data before batch converting a large number of files. These two operations are not lossy.
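The effect of these two options on stored size can be estimated with a short sketch. It assumes the encoding used for binary arrays in the XML-based formats: little-endian floating point numbers, optionally zlib-compressed, then base64-encoded. The function name is hypothetical and the numbers ignore the surrounding XML labels.

```python
import base64
import struct
import zlib


def encoded_size(values, precision=64, compress=False):
    """Bytes needed to store `values` as a base64-encoded binary array."""
    type_code = "d" if precision == 64 else "f"  # 8-byte or 4-byte floats
    raw = struct.pack(f"<{len(values)}{type_code}", *values)
    if compress:
        raw = zlib.compress(raw)
    return len(base64.b64encode(raw))


mz_values = [500.0 + 0.001 * k for k in range(1000)]
size_64 = encoded_size(mz_values, precision=64)  # 8000 raw bytes before base64
size_32 = encoded_size(mz_values, precision=32)  # 4000 raw bytes before base64
```

How much zlib saves on top of this depends on the data; arrays with many repeated values, such as intensity arrays with long stretches of zeros, compress far better than dense m/z arrays.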
To use these options on the command line, one would execute
msConvert also supports the Levander lab’s msNumPress Library, which can be executed using command line flags such as --numpressAll. Compression with numpress is lossy.
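The size-reduction switches discussed in this section can be collected by a small helper before being appended to an msConvert command line. The flag “--numpressAll” and zlib compression (“-z”) come from the text; the long form “--zlib” and the “--32” precision flag are assumptions taken from msConvert's “--help” output and should be confirmed against your installed version.

```python
def size_reduction_flags(use_32bit=True, use_zlib=True, use_numpress=False):
    """Return the msconvert flags for the chosen size-reduction options."""
    flags = []
    if use_32bit:
        flags.append("--32")          # assumption: 32-bit binary encoding flag
    if use_zlib:
        flags.append("--zlib")        # long form of -z
    if use_numpress:
        flags.append("--numpressAll")  # lossy, see above
    return flags


# Example: 32-bit values with zlib compression, written as mzML.
command = ["msconvert", "--mzML", *size_reduction_flags(), "data.RAW"]
```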
Additional options for altering your data to decrease file size include the --filter “zeroSamples removeExtra” option. When <mode> is “removeExtra”, consecutive zero intensity peaks are removed from spectra. For example, a peak list “100.1,1000 100.2,0 100.3,0 100.4,0 100.5,0 100.6,1030” would become “100.1,1000 100.2,0 100.5,0 100.6,1030” and a peak list “100.1,0 100.2,0 100.3,0 100.4,0 100.5,0 100.6,1030 100.7,0 100.8,1020 100.9,0 101.0,0” would become “100.5,0 100.6,1030 100.7,0 100.8,1020 100.9,0.” This particular data processing is relatively innocuous for most applications. More aggressive filter options include denoising and thresholding. These operations explicitly discard data and should be used cautiously.
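The “removeExtra” behavior in the two examples above amounts to keeping a zero-intensity point only when it sits directly next to a non-zero point. The sketch below reproduces that rule as inferred from the examples; it is not ProteoWizard's implementation.

```python
def remove_extra_zeros(points):
    """Drop zero-intensity points that are not adjacent to a non-zero point.

    `points` is a list of (mz, intensity) pairs in m/z order.
    """
    kept = []
    last = len(points) - 1
    for k, (mz, intensity) in enumerate(points):
        if intensity != 0:
            kept.append((mz, intensity))
            continue
        previous_is_signal = k > 0 and points[k - 1][1] != 0
        next_is_signal = k < last and points[k + 1][1] != 0
        if previous_is_signal or next_is_signal:
            # Flanking zeros are kept so the peak boundaries stay visible.
            kept.append((mz, intensity))
    return kept
```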
While this protocol addresses the use of msConvert and its GUI counterpart for file format handling, incorporating ProteoWizard directly in software as a reader of mass spectrometry formats is very powerful. Software that adopts this strategy will gain the ability to read any format that ProteoWizard can, thus limiting the need to convert formats to those cases where operating system requirements (such as processing on a Linux cluster) make it necessary. Simply employing msConvert and msConvertGUI as described in this protocol, however, greatly reduces the challenges of handling mass spectrometry data.
D.L.T. was supported by U01 CA152647, while J.D.H. was supported by U24 CA159988. P.M. was supported by CCNE-T U54 CA151459, PSOC-MCSTART U54 CA143907, and by the Canary Foundation.
ProteoWizard is available from its website: http://proteowizard.sourceforge.net/
AB SCIEX provides the AB SCIEX MS Data Converter:
Chambers, M. C., Maclean, B., Burke, R., Amodei, D., Ruderman, D. L., Neumann, S., Gatto, L., Fischer, B., Pratt, B., Egertson, J., et al. 2012. A cross-platform toolkit for mass spectrometry and proteomics. Nature biotechnology 30:918–920.
Jerry D. Holman, Department of Biomedical Informatics, Vanderbilt University School of Medicine, 465 21st Ave S, Mail Stop 8575, Nashville, TN 37232-8575, 615-936-2476 voice, 615-343-8372 fax.
David L. Tabb, Department of Biomedical Informatics, Vanderbilt University School of Medicine, 465 21st Ave S, Mail Stop 8575, Nashville, TN 37232-8575, 615-936-0380 voice, 615-343-8372 fax.
Parag Mallick, Department of Radiology, Stanford School of Medicine, 3155 Porter Drive, Room 2240, Palo Alto, CA 94304, 650-724-0923 voice, 650-721-3298 fax.
- Chambers MC, Maclean B, Burke R, Amodei D, Ruderman DL, Neumann S, Gatto L, Fischer B, Pratt B, Egertson J, et al. A cross-platform toolkit for mass spectrometry and proteomics. Nature biotechnology. 2012;30:918–920.
- Eng JK, Jahan TA, Hoopmann MR. Comet: an open-source MS/MS sequence database search tool. Proteomics. 2013;13:22–24.
- Eng JK, McCormack AL, Yates JR. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry. 1994;5:976–989.
- Kessner D, Chambers M, Burke R, Agus D, Mallick P. ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics (Oxford, England). 2008;24:2534–2536.
- Martens L, Chambers M, Sturm M, Kessner D, Levander F, Shofstahl J, Tang WH, Römpp A, Neumann S, Pizarro AD, et al. mzML--a community standard for mass spectrometry data. Molecular & cellular proteomics: MCP. 2011;10:R110000133.
- Pedrioli PGA, Eng JK, Hubley R, Vogelzang M, Deutsch EW, Raught B, Pratt B, Nilsson E, Angeletti RH, Apweiler R, et al. A common open representation of mass spectrometry data and its application to proteomics research. Nature biotechnology. 2004;22:1459–1466.
- Wilhelm M, Kirchner M, Steen JAJ, Steen H. mz5: space- and time-efficient storage of mass spectrometry data sets. Molecular & cellular proteomics: MCP. 2012;11:O111011379.
I have faced situations where I have had to export data automatically and routinely from Chemstation software. My first question is: which version of Chemstation are you running? There is a difference between MSD Chemstation and Chemstation (i.e. for HPLC etc.). Also, I have found it to be much more efficient to use Chemstation's built-in capabilities to export data and then manage the data in other ways. I have a setup where every time my GCMS runs, Chemstation runs a series of macros that basically export all the data about the run into a MySQL DB, and execute a bunch of housecleaning stuff. Then it is easy to manipulate the data.
In MSD Chemstation's data analysis program you can create a custom report which allows you to generate a CRD file which can be opened in Excel, or in my case I use an XLS to CSV converter and parse out all of the ____ in the CRD file. At this point, one can simply create an Excel sheet that uses the CSV file as its data source. This can be formatted any way Excel is capable of. This will work, but it is not the best way. Better to import the CSV file into a database. Now you have the power to manipulate the data in any direction imaginable.
Anyway, I might be able to help with more information about your system. I use both types of Chemstation. I am not an expert in the macro language. I have found documentation lacking on it, and I have no time to take a class. I have had some success reverse engineering the existing macros (Chemstation seems to run entirely on macros). Really though, I like to get the data away from Chemstation as fast as possible. It's just old technology that is only good for interfacing with the instruments. IMHO
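The "import the CSV file into a database" step can be sketched as below. SQLite from the Python standard library stands in for the MySQL database mentioned in the post, and the table name is hypothetical; the column names are taken from whatever header row the exported CSV carries.

```python
import csv
import sqlite3


def load_csv(conn, csv_path, table="runs"):
    """Load a CSV export into a table, creating it from the header row."""
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        # Assumption: header names contain no double quotes.
        columns = ", ".join(f'"{name}"' for name in header)
        placeholders = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()
```

Values are stored as text here, so numeric queries need a cast, for example `SELECT * FROM runs ORDER BY CAST(area AS REAL) DESC` for a CSV that has an `area` column.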