Discussion:
[DIWG] netCDF I/O performance concerns
Moroni, David F (398M)
2014-09-26 22:26:43 UTC
Dear NetCDF Group,

Please refer to the discussion below as to the nature of the netCDF performance concerns identified by Bill Rossow, who has been CC'd here.

Please address any questions you have as to the nature of this issue with Bill directly.

Respectfully,
David

==================================================
David Moroni
Ocean Wind and Scatterometry Data Engineer
Physical Oceanography Distributed Active Archive Center
Jet Propulsion Laboratory
4800 Oak Grove Dr
M/S 158-242
Pasadena, CA 91109
Phone: 818.354.2038
Fax: 818.353.2718
==================================================

From: David F Moroni <***@jpl.nasa.gov>
Date: Friday, September 26, 2014 3:20 PM
To: William Rossow <***@gmail.com>
Cc: "esdswg-***@lists.nasa.gov" <esdswg-***@lists.nasa.gov>
Subject: Re: [DIWG] netCDF I/O performance concerns

Hi Bill,

To better diagnose your issue, it's good to know your baseline technical approach.

For starters, it seems like you are working with large volume data files, where as you put it "data size precludes the second" option of storing all of the data into an array. NetCDF allows generous flexibility to precisely read in specific data variables as well as their specific array elements (think of the concept of subsetting) without having to read in the entire file or the entire variable array into memory. You can go even further with this and use OPeNDAP servers, where you can remotely download and read into memory the very specific bits of data that are needed for processing without having to download the entire netCDF file.
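To make the subsetting idea concrete, here is a minimal sketch that uses a plain flat binary file as a stand-in. It is not the netCDF API: the names `read_block` and `write_demo_file`, the row-major layout, and the little-endian 32-bit float type are all assumptions chosen for illustration. A netCDF library does the equivalent bookkeeping when a read is given a start index and a count for each dimension.

```python
import os
import struct
import tempfile


def write_demo_file(shape):
    """Write a row-major 2-D array of float32 values 0, 1, 2, ... to a temp file."""
    nrows, ncols = shape
    values = [float(i) for i in range(nrows * ncols)]
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        f.write(struct.pack("<%df" % len(values), *values))
    return path


def read_block(path, shape, row, col_start, col_count, fmt="f"):
    """Read col_count values from a single row without touching the rest of the file."""
    nrows, ncols = shape
    if not (0 <= row < nrows and 0 <= col_start and col_start + col_count <= ncols):
        raise IndexError("requested block is outside the array")
    itemsize = struct.calcsize("<" + fmt)
    # Row-major layout: element (row, col) sits at index row * ncols + col.
    offset = (row * ncols + col_start) * itemsize
    with open(path, "rb") as f:
        f.seek(offset)                      # jump straight to the first wanted value
        raw = f.read(col_count * itemsize)  # read only the bytes that are needed
    return list(struct.unpack("<%d%s" % (col_count, fmt), raw))


if __name__ == "__main__":
    demo_path = write_demo_file((4, 5))
    print(read_block(demo_path, (4, 5), row=2, col_start=1, col_count=3))
    os.remove(demo_path)
```

Only `col_count` values are ever held in memory, so the cost of a read depends on the size of the requested block rather than on the size of the file.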

I've worked with both flat binary files and netCDF files simultaneously with the various types of data processing code, and aside from memory consumption issues which can be addressed by either hardware or software configurations, I've never observed significant differences in processing speed based upon the source of the data.

This is why I'm intrigued by the issue you've observed and think it warrants further investigation into the technical side of your approach.

Cheers,
David

From: William Rossow <***@gmail.com>
Date: Friday, September 26, 2014 3:03 PM
To: David F Moroni <***@jpl.nasa.gov>
Subject: Re: netCDF I/O performance concerns

David, All too technical... the basic point is that both of the approaches you mention are impractical... data size precludes the second and performance really drags with the first.

On Fri, Sep 26, 2014 at 6:00 PM, Moroni, David F (398M) <***@jpl.nasa.gov> wrote:
Bill,

So to understand more clearly, are you reading the data from the netCDF file iteratively using "for" or "do" loops, or are you first storing the netCDF data into either a static or dynamic array?

The first option above would conserve memory but would take longer to process due to more CPU cycles being required; the latter option would be much faster (i.e., should be more comparable to reading directly from a flat binary file) but would require more system memory.
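The two options are the ends of a range: reading in fixed-size blocks sits in between, keeping memory bounded while issuing far fewer read calls than an element-by-element loop. The sketch below illustrates this under stated assumptions: it uses an in-memory byte stream as a stand-in for a data file, not a netCDF library, and `CountingReader`, `sum_per_element`, `sum_per_block`, and the block size of 1024 values are hypothetical names and numbers chosen for the example. The per-call overhead of a real I/O library will differ.

```python
import io
import struct

ITEM = struct.calcsize("<f")  # size in bytes of one little-endian float32


class CountingReader:
    """Wrap a binary stream and count how many read() calls are issued."""

    def __init__(self, stream):
        self.stream = stream
        self.calls = 0

    def read(self, nbytes):
        self.calls += 1
        return self.stream.read(nbytes)


def sum_per_element(reader, count):
    """One read call per value: minimal memory, maximal call overhead."""
    total = 0.0
    for _ in range(count):
        (value,) = struct.unpack("<f", reader.read(ITEM))
        total += value
    return total


def sum_per_block(reader, count, block=1024):
    """One read call per block: at most `block` values are held at a time."""
    total = 0.0
    remaining = count
    while remaining > 0:
        n = min(block, remaining)
        total += sum(struct.unpack("<%df" % n, reader.read(n * ITEM)))
        remaining -= n
    return total


def demo(count=4096, block=1024):
    """Return (sum, read calls) for the per-element and per-block strategies."""
    payload = struct.pack("<%df" % count, *([1.0] * count))
    per_element = CountingReader(io.BytesIO(payload))
    per_block = CountingReader(io.BytesIO(payload))
    return (
        (sum_per_element(per_element, count), per_element.calls),
        (sum_per_block(per_block, count, block), per_block.calls),
    )


if __name__ == "__main__":
    print(demo())
```

Both strategies produce the same sum; with the defaults, the per-element loop issues 4096 read calls and the block loop issues 4.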

Also, when you encountered these performance issues, were you using netCDF data following the "classic" (i.e., version 3 or below) or the "extended" (i.e., netCDF-4 following the hierarchical HDF-5 data model)?

There are distinct differences between these two netCDF data models. The netCDF "classic" model is essentially a flat binary file with a self-describing header that carries the metadata. The netCDF "extended" model is hierarchical and can use groups to organize the data arrays. I haven't examined this myself, but there could be some performance differences between the "extended" and "classic" data models due to the simplicity of a flat data structure versus a multi-tiered one.

Cheers,
David


From: William Rossow <***@gmail.com>
Date: Friday, September 26, 2014 2:45 PM
To: David F Moroni <***@jpl.nasa.gov>
Cc: "esdswg-***@lists.nasa.gov" <esdswg-***@lists.nasa.gov>
Subject: Re: netCDF I/O performance concerns

David, This issue is not a strict computer performance or software interaction issue; it is an issue that arises for performing calculations involving many co-located, coincident variables that are extensive (large space-time scope). It is just that all of these "graphics" formats are arranged badly in this case.

On Fri, Sep 26, 2014 at 5:08 PM, Moroni, David F (398M) <***@jpl.nasa.gov> wrote:
Hi Bill,

In response to the issue you raised with netCDF I/O performance with your Fortran code, I've already contacted Ethan Davis at Unidata to see if this has already been captured as a known issue and if there is a fix for this.

In the meantime, I thought it would be worthwhile to connect you with the ESDSWG Interoperability working group to ensure this matter is also on their radar screen. There may be a member within this group who has wrestled with the same issue you've encountered, so my hope is that a solution may already exist, but if not at least it could potentially be on the horizon.

Best Regards,
David


--
Dr. William B. Rossow
Distinguished Professor of Remote Sensing
CREST at The City College of New York
Steinman Hall (T-107)
140th Street and Convent Avenue
New York, NY 10031
1-212-650-5389
***@ccny.cuny.edu



Roy Mendelssohn
2014-09-28 15:56:59 UTC
Given that there is essentially no information being given about the file (such as a header dump - ncdump -hk filename - so as to get both the header info and the filetype) and no information on how he is trying to read the data, it is almost impossible to say what is going on.
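When ncdump is not at hand, the file type can be roughly guessed from the first bytes of the file. The following is a minimal sketch, not a replacement for ncdump: the function name `netcdf_kind` and the labels it returns are made up for this example, and it only inspects the start of the file. Classic-family files begin with the characters "CDF" followed by a version byte, while netCDF-4 files are HDF5 files and begin with the HDF5 signature. An HDF5 signature alone does not prove the file was written through the netCDF-4 library.

```python
import os
import tempfile

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

# Version byte that follows "CDF" in classic-family files.
CDF_VERSIONS = {1: "classic", 2: "64-bit offset", 5: "64-bit data (CDF-5)"}


def netcdf_kind(path):
    """Guess the on-disk format from the leading bytes of the file."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(HDF5_SIGNATURE):
        return "HDF5-based (possibly netCDF-4)"
    if len(head) >= 4 and head[:3] == b"CDF":
        version = head[3]
        return CDF_VERSIONS.get(version, "unknown CDF version %d" % version)
    return "not recognized"


def write_temp(content):
    """Helper for the demo: write bytes to a temporary file and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(content)
    return path


if __name__ == "__main__":
    for sample in (b"CDF\x01" + b"\x00" * 4, HDF5_SIGNATURE, b"plain text"):
        sample_path = write_temp(sample)
        print(netcdf_kind(sample_path))
        os.remove(sample_path)
```

This only answers the file-type half of the question; the header dump is still needed to see the dimensions, variables, and attributes.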

-Roy
**********************
"The contents of this message do not reflect any position of the U.S. Government or NOAA."
**********************
Roy Mendelssohn
Supervisory Operations Research Analyst
NOAA/NMFS
Environmental Research Division
Southwest Fisheries Science Center
***Note new address and phone***
110 Shaffer Road
Santa Cruz, CA 95060
Phone: (831)-420-3666
Fax: (831) 420-3980
e-mail: ***@noaa.gov www: http://www.pfeg.noaa.gov/

"Old age and treachery will overcome youth and skill."
"From those who have been given much, much will be expected"
"the arc of the moral universe is long, but it bends toward justice" -MLK Jr.