[SciPy-user] scipy.io.read_array: NaN in data file

Pierre GM pgmdevlist at gmail.com
Wed Mar 11 12:26:46 EDT 2009


Dharhas,
To find duplicates, you can use the following functions (from SVN r2111),
with a short usage sketch after each one. find_duplicated_dates gives you
a dictionary mapping each duplicated date to its indices; you can then use
those values to decide what you want to do. remove_duplicated_dates strips
the series down to keep only the first occurrence of each duplicated date.




import numpy as np
from scikits.timeseries import time_series


def find_duplicated_dates(series):
    """
    Return a dictionary {duplicated date: indices} for the input series.

    The indices are given as a tuple of ndarrays, a la :meth:`nonzero`.

    Parameters
    ----------
    series : TimeSeries, DateArray
        A valid :class:`TimeSeries` or :class:`DateArray` object.

    Examples
    --------
    >>> series = time_series(np.arange(10),
    ...                      dates=[2000, 2001, 2002, 2003, 2003,
    ...                             2003, 2004, 2005, 2005, 2006],
    ...                      freq='A')
    >>> find_duplicated_dates(series)
    {<A-DEC : 2003>: (array([3, 4, 5]),), <A-DEC : 2005>: (array([7, 8]),)}
    """
    # Work on the dates, whether we got a TimeSeries or a bare DateArray.
    dates = getattr(series, '_dates', series)
    # get_steps() gives the gap between consecutive dates: a gap of 0
    # flags a date that is repeated.
    steps = dates.get_steps()
    duplicated_dates = tuple(set(dates[steps == 0]))
    indices = {}
    for d in duplicated_dates:
        indices[d] = (dates == d).nonzero()
    return indices
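
For instance - a minimal sketch, reusing the series from the docstring
example above - you can use the returned dictionary to look at the
conflicting values before deciding what to do with them:

series = time_series(np.arange(10),
                     dates=[2000, 2001, 2002, 2003, 2003,
                            2003, 2004, 2005, 2005, 2006],
                     freq='A')
for date, idx in find_duplicated_dates(series).items():
    # idx is a tuple of ndarrays, a la nonzero(), so it can be used
    # directly to index the series
    print("%s -> %s" % (date, series[idx]))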



def remove_duplicated_dates(series):
    """
    Remove the entries of `series` corresponding to duplicated dates.

    The series is first sorted in chronological order.
    Only the first occurrence of a date is then kept; the others are
    discarded.

    Parameters
    ----------
    series : TimeSeries
        Time series to process.
    """
    dates = getattr(series, '_dates', series)
    if not dates.is_chronological():
        series = series.copy()
        series.sort_chronologically()
        dates = series._dates
    # Compute the steps after sorting, so the mask lines up with the
    # (possibly reordered) series.  Prepend a 1 so the first entry is
    # always kept; an entry whose step is 0 repeats the previous date.
    steps = np.concatenate(([1,], dates.get_steps()))
    return series[steps.nonzero()]
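
And remove_duplicated_dates in action on the same series (same
assumptions as the sketch above):

clean = remove_duplicated_dates(series)
# clean keeps entries 0-3, 6, 7 and 9 of the original, i.e. only the
# first occurrences of the duplicated dates 2003 and 2005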





On Mar 11, 2009, at 9:13 AM, Dharhas Pothina wrote:

>
> In this particular case we know the cause:
>
> It is either :
>
> a) Overlapping files have been appended, i.e. file1 contains data from
> Jan1 to Feb1 and file2 contains data from Jan1 to March1. The
> overlap region has identical data.
>
> b) The data comes from sequential deployments and there is a small
> overlap at the beginning of the second file, i.e. file1 has data from
> Jan1 to Feb1 and file2 contains data from Feb1 to March1. There may
> be a few overlapping data points. These are junk because the equipment
> was set up in the lab and took measurements in the air until it was
> swapped with the installed instrument in the water.
>
> In both these cases it is appropriate to take the first value. In
> the second case we really should be stripping the bad data before
> appending, but this is a work in progress. Right now we are
> developing a semi-automated QA/QC procedure to clean up data before
> posting it on the web. We presently use a mix of awk and shell
> scripts, but I'm trying to convert everything to Python to make it
> easier to use and more maintainable, to get nicer plots than gnuplot,
> and to develop a GUI application to help us do this.
>
> - dharhas
>
>>>> Timmie <timmichelsen at gmx-topmail.de> 3/11/2009 4:35 AM >>>
>> Well, because there's no standard way to do that: when you have
>> duplicated dates, should you take the first one? The last one? Take
>> some kind of average of the values?
> Sometimes there are inherent faults in the data set, so an
> automatic
> treatment may introduce further errors.
> It's only possible when these errors occur somewhat
> systematically.
>
>
>
>
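The "take the first value" policy you describe is also easy to express
with plain numpy, independently of scikits.timeseries. A hedged sketch,
where dates1/values1 and dates2/values2 stand for hypothetical arrays
read from the two overlapping files (a last-value or averaging policy
would only change the final reduction step):

import numpy as np

# Hypothetical data read from two overlapping files:
dates1 = np.array([1, 2, 3, 4])           # file1: Jan1 .. Feb1
values1 = np.array([10., 11., 12., 13.])
dates2 = np.array([4, 5, 6])              # file2: overlaps file1 at date 4
values2 = np.array([99., 14., 15.])       # 99. is the junk overlap point

dates = np.concatenate((dates1, dates2))
values = np.concatenate((values1, values2))

# A stable sort keeps file1's point ahead of file2's at a shared date;
# np.unique(..., return_index=True) then picks the first occurrence of
# each unique date.
order = np.argsort(dates, kind='mergesort')
dates, values = dates[order], values[order]
uniq_dates, first = np.unique(dates, return_index=True)
dates, values = uniq_dates, values[first]
# the shared date now carries 13., the first (file1) value
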
> _______________________________________________
> SciPy-user mailing list
> SciPy-user at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>



