Given some lines of data that look like

IP147.006.000.012.047.1704111352
IP147.006.000.033.000.1713322001

We need to extract the last 10 digits (i.e. everything after the final . in the string) and make sure it is a “valid date” using Python.

A “valid” date should be of the format

yy[year]mm[month]dd[day]hh[hour]mm[mins]

This means line #2 is invalid because 13 is not a valid month.

Our example lines contain “fields” or “columns” that are “delimited” by the . character, meaning we could split() on . and take the last item.

split() returns a list.

>>> line = 'IP147.006.000.033.000.1705172001'
>>> line.split('.')
['IP147', '006', '000', '033', '000', '1705172001']
>>> line.split('.')[-1]
'1705172001'

We could also use “string” slicing to get the last 10 characters.

>>> line[-10:]
'1705172001'
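A third option, sketched here as an aside rather than something the rest of this post relies on, is str.rpartition(), which splits once at the last . instead of building the whole list. The helper name last_field is made up for illustration.

```python
def last_field(line, sep='.'):
    """Return everything after the final separator (hypothetical helper)."""
    # rpartition() splits at the LAST occurrence of sep and returns
    # a 3-tuple: (text before, separator, text after).
    _head, _sep, tail = line.rpartition(sep)
    return tail


line = 'IP147.006.000.033.000.1705172001'
print(last_field(line))  # 1705172001
```

If the separator is missing, rpartition() puts the whole string in the last slot, so the helper returns the line unchanged. Slicing with [-10:] behaves differently: it always returns the last 10 characters whether or not a . precedes them.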

However it was then stated that each line should be validated to match an exact pattern of

  • IP147 followed by
  • 4 groups of 3 digits followed by
  • a 10 digit date

… which are all delimited by .

We could of course do such validation in “plain” Python using split() and str.isdigit()

def is_valid(line):
    parts = line.split('.')
    return len(parts) == 6 and parts[0] == 'IP147' and \
        len(parts[1]) == 3 and parts[1].isdigit() and \
        len(parts[2]) == 3 and parts[2].isdigit() and \
        len(parts[3]) == 3 and parts[3].isdigit() and \
        len(parts[4]) == 3 and parts[4].isdigit() and \
        len(parts[5]) == 10 and parts[5].isdigit()

lines = [
    'IP147.006.000.012.047.1704111352',
    'IP147.006.000.033.000.1713322001',
    'IP148.006.000.033.000.1712322001'
]

for line in lines:
    print(is_valid(line))

Which would output

True
True
False

#3 fails as it starts with IP148 instead of IP147
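The repeated length and isdigit() checks can also be folded into a loop. This is only a sketch of the same idea in a more compact form; the name is_valid_compact and the widths list are introduced here for illustration.

```python
def is_valid_compact(line):
    """Same checks as the plain-Python version, driven by a list of widths."""
    parts = line.split('.')
    if len(parts) != 6 or parts[0] != 'IP147':
        return False
    # Expected width of each numeric field: four 3-digit groups, then the date.
    widths = [3, 3, 3, 3, 10]
    return all(
        len(part) == width and part.isdigit()
        for part, width in zip(parts[1:], widths)
    )


for line in [
    'IP147.006.000.012.047.1704111352',
    'IP147.006.000.033.000.1713322001',
    'IP148.006.000.033.000.1712322001',
]:
    print(is_valid_compact(line))
```

Like the longer version, this only checks the shape of the line, so the second line still passes even though its date is not a real one.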

Regular Expressions

Another option is to use a regular expression, which Python supports through the re module.

>>> import re
>>>
>>> line = 'IP147.006.000.012.047.1704111352'
>>>
>>> re.search(r'^IP147(?:\.\d{3}){4}\.(\d{10})$', line)
<_sre.SRE_Match object at 0x7f17b8b58b70>
>>> re.search(r'^IP147(?:\.\d{3}){4}\.(\d{10})$', line).group(1)
'1704111352'

Okay.. so what on earth is this r'^IP147(?:\.\d{3}){4}\.(\d{10})$' monstrosity?!

Because a backslash is used in regex

  • to disable the special meaning of “special” regex characters
  • to denote special character classes e.g. \d to match a digit or \b to match a word boundary

… and the backslash is also used for escaping inside Python strings we use a raw string (created with r'') to store our pattern which allows the backslash to get passed directly through to the regex.
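A quick way to see the difference is to compare the length of a normal string and a raw string containing the same characters. The variable names below are just for illustration.

```python
# Without the r prefix, Python itself interprets some backslash sequences.
plain = '\b'   # a single backspace character
raw = r'\b'    # a backslash followed by the letter b

print(len(plain))  # 1
print(len(raw))    # 2

# A raw string is equivalent to doubling every backslash by hand.
doubled = '\\.\\d{3}'
print(r'\.\d{3}' == doubled)  # True
```

So r'\b' reaches the regex engine as a word boundary, whereas '\b' would reach it as a backspace character.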

Let’s break down the pattern.

  • ^ matches the start of the string
  • IP147 matches the string IP147
  • (?: starts a non-capturing group
  • \. matches a literal . character
  • \d matches a digit
  • {3} matches the previous atom (in this case \d) exactly 3 times
  • ) closes our non-capturing group
  • {4} matches the previous atom (in this case (?:\.\d{3})) exactly 4 times
  • \. matches a literal . character
  • ( starts a capturing group
  • \d{10} matches exactly 10 digits
  • ) closes the capturing group
  • $ matches the end of the string
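One subtlety with $ is worth a short sketch: in Python it also matches just before a newline at the very end of the string. That can matter if lines are read from a file and still carry their line ending. The example below uses re.fullmatch() as one possible stricter alternative; it is not part of the approach used in this post.

```python
import re

pattern = r'^IP147(?:\.\d{3}){4}\.(\d{10})$'
with_newline = 'IP147.006.000.012.047.1704111352\n'

# $ also matches just before a newline at the very end of the string,
# so a line that still carries its line ending is accepted by search().
found_by_search = re.search(pattern, with_newline)
print(bool(found_by_search))     # True

# fullmatch() requires the whole string to be consumed, newline included.
found_by_fullmatch = re.fullmatch(pattern, with_newline)
print(bool(found_by_fullmatch))  # False
```

Stripping the line ending first, for example with line.rstrip('\n'), is another way to avoid the question altogether.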

. has “special” meaning in regex as it matches “any” character (except a newline) which is why we must escape it using a backslash to match a literal .

\.\d{3} matches a . character followed by 3 digits and we want to match this sequence 4 times.

This is why we use (?:) to group \.\d{3} together as a single atom to have the {4} apply to it. As we only need to group and not capture, we’ve used a non-capturing group.

Creating a capture group (as we’ve done with (\d{10})) allows us to refer to it and access its contents later.

The call to .group(1) returns the content of the first capture group i.e. our 10 digit “date”.
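To see what the non-capturing group buys us, here is a small comparison of the pattern with (?:...) and with a plain (...) for the repeated part. The variable names are made up for this sketch.

```python
import re

line = 'IP147.006.000.012.047.1704111352'

# Non-capturing (?:...) for the repeated part: only the date is captured.
match = re.search(r'^IP147(?:\.\d{3}){4}\.(\d{10})$', line)
print(match.groups())  # ('1704111352',)

# Same pattern with a plain (...) group: the repeated group is captured too,
# holding only its last repetition, and the date moves to group 2.
match_capturing = re.search(r'^IP147(\.\d{3}){4}\.(\d{10})$', line)
print(match_capturing.groups())  # ('.047', '1704111352')
```

With the non-capturing version the date stays at .group(1), no matter how the rest of the pattern is grouped.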

strptime()

Now that we can test if a line matches the exact pattern needed whilst extracting the 10 digit date string we now need to make sure it is a “valid” date.

We could attempt to do it ourselves however a much simpler approach would be to use the datetime module that comes with Python.

The datetime.strptime(date_string, format) class method takes a date string and a format and attempts to create a datetime object from it.

>>> from datetime import datetime
>>>
>>> datetime.strptime('1705172001', '%y%m%d%H%M')
datetime.datetime(2017, 5, 17, 20, 1)
>>> datetime.strptime('1713172001', '%y%m%d%H%M')
Traceback (most recent call last):
ValueError: unconverted data remains: 01

When date_string doesn’t match our format, strptime() raises a ValueError exception, meaning we can call it inside a try statement to catch any potential error.

If an exception is raised then the code inside our except block will be executed.

>>> try:
...     datetime.strptime('1713172001', '%y%m%d%H%M')
... except ValueError:
...     print('Invalid date.')
... 
Invalid date.
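The try/except idea can be wrapped in a small helper that returns the parsed datetime, or None when parsing fails. This is a sketch; parse_date is a hypothetical name, and it catches only ValueError so that unrelated errors are not silently swallowed.

```python
from datetime import datetime


def parse_date(date_string):
    """Return a datetime for a yymmddHHMM string, or None if it is invalid."""
    try:
        return datetime.strptime(date_string, '%y%m%d%H%M')
    except ValueError:
        return None


print(parse_date('1705172001'))  # 2017-05-17 20:01:00
print(parse_date('1713172001'))  # None
```

The helper assumes it is given exactly 10 digits. strptime() can accept fields without zero padding (which is why the error above complains about leftover data rather than about month 13), so the length check is best left to the pattern match.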

Armed with this knowledge let’s add this to our validation code.

import re
from datetime import datetime

lines = [
    'IP147.006.000.012.047.1704111352',
    'IP147.006.000.033.000.1713322001',
    'IP148.006.000.033.000.1712322001'
]

pattern = r'^IP147(?:\.\d{3}){4}\.(\d{10})$'

for line in lines:
    match = re.search(pattern, line)
    if match:
        # The line has the right shape; now check the 10 digit date itself.
        date = match.group(1)
        try:
            datetime.strptime(date, '%y%m%d%H%M')
            print('True ', date)
        except ValueError:
            print(False, date)
    else:
        # The line did not match the pattern at all.
        print(False, line)

We first check whether we got a match, as re.search() returns None when the pattern doesn’t match.

None is a “falsy” value, meaning it evaluates to False in a boolean context such as an if statement.

If the line matches we extract the date which is contained in the first capture group and we attempt to strptime it.

If the strptime fails it is an invalid date and if the line did not match it’s also invalid.

Let’s run it and check the output.

$ python is-valid-date.py
True  1704111352
False 1713322001
False IP148.006.000.033.000.1712322001

The reason for using 'True ' instead of just True was to keep the output aligned.
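If the check needs to be reused rather than printed, the same logic could be wrapped in a function. The sketch below is one possible shape, not the only one; the names PATTERN and extract_date are invented here, and it returns the parsed datetime instead of printing True or False.

```python
import re
from datetime import datetime

# Same pattern as in the script, compiled once for reuse.
PATTERN = re.compile(r'^IP147(?:\.\d{3}){4}\.(\d{10})$')


def extract_date(line):
    """Return the datetime encoded in line, or None if the line is invalid."""
    match = PATTERN.search(line)
    if match is None:
        return None  # wrong overall shape
    try:
        return datetime.strptime(match.group(1), '%y%m%d%H%M')
    except ValueError:
        return None  # right shape, but not a real date


print(extract_date('IP147.006.000.012.047.1704111352'))  # 2017-04-11 13:52:00
print(extract_date('IP147.006.000.033.000.1713322001'))  # None
print(extract_date('IP148.006.000.033.000.1712322001'))  # None
```

Returning None for both failure cases keeps the caller simple; a caller that needs to tell a malformed line from a bad date would need separate return values or exceptions.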